linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] Ext3 latency fixes
@ 2009-04-03  7:01 Theodore Ts'o
  2009-04-03  7:01 ` [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks Theodore Ts'o
  2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
  0 siblings, 2 replies; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-03  7:01 UTC (permalink / raw)
  To: torvalds; +Cc: Linux Kernel Developers List, Ext4 Developers List

Hi Linus, 

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

I posted these patches a while back, and with your "overwrite.c" test
case, I decided to see how they did.  The results were spectacular
enough (see below) that I've decided to request that they be included
in 2.6.30.  I've posted the critical patches below for review before,
and Jan Kara has acked them, and there have been no complaints about
them.

I've also added two patches which add replace-via-truncate and
replace-via-rename workarounds to ext3's data=writeback mode.  They
only change the behavior in the (currently non-default) data=writeback
mode.

The benchmark which I used was Linus's overwrite.c as the background
workload, and my fsync-tester as the foreground tester.  The
fsync-tester writes a megabyte to a file and then times how long it
takes to fsync that file, and then sleeps a second before repeating.

Using an unpatched 2.6.29, fsync-tester shows the following times:

fsync time: 3.4732
fsync time: 2.4338
fsync time: 5.9496
fsync time: 6.2402
fsync time: 4.3375
fsync time: 6.3283
fsync time: 3.6930
fsync time: 3.1848
fsync time: 3.3231

The final report of overwrite.c is:

       1.984 GB written in 82.75 (24 MB/s)           

With these patches applied, the fsync-tester times are:

fsync time: 1.4538
fsync time: 1.6328
fsync time: 1.4632
fsync time: 1.4550
fsync time: 0.2932
fsync time: 1.6986
fsync time: 0.3787
fsync time: 1.3380
fsync time: 1.8145
fsync time: 0.4050
fsync time: 1.3880

... and the final report of overwrite.c is:

       1.984 GB written in 93.77 (21 MB/s)           

By having the fsync-related I/O fixed to be posted using WRITE_SYNC,
instead of WRITE, it prioritizes the fsync-related I/O so that it gets
done ahead of the streaming write.  This does slow down the background
write process, but it speeds up the worst-case fsync() latency from
6.2 seconds to 1.8 seconds.  (Measurements done on a 5400 rpm laptop
drive.)

All aside from the benchmark improvements, if the writes are coming
from fsync(), they really are synchronous operations, and they should
be marked that way just from a correctness point of view.

	       		 	       	    - Ted

Theodore Ts'o (4):
      block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
      ext3: Use WRITE_SYNC for commits which are caused by fsync()
      ext3: Add replace-on-truncate hueristics for data=writeback mode
      ext3: Add replace-on-rename hueristics for data=writeback mode

 fs/buffer.c             |    5 +++--
 fs/ext3/file.c          |    4 ++++
 fs/ext3/inode.c         |    3 +++
 fs/ext3/namei.c         |    6 +++++-
 fs/jbd/commit.c         |   23 +++++++++++++++--------
 fs/jbd/transaction.c    |    2 ++
 include/linux/ext3_fs.h |    1 +
 include/linux/jbd.h     |    5 +++++
 8 files changed, 38 insertions(+), 11 deletions(-)





^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
  2009-04-03  7:01 [GIT PULL] Ext3 latency fixes Theodore Ts'o
@ 2009-04-03  7:01 ` Theodore Ts'o
  2009-04-03  7:01   ` [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync() Theodore Ts'o
  2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-03  7:01 UTC (permalink / raw)
  To: torvalds
  Cc: Linux Kernel Developers List, Ext4 Developers List, Theodore Ts'o

When doing synchronous writes because wbc->sync_mode is set to
WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
don't unduly block system calls such as fsync().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..e7ebd95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1714,6 +1714,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	const unsigned blocksize = 1 << inode->i_blkbits;
 	int nr_underway = 0;
+	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
 
 	BUG_ON(!PageLocked(page));
 
@@ -1805,7 +1806,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh(WRITE, bh);
+			submit_bh(write_op, bh);
 			nr_underway++;
 		}
 		bh = next;
@@ -1859,7 +1860,7 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh(WRITE, bh);
+			submit_bh(write_op, bh);
 			nr_underway++;
 		}
 		bh = next;
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync()
  2009-04-03  7:01 ` [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks Theodore Ts'o
@ 2009-04-03  7:01   ` Theodore Ts'o
  2009-04-03  7:01     ` [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode Theodore Ts'o
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-03  7:01 UTC (permalink / raw)
  To: torvalds
  Cc: Linux Kernel Developers List, Ext4 Developers List, Theodore Ts'o

If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/jbd/commit.c      |   23 +++++++++++++++--------
 fs/jbd/transaction.c |    2 ++
 include/linux/jbd.h  |    5 +++++
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 3fbffb1..f8077b9 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -20,6 +20,7 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/bio.h>
 
 /*
  * Default IO end handler for temporary BJ_IO buffer_heads.
@@ -171,14 +172,15 @@ static int journal_write_commit_record(journal_t *journal,
 	return (ret == -EIO);
 }
 
-static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
+static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
+				   int write_op)
 {
 	int i;
 
 	for (i = 0; i < bufs; i++) {
 		wbuf[i]->b_end_io = end_buffer_write_sync;
 		/* We use-up our safety reference in submit_bh() */
-		submit_bh(WRITE, wbuf[i]);
+		submit_bh(write_op, wbuf[i]);
 	}
 }
 
@@ -186,7 +188,8 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
  *  Submit all the data buffers to disk
  */
 static int journal_submit_data_buffers(journal_t *journal,
-				transaction_t *commit_transaction)
+				       transaction_t *commit_transaction,
+				       int write_op)
 {
 	struct journal_head *jh;
 	struct buffer_head *bh;
@@ -225,7 +228,7 @@ write_out_data:
 				BUFFER_TRACE(bh, "needs blocking lock");
 				spin_unlock(&journal->j_list_lock);
 				/* Write out all data to prevent deadlocks */
-				journal_do_submit_data(wbuf, bufs);
+				journal_do_submit_data(wbuf, bufs, write_op);
 				bufs = 0;
 				lock_buffer(bh);
 				spin_lock(&journal->j_list_lock);
@@ -256,7 +259,7 @@ write_out_data:
 			jbd_unlock_bh_state(bh);
 			if (bufs == journal->j_wbufsize) {
 				spin_unlock(&journal->j_list_lock);
-				journal_do_submit_data(wbuf, bufs);
+				journal_do_submit_data(wbuf, bufs, write_op);
 				bufs = 0;
 				goto write_out_data;
 			}
@@ -286,7 +289,7 @@ write_out_data:
 		}
 	}
 	spin_unlock(&journal->j_list_lock);
-	journal_do_submit_data(wbuf, bufs);
+	journal_do_submit_data(wbuf, bufs, write_op);
 
 	return err;
 }
@@ -315,6 +318,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
+	int write_op = WRITE;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -347,6 +351,8 @@ void journal_commit_transaction(journal_t *journal)
 	spin_lock(&journal->j_state_lock);
 	commit_transaction->t_state = T_LOCKED;
 
+	if (commit_transaction->t_synchronous_commit)
+		write_op = WRITE_SYNC;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
 		DEFINE_WAIT(wait);
@@ -431,7 +437,8 @@ void journal_commit_transaction(journal_t *journal)
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-	err = journal_submit_data_buffers(journal, commit_transaction);
+	err = journal_submit_data_buffers(journal, commit_transaction,
+					  write_op);
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -660,7 +667,7 @@ start_journal_io:
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(WRITE, bh);
+				submit_bh(write_op, bh);
 			}
 			cond_resched();
 
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index e6a1174..ed886e6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1440,6 +1440,8 @@ int journal_stop(handle_t *handle)
 		}
 	}
 
+	if (handle->h_sync)
+		transaction->t_synchronous_commit = 1;
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 64246dc..2c69431 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -552,6 +552,11 @@ struct transaction_s
 	 */
 	int t_handle_count;
 
+	/*
+	 * This transaction is being forced and some process is
+	 * waiting for it to finish.
+	 */
+	int t_synchronous_commit:1;
 };
 
 /**
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode
  2009-04-03  7:01   ` [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync() Theodore Ts'o
@ 2009-04-03  7:01     ` Theodore Ts'o
  2009-04-03  7:01       ` [PATCH 4/4] ext3: Add replace-on-rename " Theodore Ts'o
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-03  7:01 UTC (permalink / raw)
  To: torvalds
  Cc: Linux Kernel Developers List, Ext4 Developers List, Theodore Ts'o

In data=writeback mode, start an asynchronous flush when closing a
file which had been previously truncated down to zero.  This lowers
the probability of data loss in the case of applications that attempt
to replace a file using truncate.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext3/file.c          |    4 ++++
 fs/ext3/inode.c         |    3 +++
 include/linux/ext3_fs.h |    1 +
 3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 3be1e06..4a04cbb 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -33,6 +33,10 @@
  */
 static int ext3_release_file (struct inode * inode, struct file * filp)
 {
+	if (EXT3_I(inode)->i_state & EXT3_STATE_FLUSH_ON_CLOSE) {
+		filemap_flush(inode->i_mapping);
+		EXT3_I(inode)->i_state &= ~EXT3_STATE_FLUSH_ON_CLOSE;
+	}
 	/* if we are the last writer on the inode, drop the block reservation */
 	if ((filp->f_mode & FMODE_WRITE) &&
 			(atomic_read(&inode->i_writecount) == 1))
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..0f5bca0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -2346,6 +2346,9 @@ void ext3_truncate(struct inode *inode)
 	if (!ext3_can_truncate(inode))
 		return;
 
+	if (inode->i_size == 0 && ext3_should_writeback_data(inode))
+		ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;
+
 	/*
 	 * We have to lock the EOF page here, because lock_page() nests
 	 * outside journal_start().
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index dd495b8..d2630c5 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -208,6 +208,7 @@ static inline __u32 ext3_mask_flags(umode_t mode, __u32 flags)
 #define EXT3_STATE_JDATA		0x00000001 /* journaled data exists */
 #define EXT3_STATE_NEW			0x00000002 /* inode is newly created */
 #define EXT3_STATE_XATTR		0x00000004 /* has in-inode xattrs */
+#define EXT3_STATE_FLUSH_ON_CLOSE	0x00000008
 
 /* Used to pass group descriptor data when online resize is done */
 struct ext3_new_group_input {
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 4/4] ext3: Add replace-on-rename hueristics for data=writeback mode
  2009-04-03  7:01     ` [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode Theodore Ts'o
@ 2009-04-03  7:01       ` Theodore Ts'o
  0 siblings, 0 replies; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-03  7:01 UTC (permalink / raw)
  To: torvalds
  Cc: Linux Kernel Developers List, Ext4 Developers List, Theodore Ts'o

In data=writeback mode, start an asynchronous flush when renaming a
file on top of an already-existing file.  This lowers the probability
of data loss in the case of applications that attempt to replace a
file via using rename().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext3/namei.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 4db4ffa..ab98a66 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2265,7 +2265,7 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
 	struct inode * old_inode, * new_inode;
 	struct buffer_head * old_bh, * new_bh, * dir_bh;
 	struct ext3_dir_entry_2 * old_de, * new_de;
-	int retval;
+	int retval, flush_file = 0;
 
 	old_bh = new_bh = dir_bh = NULL;
 
@@ -2401,6 +2401,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
 		ext3_mark_inode_dirty(handle, new_inode);
 		if (!new_inode->i_nlink)
 			ext3_orphan_add(handle, new_inode);
+		if (ext3_should_writeback_data(new_inode))
+			flush_file = 1;
 	}
 	retval = 0;
 
@@ -2409,6 +2411,8 @@ end_rename:
 	brelse (old_bh);
 	brelse (new_bh);
 	ext3_journal_stop(handle);
+	if (retval == 0 && flush_file)
+		filemap_flush(old_inode->i_mapping);
 	return retval;
 }
 
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03  7:01 [GIT PULL] Ext3 latency fixes Theodore Ts'o
  2009-04-03  7:01 ` [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks Theodore Ts'o
@ 2009-04-03 18:24 ` Linus Torvalds
  2009-04-03 18:47   ` Jens Axboe
                     ` (2 more replies)
  1 sibling, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-03 18:24 UTC (permalink / raw)
  To: Theodore Ts'o, Jens Axboe
  Cc: Linux Kernel Developers List, Ext4 Developers List



On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> 
> Please pull from:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

Thanks, pulled. I'll be interested to see how it feels. Will report back 
after I've rebuild and gone through a few more emails.

One thing I started wondering about in your changes to start using 
WRITE_SYNC is that I'm getting closer to thinking that we did the whole 
WRITE-vs-WRITE_SYNC thing the wrong way around.

Now, it's clearly true that non-synchronous writes are hopefully always 
the common case, so in that sense it makes sense to think of "WRITE" as 
the default non-critical case, and then make the (fewer) WRITE_SYNC cases 
be the special case.

But at the same time, I now suspect that we could actually have solved 
this problem more easily by just doing things the other way around: make 
the default "WRITE" be the high-priority one (to match "READ"), and then 
just explicitly marking the data writes with "WRITE_ASYNC".

Why? Because I think that with all the writes sprinkled around in random 
places, it's probably _easier_ to find the bulk writes that cause the 
biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk, 
they may be the common case, but they also tend to be the case where we 
write with generic routines (eg the whole "do_writepages()" thing).

So the VFS layer tends to already do much of the bulk writeout, and maybe 
we would have been better off just changing those to ASYNC and leaving any 
more specialized cases as the SYNC case? That would have avoided a lot of 
this effort at the filesystem level. We'd just assume that the default 
filesystem-specific writes tend to all be SYNC.

Comments?

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
@ 2009-04-03 18:47   ` Jens Axboe
  2009-04-03 19:13     ` Theodore Tso
  2009-04-03 21:01     ` Chris Mason
  2009-04-03 19:02   ` Linus Torvalds
  2009-04-03 19:54   ` Theodore Tso
  2 siblings, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-03 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List

On Fri, Apr 03 2009, Linus Torvalds wrote:
> 
> 
> On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> > 
> > Please pull from:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
> 
> Thanks, pulled. I'll be interested to see how it feels. Will report back 
> after I've rebuild and gone through a few more emails.

I have one question, didn't see this series before... Ted, what kind of
tests did you run with this and on what? Currently one has to be careful
with WRITE_SYNC, as it also implies an immediate unplug of the device.
So not only does it flag the priority as sync, it'll also kick things
off immediately. We had a nasty regression in performance in a few
revisions of the kernel due to this, sqlite performance was basically 4
times as bad as before we did WRITE_SYNC in sync_dirty_buffer(). So I'd
be curious what kind of testing was done with the patch series before
submitting it.

We should probably just dump the unplug bit from WRITE_SYNC and make
sure we do those explicitly after submission instead.

> One thing I started wondering about in your changes to start using 
> WRITE_SYNC is that I'm getting closer to thinking that we did the whole 
> WRITE-vs-WRITE_SYNC thing the wrong way around.
> 
> Now, it's clearly true that non-synchronous writes are hopefully always 
> the common case, so in that sense it makes sense to think of "WRITE" as 
> the default non-critical case, and then make the (fewer) WRITE_SYNC cases 
> be the special case.
> 
> But at the same time, I now suspect that we could actually have solved 
> this problem more easily by just doing things the other way around: make 
> the default "WRITE" be the high-priority one (to match "READ"), and then 
> just explicitly marking the data writes with "WRITE_ASYNC".
> 
> Why? Because I think that with all the writes sprinkled around in random 
> places, it's probably _easier_ to find the bulk writes that cause the 
> biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk, 
> they may be the common case, but they also tend to be the case where we 
> write with generic routines (eg the whole "do_writepages()" thing).
> 
> So the VFS layer tends to already do much of the bulk writeout, and maybe 
> we would have been better off just changing those to ASYNC and leaving any 
> more specialized cases as the SYNC case? That would have avoided a lot of 
> this effort at the filesystem level. We'd just assume that the default 
> filesystem-specific writes tend to all be SYNC.

Makes some sense, but we have to be really careful with SYNC writes.
It's important that it really be things that are immediately waited upon
and not "important" writes, since otherwise it'll wreak havoc with the
responsiveness of our reads.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
  2009-04-03 18:47   ` Jens Axboe
@ 2009-04-03 19:02   ` Linus Torvalds
  2009-04-03 20:41     ` Linus Torvalds
  2009-04-03 19:54   ` Theodore Tso
  2 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2009-04-03 19:02 UTC (permalink / raw)
  To: Theodore Ts'o, Jens Axboe
  Cc: Linux Kernel Developers List, Ext4 Developers List



On Fri, 3 Apr 2009, Linus Torvalds wrote:

> 
> 
> On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> > 
> > Please pull from:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
> 
> Thanks, pulled. I'll be interested to see how it feels. Will report back 
> after I've rebuild and gone through a few more emails.

Hmm.

The "overwrite" behavior may well be better, but it was smooth enough 
beforehand too (never having more than ~8MB dirty). The "create big file 
and sync" workload causes huge fsync pauses, though. IOW, try with

	while :
	do
		time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync"
	done

and even really small fsync's end up being at the end of all that 
unrelated activity, and you see things like

    fsync(7)                                = 0 <32.756308>

(that was my "switch email folders with update" test case, the full trace 
for that file descriptor is

    open("/home/torvalds/mail/git-list", O_RDWR) = 7 <0.000010>
    fstatfs(7, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=19230104, f_bfree=13853292, f_bavail=12876440, f_files=4890624
    flock(7, LOCK_EX)                       = 0 <0.000009>
    fstat(7, {st_mode=S_IFREG|0600, st_size=54231534, ...}) = 0 <0.000005>
    lseek(7, 0, SEEK_SET)                   = 0 <0.000006>
    write(7, "From MAILER-DAEMON Fri Apr  3 11:"..., 554) = 554 <0.000012>
    lseek(7, 54202529, SEEK_SET)            = 54202529 <0.000007>
    read(7, "From torvalds@linux-foundation.or"..., 66) = 66 <0.000008>
    lseek(7, 54202595, SEEK_SET)            = 54202595 <0.000006>
    read(7, "Return-Path: <git-owner@vger.kern"..., 2915) = 2915 <0.000007>
    lseek(7, 54202529, SEEK_SET)            = 54202529 <0.000005>
    write(7, "From torvalds@linux-foundation.or"..., 2981) = 2981 <0.000009>
    ftruncate(7, 54231534)                  = 0 <0.000008>
    fsync(7)                                = 0 <32.756308>
    close(7)                                = 0 <0.000006>

so it had done just a few kB of writes, but because it ended up behind
the humongous backlog of 'bigfile' it didn't much help.

Also, it's maybe worth noting that you don't actually need a 2GB file to 
trigger this behavior. Change that "count=256" into a "count=16", and you 
now have a simulation of just writing 128MB at a time, with a "sync" in 
between to make sure it hits the disk.  It makes the pauses smaller, but 
they are still several seconds.

(That, btw, is probably more the kind of thing I see when doign a "yum 
update". I assume a package manager would do exactly that kind of "unpack 
files and sync" in a loop).

Btw, I assume this same thing holds true for ext4 too? Because it shows 
how two different "sync" operations interact, and one kills the 
performance of the other one. So as long as there is a _single_ fsync() 
user, you're fine. It's when you get more than one...

Again, I have that Intel SSD that should do pretty reliably 40+MB/s even 
with really nasty write patterns, so I do need several hundred megs to 
really see painful pauses. On a slower disk you'd need much less).

			Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 18:47   ` Jens Axboe
@ 2009-04-03 19:13     ` Theodore Tso
  2009-04-03 21:01     ` Chris Mason
  1 sibling, 0 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-03 19:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Fri, Apr 03, 2009 at 08:47:29PM +0200, Jens Axboe wrote:
> 
> I have one question, didn't see this series before... Ted, what kind of
> tests did you run with this and on what? 

Well, I've been using it in my default kernel for about a month, and
I've noticed any massive performance issues.  (4GB X61s, with an SSD
as my primary drive and a 5400 rpm disk as my secondary drive).  I
also did testing on my crash and burn machine N270 ATOM netbook, 512
MB memory, 5400rpm drive.

Most of my testing has been on the fsync latency issue plus my
standard filesystem regression for stability (dbench, fsx, etc.)  I
have not done an exhaustive series of performance tests, so you bring
up a good point.

> Currently one has to be careful with WRITE_SYNC, as it also implies
> an immediate unplug of the device.

Yeah, I can definitely see how this could be problematic.  What I
really wanted was to separate out the priority of "things which cause
the user to wait and fume" from "things that are just being dumped out
to disk in the background".  Unplugging the device is definitely *not*
a good thing here, and you're right, it would be better if we fixed
the relatively few places in the code which use WRITE_SYNC to unplug
the queue after all of the writes are submitted.

    	  						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
  2009-04-03 18:47   ` Jens Axboe
  2009-04-03 19:02   ` Linus Torvalds
@ 2009-04-03 19:54   ` Theodore Tso
  2 siblings, 0 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-03 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux Kernel Developers List, Ext4 Developers List

On Fri, Apr 03, 2009 at 11:24:50AM -0700, Linus Torvalds wrote:
> But at the same time, I now suspect that we could actually have solved 
> this problem more easily by just doing things the other way around: make 
> the default "WRITE" be the high-priority one (to match "READ"), and then 
> just explicitly marking the data writes with "WRITE_ASYNC".
> 
> Why? Because I think that with all the writes sprinkled around in random 
> places, it's probably _easier_ to find the bulk writes that cause the 
> biggest issues, and just fix _those_ to be WRITE_ASYNC. They may be bulk, 
> they may be the common case, but they also tend to be the case where we 
> write with generic routines (eg the whole "do_writepages()" thing).
> 
> So the VFS layer tends to already do much of the bulk writeout, and maybe 
> we would have been better off just changing those to ASYNC and leaving any 
> more specialized cases as the SYNC case? That would have avoided a lot of 
> this effort at the filesystem level. We'd just assume that the default 
> filesystem-specific writes tend to all be SYNC.

Well, most filesystem-specific writes tend all to be ASYNC; it's only
those related to commits and fsync() which are SYNC.  Ext3 is unusual
in that data=ordered and the physical-block journalling design of the
jbd layer means that we actually have a much larger number of blocks
that need to be written out synchronously than most other filesystems.
But even so, the number of callsites that I needed to change weren't
that large; in fact, over half of them weren't in the filesystem at
all, but in the page writeback code, since fsync() and data=ordered
both need to wait for the inodes's pages to be flushed out to disk,
and that's all done in common code.  The other 40% was in the jbd's
commit code, while we are writing out the journal buffers.

I suspect the more important thing to address is the fact that
WRITE_SYNC unplugs the block device queue, and we would be better off
separating marking a particular I/O as "a user is waiting for this"
from "unplug the device queue now".  That will hopefully improve
things even more.

					- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 19:02   ` Linus Torvalds
@ 2009-04-03 20:41     ` Linus Torvalds
  2009-04-04 13:57       ` Theodore Tso
  0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2009-04-03 20:41 UTC (permalink / raw)
  To: Theodore Ts'o, Jens Axboe
  Cc: Linux Kernel Developers List, Ext4 Developers List



On Fri, 3 Apr 2009, Linus Torvalds wrote:
> 
> The "overwrite" behavior may well be better, but it was smooth enough 
> beforehand too (never having more than ~8MB dirty). The "create big file 
> and sync" workload causes huge fsync pauses, though. IOW, try with
> 
> 	while :
> 	do
> 		time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync"
> 	done
> 
> and even really small fsync's end up being at the end of all that 
> unrelated activity, and you see things like
> 
>     fsync(7)                                = 0 <32.756308>

Hmm. So I decided to try with "data=writeback" to see if it really makes 
that big of a difference. It does help, but I still easily trigger 
multi-second pauses:

   fsync(4)                                = 0 <2.447926>
   fsync(4)                                = 0 <4.275472>
   fsync(4)                                = 0 <3.731948>
   fsync(4)                                = 0 <4.020839>
   fsync(6)                                = 0 <3.482735>
   fsync(6)                                = 0 <5.819923>

even though the system _should_ be able to write back the 'bigfile' 
datablocks without any ordering constraint on the fsync.

So at a guess, it now avoids some nasty journal writing ordering issue 
where it has to wait for the previous transaction, and it's probably now 
purely an IO ordering issue.

This is all with your ext3 work, btw. But I also added "rm bigfile" at the 
end of the loop (so that it shouldn't trigge any "write out bigfile early" 
logic), and that didn't seem to make any difference. 

Are we perhaps ending up doing those regular 'bigfile' writes as 
WRITE_SYNC, just because of the global "sync()" call? That's probably a 
bad idea. A "sync" is about pure throughput. It's not about latency like 
"fsync()" is.

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 18:47   ` Jens Axboe
  2009-04-03 19:13     ` Theodore Tso
@ 2009-04-03 21:01     ` Chris Mason
  1 sibling, 0 replies; 69+ messages in thread
From: Chris Mason @ 2009-04-03 21:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List,
	Ext4 Developers List

On Fri, 2009-04-03 at 20:47 +0200, Jens Axboe wrote:
> On Fri, Apr 03 2009, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 3 Apr 2009, Theodore Ts'o wrote:
> > > 
> > > Please pull from:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes
> > 
> > Thanks, pulled. I'll be interested to see how it feels. Will report back 
> > after I've rebuild and gone through a few more emails.
> 
> I have one question, didn't see this series before... Ted, what kind of
> tests did you run with this and on what? Currently one has to be careful
> with WRITE_SYNC, as it also implies an immediate unplug of the device.
> So not only does it flag the priority as sync, it'll also kick things
> off immediately. We had a nasty regression in performance in a few
> revisions of the kernel due to this, sqlite performance was basically 4
> times as bad as before we did WRITE_SYNC in sync_dirty_buffer(). So I'd
> be curious what kind of testing was done with the patch series before
> submitting it.
> 

I had run a few tests against this...the assumption was that since ext3
doesn't have writepages I should be able to easily show using WRITE_SYNC
made for fewer merges and lower performance.  But, the write cache on my
array happily ate the smaller requests and came out just as fast.

So, your sqlite example made me think of something different:

mkfs.ext3 /dev/sdb
mount /dev/sdb /mnt
cd /mnt
tar xf /local/linux-2.6.tar
sync
sync
sync
echo 3 > /proc/sys/vm/drop_caches

# step one, start an O_SYNC writer to trigger lots of WRITE_SYNC
# this 32GB file is big enough that my drive can't finish it before the
# tar is done.
dd if=/dev/zero of=/mnt/foo bs=1M count=32000 oflag=sync &

# step two, time how long it takes to read the linux-2.6 dir
# from above
time tar cf /dev/zero /mnt/linux-2.6 &
wait %2
kill %1

/local/linux-2.6.tar is a full kernel tree after compiling with
debuginfo turned on and including the mercurial history files.  It has
2.2GB of files.  The goal here is to time how long it takes to read the
big directory tree in the face of a writer doing constant WRITE_SYNC
writes.  Time to read the directory tree:

With Ted's ext3 patches: 8m2s
Plain 2.6.29:            4m9s

-chris



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-03 20:41     ` Linus Torvalds
@ 2009-04-04 13:57       ` Theodore Tso
  2009-04-04 15:16         ` Jens Axboe
  2009-04-04 22:13         ` Linus Torvalds
  0 siblings, 2 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-04 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux Kernel Developers List, Ext4 Developers List

On Fri, Apr 03, 2009 at 01:41:54PM -0700, Linus Torvalds wrote:
> 
> Hmm. So I decided to try with "data=writeback" to see if it really makes 
> that big of a difference. It does help, but I still easily trigger 
> multi-second pauses

Using ext3 data=writeback, your "write big (2GB) file and sync" as the
background workload, and fsync-tester, I was able to reduce the
latency down to under 1 second...

fsync time: 0.1717
fsync time: 0.0205
fsync time: 0.1814
fsync time: 0.7408
fsync time: 0.1955
fsync time: 0.0327
fsync time: 0.0456
fsync time: 0.0563

...by doing two things: (1) Applying the attached patch, which fixes
one last critical place where we were using WRITE instead of
WRITE_SYNC --- the commit record was being written by
sync_dirty_buffer(), and this of _all_ places needs to use WRITE_SYNC,
since it waits for the buffer to be written write after submitting the
I/O, and (2) using the anticipatory I/O scheduler instead of the cfq
I/O scheduler.

(1) brought things down from 2-3.5 seconds on my system to 1-2
seconds, and (2) brought things down to what you see above.  I think
what is going on with the cfq scheduler is that it's using time slices
to make sure sync and async operations never completely starve each
other out, and in this case we need to tune the I/O scheduling
parameters so that for this workload, the synchronous operations don't
end up getting impeded by the asynchronous writes caused by background
"write big file and sync" task.

In any case, Jens may want to look at this test case (ext3
data=writeback, 'while : ; do time sh -c "dd if=/dev/zero of=bigfile
bs=8M count=256 ; sync"; done', and fsync-tester) as a good way to see
how cfq might be improved.

On another thread, it's been commented that there are still workloads
for which people are quietly switching from CFQ to AS, and this is
bad, because it causes us not to try to figure out why our default I/O
scheduler still as these 1% of cases where people need to use another
scheduler.  Well, here's one such case which is relatively easy to
reproduce.

> Are we perhaps ending up doing those regular 'bigfile' writes as 
> WRITE_SYNC, just because of the global "sync()" call? That's probably a 
> bad idea. A "sync" is about pure throughput. It's not about latency like 
> "fsync()" is.

Well, at least on my system where I did this round of testing (4gig
X61s with a 5400 RPM laptop drive), most of the time we weren't
writing the bigfile writes because of the sync, but because dd and
pdflush processes was trying to flush out the dirty pages from the big
write operation.  At the moment where "dd" completes and the "sync"
command is executed, the fsync latency jumped up to about 4-5 seconds
before this last round of changes.  After adding the attached patch
and switching to the AS I/O scheduler, at the moment of the sync the
fsync latency was just over a second (1.1 to 1.2 seconds).  The rest
of the time we are averaging between a 1/4 and a 1/3 of a second, with
rare fsync latency spikes up to about 3/4 of a second, as show at the
beginning of this message.

(Maybe on a system with a lot more memory, the dirty pages don't start
getting flushed to disk until the sync command, but that's not what
I'm seeing on my 4 gig laptop.)

In answer to your question, "sync" does the writes in two passes;
first it pushes out writes with wbc.sync_mode set to WB_SYNC_NONE, and
then it calls the page writeback routines a second time with
WB_SYNC_ALL.  So most of the writes should go out with WRITE, except
that the page writeback routines aren't as aggressive about pushing
out _all_ pages in WB_SYNC_NONE, so I believe some of the pages would
still be written on the WB_SYNC_ALL, and thus would go out using
WRITE_SYNc.  This is based on 2-3 month old memory of how things
worked in the page-writeback routines, which is the last time I traced
the very deep call trees involved in this area.  I'd have to run a
blktrace experiment to see for sure how many of the writes were going
out as WRITE vs. WRITE_SYNC in the case of the 'sync' command.

In any case, I recommend you take the following attached patch, and
then try out ext3 data=writeback with anticipatory I/O scheduler.
Hopefully you'll be pleased with the results.

							- Ted

>From 6d293d2aa42d43c120f113bde55f7b0d6f3f35ae Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sat, 4 Apr 2009 09:17:38 -0400
Subject: [PATCH] sync_dirty_buffer: Use WRITE_SYNC instead of WRITE

The sync_dirty_buffer() function submits a buffer for write, and then
synchronously waits for it.  It clearly should use WRITE_SYNC instead
of WRITE.  This significantly reduces ext3's fsync() latency when
there is a huge background task writing data asyncronously in the
background, since ext3 uses sync_dirty_buffer() to write the commit
block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2ed4b68..78ed086 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3038,7 +3038,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
 	if (test_clear_buffer_dirty(bh)) {
 		get_bh(bh);
 		bh->b_end_io = end_buffer_write_sync;
-		ret = submit_bh(WRITE, bh);
+		ret = submit_bh(WRITE_SYNC, bh);
 		wait_on_buffer(bh);
 		if (buffer_eopnotsupp(bh)) {
 			clear_buffer_eopnotsupp(bh);
-- 
1.5.6.3


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 13:57       ` Theodore Tso
@ 2009-04-04 15:16         ` Jens Axboe
  2009-04-04 15:57           ` Linus Torvalds
  2009-04-04 19:18           ` Theodore Tso
  2009-04-04 22:13         ` Linus Torvalds
  1 sibling, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-04 15:16 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Fri, Apr 03, 2009 at 01:41:54PM -0700, Linus Torvalds wrote:
> > 
> > Hmm. So I decided to try with "data=writeback" to see if it really makes 
> > that big of a difference. It does help, but I still easily trigger 
> > multi-second pauses
> 
> Using ext3 data=writeback, your "write big (2GB) file and sync" as the
> background workload, and fsync-tester, I was able to reduce the
> latency down to under 1 second...
> 
> fsync time: 0.1717
> fsync time: 0.0205
> fsync time: 0.1814
> fsync time: 0.7408
> fsync time: 0.1955
> fsync time: 0.0327
> fsync time: 0.0456
> fsync time: 0.0563
> 
> ...by doing two things: (1) Applying the attached patch, which fixes
> one last critical place where we were using WRITE instead of
> WRITE_SYNC --- the commit record was being written by
> sync_dirty_buffer(), and this of _all_ places needs to use WRITE_SYNC,
> since it waits for the buffer to be written write after submitting the
> I/O, and (2) using the anticipatory I/O scheduler instead of the cfq
> I/O scheduler.
> 
> (1) brought things down from 2-3.5 seconds on my system to 1-2
> seconds, and (2) brought things down to what you see above.  I think
> what is going on with the cfq scheduler is that it's using time slices
> to make sure sync and async operations never completely starve each
> other out, and in this case we need to tune the I/O scheduling
> parameters so that for this workload, the synchronous operations don't
> end up getting impeded by the asynchronous writes caused by background
> "write big file and sync" task.
> 
> In any case, Jens may want to look at this test case (ext3
> data=writeback, 'while : ; do time sh -c "dd if=/dev/zero of=bigfile
> bs=8M count=256 ; sync"; done', and fsync-tester) as a good way to see
> how cfq might be improved.
> 
> On another thread, it's been commented that there are still workloads
> for which people are quietly switching from CFQ to AS, and this is
> bad, because it causes us not to try to figure out why our default I/O
> scheduler still as these 1% of cases where people need to use another
> scheduler.  Well, here's one such case which is relatively easy to
> reproduce.
> 
> > Are we perhaps ending up doing those regular 'bigfile' writes as 
> > WRITE_SYNC, just because of the global "sync()" call? That's probably a 
> > bad idea. A "sync" is about pure throughput. It's not about latency like 
> > "fsync()" is.
> 
> Well, at least on my system where I did this round of testing (4gig
> X61s with a 5400 RPM laptop drive), most of the time we weren't
> writing the bigfile writes because of the sync, but because dd and
> pdflush processes was trying to flush out the dirty pages from the big
> write operation.  At the moment where "dd" completes and the "sync"
> command is executed, the fsync latency jumped up to about 4-5 seconds
> before this last round of changes.  After adding the attached patch
> and switching to the AS I/O scheduler, at the moment of the sync the
> fsync latency was just over a second (1.1 to 1.2 seconds).  The rest
> of the time we are averaging between a 1/4 and a 1/3 of a second, with
> rare fsync latency spikes up to about 3/4 of a second, as show at the
> beginning of this message.
> 
> (Maybe on a system with a lot more memory, the dirty pages don't start
> getting flushed to disk until the sync command, but that's not what
> I'm seeing on my 4 gig laptop.)
> 
> In answer to your question, "sync" does the writes in two passes;
> first it pushes out writes with wbc.sync_mode set to WB_SYNC_NONE, and
> then it calls the page writeback routines a second time with
> WB_SYNC_ALL.  So most of the writes should go out with WRITE, except
> that the page writeback routines aren't as aggressive about pushing
> out _all_ pages in WB_SYNC_NONE, so I believe some of the pages would
> still be written on the WB_SYNC_ALL, and thus would go out using
> WRITE_SYNc.  This is based on 2-3 month old memory of how things
> worked in the page-writeback routines, which is the last time I traced
> the very deep call trees involved in this area.  I'd have to run a
> blktrace experiment to see for sure how many of the writes were going
> out as WRITE vs. WRITE_SYNC in the case of the 'sync' command.
> 
> In any case, I recommend you take the following attached patch, and
> then try out ext3 data=writeback with anticipatory I/O scheduler.
> Hopefully you'll be pleased with the results.
> 
> 							- Ted
> 
> From 6d293d2aa42d43c120f113bde55f7b0d6f3f35ae Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Sat, 4 Apr 2009 09:17:38 -0400
> Subject: [PATCH] sync_dirty_buffer: Use WRITE_SYNC instead of WRITE
> 
> The sync_dirty_buffer() function submits a buffer for write, and then
> synchronously waits for it.  It clearly should use WRITE_SYNC instead
> of WRITE.  This significantly reduces ext3's fsync() latency when
> there is a huge background task writing data asyncronously in the
> background, since ext3 uses sync_dirty_buffer() to write the commit
> block.

Sorry for not commenting on the rest of your email, I don't have a lot
of spare time on my hands. I'll get to that.

Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
write regressions (sqlite performance drops by a factor of 4-5). Do a
git log on fs/buffer.c and see the original patch (which does what your
patch does) and the later revert. No idea why you are now suggestion
making that exact change?!

Low latency is nice, but not at the cost of 4-5x throughput for real
world cases. Lets please try and keep a reasonable focus here, things
need to be tested widely before we go and sprinkle sync magic
everywhere. I can't rule out that CFQ will need adjusting, cutting down
to basics it looks at sync vs async like AS does.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 15:16         ` Jens Axboe
@ 2009-04-04 15:57           ` Linus Torvalds
  2009-04-04 16:06             ` Linus Torvalds
                               ` (2 more replies)
  2009-04-04 19:18           ` Theodore Tso
  1 sibling, 3 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 15:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Jens Axboe wrote:
> 
> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> write regressions (sqlite performance drops by a factor of 4-5). Do a
> git log on fs/buffer.c and see the original patch (which does what your
> patch does) and the later revert. No idea why you are now suggestion
> making that exact change?!

Jens, if I can re-create the 'fsync' times (I haven't yet), then the 
default scheduler _will_ be switched to AS.

> Low latency is nice, but not at the cost of 4-5x throughput for real
> world cases.

I'm sorry, but that fsync thing _is_ a real-world case, and it's the one 
that a hell of a lot more people care about than some idiotic sqlite 
throughput issue.

You have a test-case now. Consider it a priority, or consider CFQ to be a 
"for crazy servers that only care about throughput".

Quite frankly, the fact that I can see _seconds_ of latencies with a 
really good SSD is not acceptable. The fact that it is by design is even 
less so.

Latency is more important than throughput. It's that simple.

			Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 15:57           ` Linus Torvalds
@ 2009-04-04 16:06             ` Linus Torvalds
  2009-04-04 17:36               ` Jens Axboe
  2009-04-04 17:34             ` Jens Axboe
  2009-04-07 15:39             ` Indan Zupancic
  2 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 16:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Linus Torvalds wrote:

> 
> 
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> > 
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert. No idea why you are now suggestion
> > making that exact change?!
> 
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the 
> default scheduler _will_ be switched to AS.

Btw, that patch is "obviously correct".

That write we're submitting is very much a synchronous write. After all, 
the code is literally

	ret = submit_bh(WRITE, bh);
	wait_on_buffer(bh);

and it just doesn't get any more synchronous than that. If we don't start 
the IO immediately (since we're _waiting_ for it immediately), we're 
broken. 

Now, if we need to fix some mysql throughput issue as a result, then I'd 
suggest that we look at whether "sync_dirty_buffer()" is sometimes called 
when it doesn't need to be od (b) whether perhaps the unplugging behavior 
is simply buggy in some other way.

But Ted's patch makes so much sense on a purely conceptual level, that 
when you look at the patch, you should almost not even need to see the 
performance numbers to know it's right. But together with the numbers Ted 
posted, it's a total no-brainer. CFQ is clearly broken here, and it's 
pretty clear that apparently CFQ has been tuned (improperly) purely for 
throughput.

			Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 15:57           ` Linus Torvalds
  2009-04-04 16:06             ` Linus Torvalds
@ 2009-04-04 17:34             ` Jens Axboe
  2009-04-04 17:44               ` Linus Torvalds
                                 ` (2 more replies)
  2009-04-07 15:39             ` Indan Zupancic
  2 siblings, 3 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-04 17:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Linus Torvalds wrote:
> 
> 
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> > 
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert. No idea why you are now suggestion
> > making that exact change?!
> 
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the 
> default scheduler _will_ be switched to AS.

Linus, I'm not aware of a difference here between AS and CFQ. If there
is, it's surely a bug and it will be fixed ASAP. The email I wrote has
nothing to do with CFQ performance, it was a general observation on what
happened with an identical patch.

> > Low latency is nice, but not at the cost of 4-5x throughput for real
> > world cases.
> 
> I'm sorry, but that fsync thing _is_ a real-world case, and it's the one 
> that a hell of a lot more people care about than some idiotic sqlite 
> throughput issue.

sqlite is just one case, I'm sure there are others. My point is that we
should make sure that we don't regress on the throughput side. It's a
trade off, we don't want throughput to fall through the floor either.

> You have a test-case now. Consider it a priority, or consider CFQ to be a 
> "for crazy servers that only care about throughput".

CFQ was never for crazy servers, it was very much about interactiveness
from the very beginning. So if it's broken on CFQ, it will of course get
fixed right away.

> Quite frankly, the fact that I can see _seconds_ of latencies with a 
> really good SSD is not acceptable. The fact that it is by design is even 
> less so.

Agree, multi-second latencies is not acceptable.

> Latency is more important than throughput. It's that simple.

It's really not that simple, otherwise the schedulers would be much
simpler. It's pretty easy to get good latency if you disregard any
throughput concerns, that'll never work in the real world. Latency is
definitely extremely important, but simply stating that latency is the
one and only factor of importance is just too simplistic.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 16:06             ` Linus Torvalds
@ 2009-04-04 17:36               ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-04 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Linus Torvalds wrote:
> 
> 
> On Sat, 4 Apr 2009, Linus Torvalds wrote:
> 
> > 
> > 
> > On Sat, 4 Apr 2009, Jens Axboe wrote:
> > > 
> > > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > > git log on fs/buffer.c and see the original patch (which does what your
> > > patch does) and the later revert. No idea why you are now suggestion
> > > making that exact change?!
> > 
> > Jens, if I can re-create the 'fsync' times (I haven't yet), then the 
> > default scheduler _will_ be switched to AS.
> 
> Btw, that patch is "obviously correct".
> 
> That write we're submitting is very much a synchronous write. After all, 
> the code is literally
> 
> 	ret = submit_bh(WRITE, bh);
> 	wait_on_buffer(bh);
> 
> and it just doesn't get any more synchronous than that. If we don't start 
> the IO immediately (since we're _waiting_ for it immediately), we're 
> broken. 
> 
> Now, if we need to fix some mysql throughput issue as a result, then I'd 
> suggest that we look at whether "sync_dirty_buffer()" is sometimes called 
> when it doesn't need to be od (b) whether perhaps the unplugging behavior 
> is simply buggy in some other way.
> 
> But Ted's patch makes so much sense on a purely conceptual level, that 
> when you look at the patch, you should almost not even need to see the 
> performance numbers to know it's right. But together with the numbers Ted 
> posted, it's a total no-brainer. CFQ is clearly broken here, and it's 
> pretty clear that apparently CFQ has been tuned (improperly) purely for 
> throughput.

I agree, hence I previously wrote and submitted an IDENTICAL patch. I'll
go and test things, not sure why you think that AS and CFQ perform very
differently here, to my knowledge no such postings exist for this test
case. And I'll state again that if they do, of course I'll look into
that and fix it.

One thing that may be of concern is the immediate unplug, but on the
other hand, we do an immediate wait which would unplug anyway. So if we
just look at the sqlite regression, I'm sure it'll be pretty easy to pin
point with some blktrace data.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 17:34             ` Jens Axboe
@ 2009-04-04 17:44               ` Linus Torvalds
  2009-04-04 18:00                 ` Trenton D. Adams
  2009-04-04 18:01                 ` Jens Axboe
  2009-04-04 20:18               ` Ingo Molnar
  2009-04-04 20:56               ` Arjan van de Ven
  2 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 17:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Jens Axboe wrote:
> > 
> > I'm sorry, but that fsync thing _is_ a real-world case, and it's the one 
> > that a hell of a lot more people care about than some idiotic sqlite 
> > throughput issue.
> 
> sqlite is just one case, I'm sure there are others. My point is that we
> should make sure that we don't regress on the throughput side. It's a
> trade off, we don't want throughput to fall through the floor either.

Jens, we _have_ regressed on the latency side. Everybody agrees. 

Also, I may be odd, but I really do think latency is more important than 
throughput. When my disk has latencies in the sub-milliseconds, I simply 
do not think it is _acceptable_ to have hickups that affect my workload in 
human-visible terms. 

You say sqlite might regress by 4-5x. But Ted's numbers improve latencies 
by mor than that. I haven't re-created them yet myself (still reading 
email), but the point is, 4-5x may sound bad to you, but turn it around: 
the current latency situation is _really_ bad. If we can fix it, we 
definitely should.

> > Quite frankly, the fact that I can see _seconds_ of latencies with a 
> > really good SSD is not acceptable. The fact that it is by design is even 
> > less so.
> 
> Agree, multi-second latencies is not acceptable.

I can literally send you strace output from my MUA, where it pauses for 
ten seconds after it has written about 5kB (that's _kilobytes_) of data 
and does a 'fsync'.

That's the load that Ted worked on and has a solution for.

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 17:44               ` Linus Torvalds
@ 2009-04-04 18:00                 ` Trenton D. Adams
  2009-04-04 18:01                 ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Trenton D. Adams @ 2009-04-04 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

Hi Guys,

Sorry for this "quick" interruption.  Jens doesn't like the sqllite
problem.  Ted and Linus don't like the latency problem.  Why not make
that SYNC_ALL option a 1/0 option in /proc/sys/vm/sync_all or
something of the like?  That sort of change is pretty small, and quick
to make, right?  This would help, at least until the real problem is
found.

Sorry if I'm stepping out of bounds. :P

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 17:44               ` Linus Torvalds
  2009-04-04 18:00                 ` Trenton D. Adams
@ 2009-04-04 18:01                 ` Jens Axboe
  2009-04-04 18:10                   ` Linus Torvalds
  2009-04-04 23:22                   ` Theodore Tso
  1 sibling, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-04 18:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Linus Torvalds wrote:
> 
> 
> On Sat, 4 Apr 2009, Jens Axboe wrote:
> > > 
> > > I'm sorry, but that fsync thing _is_ a real-world case, and it's the one 
> > > that a hell of a lot more people care about than some idiotic sqlite 
> > > throughput issue.
> > 
> > sqlite is just one case, I'm sure there are others. My point is that we
> > should make sure that we don't regress on the throughput side. It's a
> > trade off, we don't want throughput to fall through the floor either.
> 
> Jens, we _have_ regressed on the latency side. Everybody agrees. 

It appears so, yes.

> Also, I may be odd, but I really do think latency is more important than 
> throughput. When my disk has latencies in the sub-milliseconds, I simply 
> do not think it is _acceptable_ to have hickups that affect my workload in 
> human-visible terms. 

Not everyone has an Intel SSD. But yes, latency is definitely more
important than throughput. That's not the same as saying that throughput
doesn't matter, because it definitely does.

> You say sqlite might regress by 4-5x. But Ted's numbers improve latencies 
> by mor than that. I haven't re-created them yet myself (still reading 
> email), but the point is, 4-5x may sound bad to you, but turn it around: 
> the current latency situation is _really_ bad. If we can fix it, we 
> definitely should.

I haven't either. On monday I'll throw some testing and patches on the
boxes here. We can get the latency right, I want that as much as the
next guy. I just want to make sure it doesn't become too one-sided.

> > > Quite frankly, the fact that I can see _seconds_ of latencies with a 
> > > really good SSD is not acceptable. The fact that it is by design is even 
> > > less so.
> > 
> > Agree, multi-second latencies is not acceptable.
> 
> I can literally send you strace output from my MUA, where it pauses for 
> ten seconds after it has written about 5kB (that's _kilobytes_) of data 
> and does a 'fsync'.

Unless you make all journal writes sync, ext3 fsync will always suck big
time. But I get your point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 18:01                 ` Jens Axboe
@ 2009-04-04 18:10                   ` Linus Torvalds
  2009-04-04 23:22                   ` Theodore Tso
  1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 18:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Theodore Tso, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Jens Axboe wrote:
> 
> > Also, I may be odd, but I really do think latency is more important than 
> > throughput. When my disk has latencies in the sub-milliseconds, I simply 
> > do not think it is _acceptable_ to have hickups that affect my workload in 
> > human-visible terms. 
> 
> Not everyone has an Intel SSD.

The thing is, I know for a fact that the hickups were much worse _before_ 
I had the Intel SSD.

With the SSD, I have to work a lot more to get them. Back when I had 
regular rotational disks, I definitely did not need to do any 2GB file 
creation at all. Just standard "yum update" would make my mail reader 
pause.

Yeah, it's just for a short time, but it's very annoying. One second is 
quite noticeable when I type something, and it doesn't show up on the 
screen ("What? My Nehalem cannot keep up with my _typing_?"). 2-3 seconds 
it really feels bad. And by the time it takes 5+ seconds, I start moving 
my mouse around just to see whether the computer crashed.

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 15:16         ` Jens Axboe
  2009-04-04 15:57           ` Linus Torvalds
@ 2009-04-04 19:18           ` Theodore Tso
  2009-04-06  8:12             ` Jens Axboe
  1 sibling, 1 reply; 69+ messages in thread
From: Theodore Tso @ 2009-04-04 19:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04, 2009 at 05:16:50PM +0200, Jens Axboe wrote:
> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> write regressions (sqlite performance drops by a factor of 4-5). Do a
> git log on fs/buffer.c and see the original patch (which does what your
> patch does) and the later revert. 

You mean this revert, right?

commit 78f707bfc723552e8309b7c38a8d0cc51012e813
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Tue Feb 17 13:59:08 2009 +0100

    block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    
    The above commit added WRITE_SYNC and switched various places to using
    that for committing writes that will be waited upon immediately after
    submission. However, this causes a performance regression with AS and CFQ
    for ext3 at least, since sync_dirty_buffer() will submit some writes with
    WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
    flag set. This causes excessive anticipation/idling in the IO scheduler
    because sync and async writes get interleaved, causing a big performance
    regression for the below test case (which is meant to simulate sqlite
    like behaviour)....

OK, let me test things out first, but note that with the changes that
Linus has already accepted, this may not be an issue --- since we've
now fixed ext3 to submit those dependent writes with the SYNC flag
now.  So I'm not sure the performance regression still applies, but
I'll test using the test case supplied in the rest of the commit log
and get back to you.

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 17:34             ` Jens Axboe
  2009-04-04 17:44               ` Linus Torvalds
@ 2009-04-04 20:18               ` Ingo Molnar
  2009-04-06 21:50                 ` Lennart Sorensen
  2009-04-04 20:56               ` Arjan van de Ven
  2 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2009-04-04 20:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List


* Jens Axboe <jens.axboe@oracle.com> wrote:

> > Quite frankly, the fact that I can see _seconds_ of latencies 
> > with a really good SSD is not acceptable. The fact that it is by 
> > design is even less so.
> 
> Agree, multi-second latencies is not acceptable.

btw., just to insert some hard data: usability studies place the 
acceptance threshold for delays to around 300 milliseconds.

If the computer does not 'respond' (in a real or a fake, visible or 
audible way) to their input within 0.3 seconds they get annoyed 
emotionally.

It does not matter how complex it is for the kernel to solve this 
problem, it does not matter how far away and difficult to access the 
data is. If we are not absolutely latency centric in Linux, if we 
spuriously delay apps that do supposedly simple-looking things the 
user _will_ get annoyed and _will_ pick another OS.

All things considered it is certainly a combined kernel and app 
problem space to get there, but not being completely aware of its 
importance in the kernel kills any chance of us ever having a 
long-term solution.

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 17:34             ` Jens Axboe
  2009-04-04 17:44               ` Linus Torvalds
  2009-04-04 20:18               ` Ingo Molnar
@ 2009-04-04 20:56               ` Arjan van de Ven
  2009-04-06  7:06                 ` Jens Axboe
  2 siblings, 1 reply; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-04 20:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, 4 Apr 2009 19:34:12 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> > Latency is more important than throughput. It's that simple.
> 
> It's really not that simple, otherwise the schedulers would be much
> simpler. It's pretty easy to get good latency if you disregard any
> throughput concerns, 

I'd be very interested in a scheduler like that.....
How much work would it be to make it ?

(if nothing else it would be a good number to have "should be within
50% of the perfect one for the tradeoff")


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 13:57       ` Theodore Tso
  2009-04-04 15:16         ` Jens Axboe
@ 2009-04-04 22:13         ` Linus Torvalds
  2009-04-04 22:19           ` Linus Torvalds
  2009-04-05  0:20           ` Theodore Tso
  1 sibling, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 22:13 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jens Axboe, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Theodore Tso wrote:
> 
> Using ext3 data=writeback, your "write big (2GB) file and sync" as the
> background workload, and fsync-tester, I was able to reduce the
> latency down to under 1 second...

Hmm. I can certainly see a very noticeable improvement.

I could still see 1-2s fsync() delays for my MUA test (and seem to have 
seen a 4s one once too). But most were all 'quite noticeable stutters', 
and very seldom do they go into the 'ooh, that's really painful' stage.

And yes, anticipatory seems to be quite noticeably better than cfq here. 
With cfq I got a few two-second delays on 'ftruncate()' too (probably 
because of your new serialization code?), and the longest fsync() delay 
was over 7 seconds. That was definitely solidly in the "painful" category.

(Jens - my test-case is not the exact same fsycn() test that Ted uses: 
it's really just me being in my MUA editing an email, and doing that 

	while : ; do time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync; rm bigfile"; done

in another window. My MUA does these save-files ever minute or so, and 
does a 'fsync()' after writing that (small) file. If the fsync() takes 
more than half a second, I can definitely notice. If it takes 1-2 seconds, 
it's a big stutter, where keyboard auto-repeat when moving around really 
hurts (ie I don't see how it moves).

If it takes 5+ seconds, it feels really bad, and as if the whole email 
client had just died. 

Yeah, it's a nasty test. 

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 22:13         ` Linus Torvalds
@ 2009-04-04 22:19           ` Linus Torvalds
  2009-04-05  0:20           ` Theodore Tso
  1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-04 22:19 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jens Axboe, Linux Kernel Developers List, Ext4 Developers List



On Sat, 4 Apr 2009, Linus Torvalds wrote:
> 
> Hmm. I can certainly see a very noticeable improvement.

.. and btw, just to put that into perspective - I ended up running with 
that repeated 'dd+sync+rm' in the background for quite a while, and the 
machine was in no way unusable. I don't think I saw a single "ooh, that 
feels really dead" case with anticipatory, although I saw several "bad 
stutter" cases.

Of course, with the whole "overwrite" case, things are really much 
smoother, and I can hardly notice anything happening at all. Although I 
_do_ get an occasional 200-600ms fsync hickup, and I can tell those very 
clearly. But if that was the worst I'd ever see, I'd be a very happy 
person. It's a good goal to have, I suspect.

		Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 18:01                 ` Jens Axboe
  2009-04-04 18:10                   ` Linus Torvalds
@ 2009-04-04 23:22                   ` Theodore Tso
  2009-04-04 23:33                     ` Arjan van de Ven
  2009-04-06  6:15                     ` Jens Axboe
  1 sibling, 2 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-04 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04, 2009 at 08:01:08PM +0200, Jens Axboe wrote:
> 
> Unless you make all journal writes sync, ext3 fsync will always suck big
> time. But I get your point.
> 

We have already made all other journal writes sync when triggered by
fsync() --- except for the commit record which is being written by
sync_dirty_buffer().  I've verified this via blktrace.

However, you're right.  We do have an performance regression using the
regression test provided in commit 78f707bf.  Taking the average of at
least 5 runs, I found:

stock 2.6.29				 1687 ms
2.6.29 w/ ext3-latency-fixes		 7337 ms
2.6.29 w/ full-latency-fixes		13722 ms

"ext3-latency-fixes" are the patches which Linus has already pulled
into the 2.6.29 tree.  "full-latency-fixes" are ext3-latency-fixes
plus the sync_dirty_buffes() patch.

Looking at blktrace of stock 2.6.29 running the sqlite performance
measurement, we see this:

254,2    1       13     0.005120554  7183  Q   W 199720 + 8 [sqlite-fsync-te]
254,2    1       15     0.005666573  6947  Q   W 10277200 + 8 [kjournald]
254,2    1       16     0.005691576  6947  Q   W 10277208 + 8 [kjournald]
254,2    1       18     0.006145684  6947  Q   W 10277216 + 8 [kjournald]
254,2    1       21     0.006685348  7183  Q   W 199720 + 8 [sqlite-fsync-te]
254,2    1       24     0.007162644  6947  Q   W 10277224 + 8 [kjournald]
254,2    1       25     0.007187857  6947  Q   W 10277232 + 8 [kjournald]
254,2    0       27     0.007616473  6947  Q   W 10277240 + 8 [kjournald]

Looking at a blktrace of 2.6.29 plus the full-latency-fixes, we see this:

254,2    0       13     0.013744556  7205  Q  WS 199208 + 8 [sqlite-fsync-te]
254,2    0       16     0.019270748  6965  Q  WS 10301896 + 8 [kjournald]
254,2    0       17     0.019400024  6965  Q  WS 10301904 + 8 [kjournald]
254,2    1       23     0.019892824  6965  Q  WS 10301912 + 8 [kjournald]
254,2    0       20     0.020450995  7205  Q  WS 199208 + 8 [sqlite-fsync-te]
254,2    1       26     0.025954909  6965  Q  WS 10301920 + 8 [kjournald]
254,2    1       27     0.026084814  6965  Q  WS 10301928 + 8 [kjournald]
254,2    0       24     0.026612884  6965  Q  WS 10301936 + 8 [kjournald]

Looking at the deltas between the two iterations of the sqlite
analogue, we see that stock 2.6.29 is 1.56ms, where as with the full
latency fixes, it's 6.70ms. 

However, the full latency fixes all the writes are synchronous, so it
must be the case that the delays are caused by the fact that queue is
getting implicitly unplugged after the synchronous write, and the
problem is no longer the mixing of WRITE and WRITE_SYNC requests as
posted in the commit log for 78f707bf.  If we remove the automatic
unplug for WRITE_SYNC requests, and add an explicit unplug where it is
needed, that should fix the performance regression for this particular
sqlite test case.  (Which isn't a throughput issue, since the test is
basically fopen/write/fsync/fclose over and over again, with no
background load.)

The scenario which Linus and I had been focused on is one where there
was a heavy background load writing asynchronously.  We do want to
make sure that a series of fsync() calls in a tight loop is also
reasonable as well, so Jens, I do think you are absolutely right that
this is something that we need to pay attention to.

I had missed the commit 78f707bf when you originally submitted it, so
I didn't do this test before submitting the patch.  And I guess you
had missed my patch proposal to LKML, and I didn't think to explicitly
CC you on my patches.  Apologies for the communication faults, but
hopefully we can fix this performance issue for both cases and get
these problems behind us.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 23:22                   ` Theodore Tso
@ 2009-04-04 23:33                     ` Arjan van de Ven
  2009-04-05  0:10                       ` Theodore Tso
  2009-04-06  8:16                       ` Jens Axboe
  2009-04-06  6:15                     ` Jens Axboe
  1 sibling, 2 replies; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-04 23:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jens Axboe, Linus Torvalds, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, 4 Apr 2009 19:22:22 -0400
Theodore Tso <tytso@mit.edu> wrote:
> 
> However, the full latency fixes all the writes are synchronous, so it
> must be the case that the delays are caused by the fact that queue is
> getting implicitly unplugged after the synchronous write, and the
> problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> posted in the commit log for 78f707bf.  If we remove the automatic
> unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> needed, that should fix the performance regression for this particular
> sqlite test case. 

removing the unplug is bound to be bad; after all we're waiting on the
IO. But maybe it should be "make the unplug a REALLY short time".
At least for rotating storage. For non-rotating .. I'd never wait.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 23:33                     ` Arjan van de Ven
@ 2009-04-05  0:10                       ` Theodore Tso
  2009-04-05 15:05                         ` Arjan van de Ven
  2009-04-05 17:01                         ` Linus Torvalds
  2009-04-06  8:16                       ` Jens Axboe
  1 sibling, 2 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-05  0:10 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jens Axboe, Linus Torvalds, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, Apr 04, 2009 at 04:33:49PM -0700, Arjan van de Ven wrote:
> > However, the full latency fixes all the writes are synchronous, so it
> > must be the case that the delays are caused by the fact that queue is
> > getting implicitly unplugged after the synchronous write, and the
> > problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> > posted in the commit log for 78f707bf.  If we remove the automatic
> > unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> > needed, that should fix the performance regression for this particular
> > sqlite test case. 
> 
> removing the unplug is bound to be bad; after all we're waiting on the
> IO. But maybe it should be "make the unplug a REALLY short time".
> At least for rotating storage. For non-rotating .. I'd never wait.

Ext3 needs to submit a large number of blocks for writing with
WRITE_SYNC priority, without unplugging the queue, until they are all
submitted.  Then we want to let things rip.  (Hence the "add an
explicit unplug where it is needed".)

It may be that it's easier from an less-lines-of-the-kernels-to-change
point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
and keep WRITE_SNYC as having the implicit unplug.  Or maybe it would
be better to completely separate the "send a write which is marked as
SYNC" from the "implicit unplug" in the API.

Jens, do you have an opinion/preference?

					- Ted



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 22:13         ` Linus Torvalds
  2009-04-04 22:19           ` Linus Torvalds
@ 2009-04-05  0:20           ` Theodore Tso
  1 sibling, 0 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-05  0:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04, 2009 at 03:13:13PM -0700, Linus Torvalds wrote:
> And yes, anticipatory seems to be quite noticeably better than cfq here. 
> With cfq I got a few two-second delays on 'ftruncate()' too (probably 
> because of your new serialization code?), and the longest fsync() delay 
> was over 7 seconds. That was definitely solidly in the "painful" category.

What is going on with in your "ftruncate()" case?  The synchronization
code I added will do call filemap_flush() on close if the file
descriptor had been previously truncated down to zero, either because
it was opened with O_TRUNCATE, or if ftruncate(fd, 0) was explicitly
called.  But it won't actually call fsync() or do anything special on
the actual ftrucate() call; it just sets a flag indicating that the
file in question should be flushed on close.

This is to make the right thing happen for applications which try to
edit a file in place via:

     fd = open("foo", O_RDWR);
     len = read(fd, buf, MAXBUF);
     <modify buf>
     ftruncate(fd, 0);
     write(fd, buf, len);
     close(fd);

Otherwise, given the lack of fsync(fd) in the above sequence, a crash
may leave he file "foo" truncated or only partially written out.

        	     	  		     - Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05  0:10                       ` Theodore Tso
@ 2009-04-05 15:05                         ` Arjan van de Ven
  2009-04-05 17:01                         ` Linus Torvalds
  1 sibling, 0 replies; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-05 15:05 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jens Axboe, Linus Torvalds, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, 4 Apr 2009 20:10:05 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Sat, Apr 04, 2009 at 04:33:49PM -0700, Arjan van de Ven wrote:
> > > However, the full latency fixes all the writes are synchronous,
> > > so it must be the case that the delays are caused by the fact
> > > that queue is getting implicitly unplugged after the synchronous
> > > write, and the problem is no longer the mixing of WRITE and
> > > WRITE_SYNC requests as posted in the commit log for 78f707bf.  If
> > > we remove the automatic unplug for WRITE_SYNC requests, and add
> > > an explicit unplug where it is needed, that should fix the
> > > performance regression for this particular sqlite test case. 
> > 
> > removing the unplug is bound to be bad; after all we're waiting on
> > the IO. But maybe it should be "make the unplug a REALLY short
> > time". At least for rotating storage. For non-rotating .. I'd never
> > wait.
> 
> Ext3 needs to submit a large number of blocks for writing with
> WRITE_SYNC priority, without unplugging the queue, until they are all
> submitted.  Then we want to let things rip.  (Hence the "add an
> explicit unplug where it is needed".)
> 
> It may be that it's easier from an less-lines-of-the-kernels-to-change
> point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> and keep WRITE_SNYC as having the implicit unplug.  Or maybe it would
> be better to completely separate the "send a write which is marked as
> SYNC" from the "implicit unplug" in the API.

I think there's actually 3 cases:
* Normal write with normal plugging rules
* Write that should be "sync" priority, but which should explicitly NOT
  unplug, even if it otherwise would have happened, because the caller
  will, as part of the contract of the API, do the unplug when all IO
  is submitted.
  (a bit like tcp/cork)
* Write that is sync priority and is the last one of a sequence,
  and thus should unplug immediately.

I can even imagine using a temporary/special queue for the 2nd case,
and then splice that into the regular queue when the final IO comes in,
in one go.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05  0:10                       ` Theodore Tso
  2009-04-05 15:05                         ` Arjan van de Ven
@ 2009-04-05 17:01                         ` Linus Torvalds
  2009-04-05 17:15                           ` Mark Lord
                                             ` (3 more replies)
  1 sibling, 4 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-05 17:01 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Arjan van de Ven, Jens Axboe, Linux Kernel Developers List,
	Ext4 Developers List



On Sat, 4 Apr 2009, Theodore Tso wrote:
> 
> It may be that it's easier from an less-lines-of-the-kernels-to-change
> point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> and keep WRITE_SNYC as having the implicit unplug.  Or maybe it would
> be better to completely separate the "send a write which is marked as
> SYNC" from the "implicit unplug" in the API.

Well, the block layer internally already separates the two, it's just 
WRITE_SYNC that ties them together. So a trivial patch like the appended 
would make WRITE_SYNC just mean "this is synchronous IO" without the 
unplugging, and then WRITE_UNPLUG would mark it both as synchronous _and_ 
unplug it.

(Or you could do the WRITE_SYNC_PLUGGED model too, but I think WRITE_SYNC 
and WRITE_UNPLUG actually would be better names, since they talk about the 
effects more directly).

The question is more of a "who really wants the unplugging". Presumably 
the direct-io code-path really does (on the assumption that if you are 
doing direct-io, the caller had better done all the coalescing it wants).

Non-rotational media would tend to want the unplugging, but the block 
layer already does that (ie in __make_request() we already do:

        if (unplug || blk_queue_nonrot(q))
                __generic_unplug_device(q);

so non-rotational devices get unplugged whether the request was a 
unplugging request or not).

Of course, different IO schedulers react differently to that whole "sync 
vs unplug" thing. I think CFQ is the only one that actually cares about 
the "sync" bit (using different queues for sync vs async). The other 
schedulers only care about the plugging. So the patch below really doesn't 
make much sense as-is, because as things are right now, the scheduler 
behaviors are so different for the unplug-vs-sync thing that no sane user 
can ever know whether they should use WRITE_SYNC (== higher priority 
queueing for CFQ, no-op for others) or WRITE_UNPLUG (unplug on all, and 
additionally higher priority for CFQ).

So I'm not serious about this patch, because as things are right now, I 
don't think it's sensible, but I do think that this just generally shows 
the confusion of what we should do. When is plugging right, when isn't it?

Also, one of the issues seems to literally be that the higher-level 
request handling doesn't care AT ALL about the priority. Allocating the 
request itself does keep reads and writes separated, but if the request is 
a SYNCIO request, and non-sync writes have filled up th write requests, 
we'll have to wait for the background writes to free up requests.

That is quite possibly the longest wait we have in the system.

See get_request():

	const int rw = rw_flags & 0x01;

(That should _really_ be RW_MASK instead of 0x01, I suspect) and how the 
request allocation itself is done.

I think this is broken.

			Linus

---
 include/linux/fs.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a09e17c..a144377 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -95,7 +95,8 @@ struct inodes_stat_t {
 #define SWRITE 3	/* for ll_rw_block() - wait for buffer lock */
 #define READ_SYNC	(READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
 #define READ_META	(READ | (1 << BIO_RW_META))
-#define WRITE_SYNC	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
+#define WRITE_SYNC	(WRITE | (1 << BIO_RW_SYNCIO))
+#define WRITE_UNPLUG	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
 #define SWRITE_SYNC	(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
 #define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))
 #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:01                         ` Linus Torvalds
@ 2009-04-05 17:15                           ` Mark Lord
  2009-04-05 20:57                             ` Jeff Garzik
  2009-04-06  8:13                             ` [GIT PULL] Ext3 latency fixes Jens Axboe
  2009-04-05 18:56                           ` Arjan van de Ven
                                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 69+ messages in thread
From: Mark Lord @ 2009-04-05 17:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Arjan van de Ven, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List

Linus Torvalds wrote:
>
> Non-rotational media would tend to want the unplugging, but the block 
> layer already does that (ie in __make_request() we already do:
> 
>         if (unplug || blk_queue_nonrot(q))
>                 __generic_unplug_device(q);
> 
> so non-rotational devices get unplugged whether the request was a 
> unplugging request or not).
..

Mmm.. except that Jeff's SSD doesn't appear to identify itself
as a non-rotating media.  I suppose there's a way to "quirk" that
somewhere, though.

:/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:01                         ` Linus Torvalds
  2009-04-05 17:15                           ` Mark Lord
@ 2009-04-05 18:56                           ` Arjan van de Ven
  2009-04-05 19:34                             ` Linus Torvalds
  2009-04-06  6:05                           ` Theodore Tso
  2009-04-06  6:23                           ` Jens Axboe
  3 siblings, 1 reply; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-05 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Jens Axboe, Linux Kernel Developers List,
	Ext4 Developers List

On Sun, 5 Apr 2009 10:01:06 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Also, one of the issues seems to literally be that the higher-level 
> request handling doesn't care AT ALL about the priority. Allocating
> the request itself does keep reads and writes separated, but if the
> request is a SYNCIO request, and non-sync writes have filled up th
> write requests, we'll have to wait for the background writes to free
> up requests.
> 
> That is quite possibly the longest wait we have in the system.

it often is; latencytop tends to point that out.
> 
> See get_request():

our default number of requests is so low that we very regularly hit the
limit. In addition to setting kjournald to higher priority, I tend to
set the number of requests to 4096 or so to improve interactive
performance on my own systems. That way at least the elevator has a
chance to see the requests ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 18:56                           ` Arjan van de Ven
@ 2009-04-05 19:34                             ` Linus Torvalds
  2009-04-05 20:06                               ` Arjan van de Ven
  2009-04-06  6:25                               ` Jens Axboe
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-05 19:34 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Theodore Tso, Jens Axboe, Linux Kernel Developers List,
	Ext4 Developers List



On Sun, 5 Apr 2009, Arjan van de Ven wrote:
>
> > See get_request():
> 
> our default number of requests is so low that we very regularly hit the
> limit. In addition to setting kjournald to higher priority, I tend to
> set the number of requests to 4096 or so to improve interactive
> performance on my own systems. That way at least the elevator has a
> chance to see the requests ;-)

That's insane. Long queues make the problem harder to hit, yes. But it 
also tends to make the problem them a million times worse when you _do_ 
hit it.

I would suggest looking instead at trying to have separate allocation 
pools for bulk and "sync" IO. Instead of having just one rq->rq_pool, we 
could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.

We might even _save_ memory by having two pools simply because that may 
make it much less important to have a big pool. Most subsystems don't 
really need that many requests in flight anyway, and the advantage to the 
elevator of huge pools is rather dubious.

So you obviously need more requests than the hardware can have in flight 
(since you want to be able to feed the hardware new requests and overlap 
the refill with the ones executing), but 4096 sounds excessive if you're 
doing something like SATA that can only have 32 actual outstanding 
requests at the hardware.

But yes, if a synchronous request gets blocked just because we've already 
used all the requests, latency will be suffer. 

			Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 19:34                             ` Linus Torvalds
@ 2009-04-05 20:06                               ` Arjan van de Ven
  2009-04-06  6:25                               ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-05 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Jens Axboe, Linux Kernel Developers List,
	Ext4 Developers List

On Sun, 5 Apr 2009 12:34:32 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sun, 5 Apr 2009, Arjan van de Ven wrote:
> >
> > > See get_request():
> > 
> > our default number of requests is so low that we very regularly hit
> > the limit. In addition to setting kjournald to higher priority, I
> > tend to set the number of requests to 4096 or so to improve
> > interactive performance on my own systems. That way at least the
> > elevator has a chance to see the requests ;-)
> 
> That's insane. 

4096 is an absolutely insane value that hides some of the problem

> Long queues make the problem harder to hit, yes. But
> it also tends to make the problem them a million times worse when you
> _do_ hit it.

There is a dilemma though. By not having the IO needs in a queue,
to some degree, they haven't gone away; they just are invisible.

Now there is also a throttling value in having these limits, to
slow down "regular" processes that would cause too much IO.
Except that we have the dirty limit for that in the VM, and except that
most actual IO is done by pdflush and other kernel threads, with the
dirtying of data asynchronous to that.

I would contend that for most common cases, not giving callers a request
immediately does not change or throttle the actual IO that is in want
of being sent to the device. All it does is reduce visibility of the IO
need so less grouping of adjacent and prioritization can be done by the
elevator.
 
> I would suggest looking instead at trying to have separate allocation 
> pools for bulk and "sync" IO. Instead of having just one rq->rq_pool,
> we could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.

Well that or have pools for a few buckets of priority level.
The risk of this is that someone like pdflush might get stuck on a low
priority queue, and thus cannot send the IO it might have wanted to
send into a higher priority queue. I fear that any such limits will in
general punish the wrong guy; after all number 129 is punished, not the
guy who put numbers 1 to 128 in the queue.

I wonder if it wouldn't be a better solution to give insight of the
queue length in use to pdflush, and have pdflush decide what kind of IO
to submit based on the length, rather than having it just block.

Just think of the sync() or fsync() cases.
The total amount of IO that those calls will cause is pretty much
fixed: the data that is "relevantly dirty" at the time of the call.
Holding things back at the request allocation level does not change
that, all it changes is that we delay merging requests that are
adjacent, sort on priority, etc.



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:15                           ` Mark Lord
@ 2009-04-05 20:57                             ` Jeff Garzik
  2009-04-05 23:48                               ` Arjan van de Ven
  2009-04-06  8:13                             ` [GIT PULL] Ext3 latency fixes Jens Axboe
  1 sibling, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2009-04-05 20:57 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Theodore Tso, Arjan van de Ven, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List

Mark Lord wrote:
> Linus Torvalds wrote:
>>
>> Non-rotational media would tend to want the unplugging, but the block 
>> layer already does that (ie in __make_request() we already do:
>>
>>         if (unplug || blk_queue_nonrot(q))
>>                 __generic_unplug_device(q);
>>
>> so non-rotational devices get unplugged whether the request was a 
>> unplugging request or not).
> ..
> 
> Mmm.. except that Jeff's SSD doesn't appear to identify itself
> as a non-rotating media.  I suppose there's a way to "quirk" that
> somewhere, though.

We set it in libata-scsi.c:ata_scsi_dev_config() based on ata_id_is_ssd()

That hueristic probably assumes Intel SSDs or something :/

	Jeff



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 20:57                             ` Jeff Garzik
@ 2009-04-05 23:48                               ` Arjan van de Ven
  2009-04-06  2:32                                 ` Mark Lord
  2009-04-06  5:47                                 ` Jeff Garzik
  0 siblings, 2 replies; 69+ messages in thread
From: Arjan van de Ven @ 2009-04-05 23:48 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Linus Torvalds, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List

On Sun, 05 Apr 2009 16:57:21 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> 
> We set it in libata-scsi.c:ata_scsi_dev_config() based on
> ata_id_is_ssd()
> 
> That hueristic probably assumes Intel SSDs or something :/

you mean the "rpm" set to '1' ?
I was pretty sure that that was industry standard...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 23:48                               ` Arjan van de Ven
@ 2009-04-06  2:32                                 ` Mark Lord
  2009-04-06  5:47                                 ` Jeff Garzik
  1 sibling, 0 replies; 69+ messages in thread
From: Mark Lord @ 2009-04-06  2:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List

Arjan van de Ven wrote:
> On Sun, 05 Apr 2009 16:57:21 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>> We set it in libata-scsi.c:ata_scsi_dev_config() based on
>> ata_id_is_ssd()
>>
>> That hueristic probably assumes Intel SSDs or something :/
> 
> you mean the "rpm" set to '1' ?
> I was pretty sure that that was industry standard...
.. 

Or even set to zero.  I just don't recall seeing the RPM
reported by hdparm-9.12 for Jeff's SSD.

Cheers

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 23:48                               ` Arjan van de Ven
  2009-04-06  2:32                                 ` Mark Lord
@ 2009-04-06  5:47                                 ` Jeff Garzik
  2009-04-07 18:18                                   ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2009-04-06  5:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Mark Lord, Linus Torvalds, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List

Arjan van de Ven wrote:
> On Sun, 05 Apr 2009 16:57:21 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>> We set it in libata-scsi.c:ata_scsi_dev_config() based on
>> ata_id_is_ssd()
>>
>> That hueristic probably assumes Intel SSDs or something :/
> 
> you mean the "rpm" set to '1' ?
> I was pretty sure that that was industry standard...

A -new- industry standard.  You can certainly create a compliant SSD 
while only conforming to ATA-7, for example.  Some older IDE flash 
devices pretend they are normal hard drives in almost every respect, too.

	Jeff




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:01                         ` Linus Torvalds
  2009-04-05 17:15                           ` Mark Lord
  2009-04-05 18:56                           ` Arjan van de Ven
@ 2009-04-06  6:05                           ` Theodore Tso
  2009-04-06  6:23                           ` Jens Axboe
  3 siblings, 0 replies; 69+ messages in thread
From: Theodore Tso @ 2009-04-06  6:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Jens Axboe, Linux Kernel Developers List,
	Ext4 Developers List

On Sun, Apr 05, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> Of course, different IO schedulers react differently to that whole "sync 
> vs unplug" thing. I think CFQ is the only one that actually cares about 
> the "sync" bit (using different queues for sync vs async). 

It looks liks AS and CFQ both care about the "sync" bit; they both use
rq_is_sync defined in include/lonux/blkdev.h.  Deadline apparently
only distinguishes between read and write requests, and not whether
they are considered synchronous or not.  The noop scheduler, obviously
also doesn't care.  :-)

> The other schedulers only care about the plugging. So the patch
> below really doesn't make much sense as-is, because as things are
> right now, the scheduler behaviors are so different for the
> unplug-vs-sync thing that no sane user can ever know whether they
> should use WRITE_SYNC (== higher priority queueing for CFQ, no-op
> for others) or WRITE_UNPLUG (unplug on all, and additionally higher
> priority for CFQ).

Well, if the deadline scheduler ignores the SYNC bit, it would still
make sense for it to only unplug the queue after for the commit block,
and not for any of the other writes to the journal.  Unplugging after
every synchronous write is going to lead to a performance problem,
which I've demonstrated using the fopen/fprintf/fsync/fclose scenario
that Jens pointed me at.  

						- Ted

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 23:22                   ` Theodore Tso
  2009-04-04 23:33                     ` Arjan van de Ven
@ 2009-04-06  6:15                     ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  6:15 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 08:01:08PM +0200, Jens Axboe wrote:
> > 
> > Unless you make all journal writes sync, ext3 fsync will always suck big
> > time. But I get your point.
> > 
> 
> We have already made all other journal writes sync when triggered by
> fsync() --- except for the commit record which is being written by
> sync_dirty_buffer().  I've verified this via blktrace.
> 
> However, you're right.  We do have an performance regression using the
> regression test provided in commit 78f707bf.  Taking the average of at
> least 5 runs, I found:
> 
> stock 2.6.29				 1687 ms
> 2.6.29 w/ ext3-latency-fixes		 7337 ms
> 2.6.29 w/ full-latency-fixes		13722 ms
> 
> "ext3-latency-fixes" are the patches which Linus has already pulled
> into the 2.6.29 tree.  "full-latency-fixes" are ext3-latency-fixes
> plus the sync_dirty_buffes() patch.

Ugh yes, not everyone will appreciate the better latency and for them an
8x slowdown is veeery painful.

> Looking at blktrace of stock 2.6.29 running the sqlite performance
> measurement, we see this:
> 
> 254,2    1       13     0.005120554  7183  Q   W 199720 + 8 [sqlite-fsync-te]
> 254,2    1       15     0.005666573  6947  Q   W 10277200 + 8 [kjournald]
> 254,2    1       16     0.005691576  6947  Q   W 10277208 + 8 [kjournald]
> 254,2    1       18     0.006145684  6947  Q   W 10277216 + 8 [kjournald]
> 254,2    1       21     0.006685348  7183  Q   W 199720 + 8 [sqlite-fsync-te]
> 254,2    1       24     0.007162644  6947  Q   W 10277224 + 8 [kjournald]
> 254,2    1       25     0.007187857  6947  Q   W 10277232 + 8 [kjournald]
> 254,2    0       27     0.007616473  6947  Q   W 10277240 + 8 [kjournald]
> 
> Looking at a blktrace of 2.6.29 plus the full-latency-fixes, we see this:
> 
> 254,2    0       13     0.013744556  7205  Q  WS 199208 + 8 [sqlite-fsync-te]
> 254,2    0       16     0.019270748  6965  Q  WS 10301896 + 8 [kjournald]
> 254,2    0       17     0.019400024  6965  Q  WS 10301904 + 8 [kjournald]
> 254,2    1       23     0.019892824  6965  Q  WS 10301912 + 8 [kjournald]
> 254,2    0       20     0.020450995  7205  Q  WS 199208 + 8 [sqlite-fsync-te]
> 254,2    1       26     0.025954909  6965  Q  WS 10301920 + 8 [kjournald]
> 254,2    1       27     0.026084814  6965  Q  WS 10301928 + 8 [kjournald]
> 254,2    0       24     0.026612884  6965  Q  WS 10301936 + 8 [kjournald]
> 
> Looking at the deltas between the two iterations of the sqlite
> analogue, we see that stock 2.6.29 is 1.56ms, where as with the full
> latency fixes, it's 6.70ms. 

It would be interesting with the full set of information, like unplug
and merge. The above definitely shows longer between queue-to-queue, I
wonder what that is due to.

> However, the full latency fixes all the writes are synchronous, so it
> must be the case that the delays are caused by the fact that queue is
> getting implicitly unplugged after the synchronous write, and the
> problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> posted in the commit log for 78f707bf.  If we remove the automatic
> unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> needed, that should fix the performance regression for this particular
> sqlite test case.  (Which isn't a throughput issue, since the test is
> basically fopen/write/fsync/fclose over and over again, with no
> background load.)

We definitely should add a WRITE_SYNC variant that doesn't unplug, for
the cases where you know you are going to submit more than one IO before
waiting. Whether that should just be WRITE_SYNC or we should add a
WRITE_SYNC_UNPLUG, I don't have a really strong preference either way.

But if the delays are due to premature unplugging, it should be visible
in the trace. Care to send the "full" thing?

> The scenario which Linus and I had been focused on is one where there
> was a heavy background load writing asynchronously.  We do want to
> make sure that a series of fsync() calls in a tight loop is also
> reasonable as well, so Jens, I do think you are absolutely right that
> this is something that we need to pay attention to.

Most definitely!

> I had missed the commit 78f707bf when you originally submitted it, so
> I didn't do this test before submitting the patch.  And I guess you
> had missed my patch proposal to LKML, and I didn't think to explicitly
> CC you on my patches.  Apologies for the communication faults, but
> hopefully we can fix this performance issue for both cases and get
> these problems behind us.

We still have a long cycle ahead of us, so I'm sure we can get it nailed
down before release.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:01                         ` Linus Torvalds
                                             ` (2 preceding siblings ...)
  2009-04-06  6:05                           ` Theodore Tso
@ 2009-04-06  6:23                           ` Jens Axboe
  3 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  6:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Arjan van de Ven, Linux Kernel Developers List,
	Ext4 Developers List

On Sun, Apr 05 2009, Linus Torvalds wrote:
> 
> 
> On Sat, 4 Apr 2009, Theodore Tso wrote:
> > 
> > It may be that it's easier from an less-lines-of-the-kernels-to-change
> > point of view to add a WRITE_SYNC_PLUGGED which doesn't do the unplug,
> > and keep WRITE_SNYC as having the implicit unplug.  Or maybe it would
> > be better to completely separate the "send a write which is marked as
> > SYNC" from the "implicit unplug" in the API.
> 
> Well, the block layer internally already separates the two, it's just 
> WRITE_SYNC that ties them together. So a trivial patch like the appended 
> would make WRITE_SYNC just mean "this is synchronous IO" without the 
> unplugging, and then WRITE_UNPLUG would mark it both as synchronous _and_ 
> unplug it.
> 
> (Or you could do the WRITE_SYNC_PLUGGED model too, but I think WRITE_SYNC 
> and WRITE_UNPLUG actually would be better names, since they talk about the 
> effects more directly).
> 
> The question is more of a "who really wants the unplugging". Presumably 
> the direct-io code-path really does (on the assumption that if you are 
> doing direct-io, the caller had better done all the coalescing it wants).
> 
> Non-rotational media would tend to want the unplugging, but the block 
> layer already does that (ie in __make_request() we already do:
> 
>         if (unplug || blk_queue_nonrot(q))
>                 __generic_unplug_device(q);
> 
> so non-rotational devices get unplugged whether the request was a 
> unplugging request or not).
> 
> Of course, different IO schedulers react differently to that whole "sync 
> vs unplug" thing. I think CFQ is the only one that actually cares about 
> the "sync" bit (using different queues for sync vs async). The other 
> schedulers only care about the plugging. So the patch below really doesn't 
> make much sense as-is, because as things are right now, the scheduler 
> behaviors are so different for the unplug-vs-sync thing that no sane user 
> can ever know whether they should use WRITE_SYNC (== higher priority 
> queueing for CFQ, no-op for others) or WRITE_UNPLUG (unplug on all, and 
> additionally higher priority for CFQ).

AS also very much cares about sync vs async. CFQ and AS both use that to
determine whether to wait for a new request from that io context, even
if we have other work to do. This is the logic that enables dependent
reads to proceed at a fast pace, instead of incurring a seek on every
read from a process reading lots of smaller files.

This could very well be what is causing a performance problem with the
writes. For most of these writes, you really don't want to idle at the
end even if they are SYNC. For O_DIRECT it's a win though, so we can't
just say 'never idle for sync writes'. Perhaps what we really need is a
specific WRITE_ODIRECT that signals sync and enables idling, while
keeping the WRITE_SYNC / WRITE_SYNC_UNPLUG like normal writes but just
at sync priority.

> So I'm not serious about this patch, because as things are right now, I 
> don't think it's sensible, but I do think that this just generally shows 
> the confusion of what we should do. When is plugging right, when isn't it?
> 
> Also, one of the issues seems to literally be that the higher-level 
> request handling doesn't care AT ALL about the priority. Allocating the 
> request itself does keep reads and writes separated, but if the request is 
> a SYNCIO request, and non-sync writes have filled up th write requests, 
> we'll have to wait for the background writes to free up requests.
> 
> That is quite possibly the longest wait we have in the system.
> 
> See get_request():
> 
> 	const int rw = rw_flags & 0x01;
> 
> (That should _really_ be RW_MASK instead of 0x01, I suspect) and how the 
> request allocation itself is done.
> 
> I think this is broken.

If we just make it (rw & REQ_RW_SYNC) then we get all sync requests from
the same pool, instead of doing the read vs write thing there. And I
agree, that makes sense.

> @@ -95,7 +95,8 @@ struct inodes_stat_t {
>  #define SWRITE 3	/* for ll_rw_block() - wait for buffer lock */
>  #define READ_SYNC	(READ | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
>  #define READ_META	(READ | (1 << BIO_RW_META))
> -#define WRITE_SYNC	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
> +#define WRITE_SYNC	(WRITE | (1 << BIO_RW_SYNCIO))
> +#define WRITE_UNPLUG	(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
>  #define SWRITE_SYNC	(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))
>  #define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))
>  #define DISCARD_NOBARRIER (1 << BIO_RW_DISCARD)

On the naming, after thinking a bit about it, I think it should be
WRITE_SYNC_PLUGGED and WRITE_SYNC. The reason being that if you use
WRITE_SYNC_PLUGGED, then you MUST unplug at some point in time. By
making that explicit in the name, we don't risk callers forgetting to
unplug and causing the unplug timer to do that work (and slowing things
down).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 19:34                             ` Linus Torvalds
  2009-04-05 20:06                               ` Arjan van de Ven
@ 2009-04-06  6:25                               ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  6:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

On Sun, Apr 05 2009, Linus Torvalds wrote:
> 
> 
> On Sun, 5 Apr 2009, Arjan van de Ven wrote:
> >
> > > See get_request():
> > 
> > our default number of requests is so low that we very regularly hit the
> > limit. In addition to setting kjournald to higher priority, I tend to
> > set the number of requests to 4096 or so to improve interactive
> > performance on my own systems. That way at least the elevator has a
> > chance to see the requests ;-)
> 
> That's insane. Long queues make the problem harder to hit, yes. But it 
> also tends to make the problem them a million times worse when you _do_ 
> hit it.

Unfortunately it is insane, because the vm will barf big time on that.
Perhaps when we do the "throttle at device IO rate" for dirtying of data
we can be more relaxed, but increasing nr_requests too much will just
cause the vm to go nuts.

> I would suggest looking instead at trying to have separate allocation 
> pools for bulk and "sync" IO. Instead of having just one rq->rq_pool, we 
> could easily have a rq->rq_bulk_pool and rq->rq_sync_pool.
> 
> We might even _save_ memory by having two pools simply because that may 
> make it much less important to have a big pool. Most subsystems don't 
> really need that many requests in flight anyway, and the advantage to the 
> elevator of huge pools is rather dubious.

The pools are not big, the number is just a maximum allocation number.
In reality we only store 4 requests in the mempool, and we should
probably just dump that down to 1.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 20:56               ` Arjan van de Ven
@ 2009-04-06  7:06                 ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  7:06 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, Apr 04 2009, Arjan van de Ven wrote:
> On Sat, 4 Apr 2009 19:34:12 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > Latency is more important than throughput. It's that simple.
> > 
> > It's really not that simple, otherwise the schedulers would be much
> > simpler. It's pretty easy to get good latency if you disregard any
> > throughput concerns, 
> 
> I'd be very interested in a scheduler like that.....
> How much work would it be to make it ?
> 
> (if nothing else it would be a good number to have "should be within
> 50% of the perfect one for the tradeoff")

It'd be pretty close to the first version of CFQ. The easiest would be
to add a cfq sysfs know that basically just switches a bunch of things
off in CFQ. Never idle, always dispatch only a single request at the
time, etc. At least for test purposes it would not be that hard. CFQ
doesn't export all of the settings that allow to make this possible
right now, otherwise it could just be done with a shell script.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 19:18           ` Theodore Tso
@ 2009-04-06  8:12             ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  8:12 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04 2009, Theodore Tso wrote:
> On Sat, Apr 04, 2009 at 05:16:50PM +0200, Jens Axboe wrote:
> > Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
> > write regressions (sqlite performance drops by a factor of 4-5). Do a
> > git log on fs/buffer.c and see the original patch (which does what your
> > patch does) and the later revert. 
> 
> You mean this revert, right?
> 
> commit 78f707bfc723552e8309b7c38a8d0cc51012e813
> Author: Jens Axboe <jens.axboe@oracle.com>
> Date:   Tue Feb 17 13:59:08 2009 +0100
> 
>     block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
>     
>     The above commit added WRITE_SYNC and switched various places to using
>     that for committing writes that will be waited upon immediately after
>     submission. However, this causes a performance regression with AS and CFQ
>     for ext3 at least, since sync_dirty_buffer() will submit some writes with
>     WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
>     flag set. This causes excessive anticipation/idling in the IO scheduler
>     because sync and async writes get interleaved, causing a big performance
>     regression for the below test case (which is meant to simulate sqlite
>     like behaviour)....
> 
> OK, let me test things out first, but note that with the changes that
> Linus has already accepted, this may not be an issue --- since we've
> now fixed ext3 to submit those dependent writes with the SYNC flag
> now.  So I'm not sure the performance regression still applies, but
> I'll test using the test case supplied in the rest of the commit log
> and get back to you.

Yep that's the one. I'll throw some testing together here, too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-05 17:15                           ` Mark Lord
  2009-04-05 20:57                             ` Jeff Garzik
@ 2009-04-06  8:13                             ` Jens Axboe
  1 sibling, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  8:13 UTC (permalink / raw)
  To: Mark Lord
  Cc: Linus Torvalds, Theodore Tso, Arjan van de Ven,
	Linux Kernel Developers List, Ext4 Developers List

On Sun, Apr 05 2009, Mark Lord wrote:
> Linus Torvalds wrote:
>>
>> Non-rotational media would tend to want the unplugging, but the block  
>> layer already does that (ie in __make_request() we already do:
>>
>>         if (unplug || blk_queue_nonrot(q))
>>                 __generic_unplug_device(q);
>>
>> so non-rotational devices get unplugged whether the request was a  
>> unplugging request or not).
> ..
>
> Mmm.. except that Jeff's SSD doesn't appear to identify itself
> as a non-rotating media.  I suppose there's a way to "quirk" that
> somewhere, though.

You can quirk it in userspace, just have it set the 'rotational' sysfs
file to 0 if it matches make/model/whatever.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 23:33                     ` Arjan van de Ven
  2009-04-05  0:10                       ` Theodore Tso
@ 2009-04-06  8:16                       ` Jens Axboe
  2009-04-06 14:48                         ` Linus Torvalds
  1 sibling, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2009-04-06  8:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Theodore Tso, Linus Torvalds, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, Apr 04 2009, Arjan van de Ven wrote:
> On Sat, 4 Apr 2009 19:22:22 -0400
> Theodore Tso <tytso@mit.edu> wrote:
> > 
> > However, the full latency fixes all the writes are synchronous, so it
> > must be the case that the delays are caused by the fact that queue is
> > getting implicitly unplugged after the synchronous write, and the
> > problem is no longer the mixing of WRITE and WRITE_SYNC requests as
> > posted in the commit log for 78f707bf.  If we remove the automatic
> > unplug for WRITE_SYNC requests, and add an explicit unplug where it is
> > needed, that should fix the performance regression for this particular
> > sqlite test case. 
> 
> removing the unplug is bound to be bad; after all we're waiting on the
> IO. But maybe it should be "make the unplug a REALLY short time".
> At least for rotating storage. For non-rotating .. I'd never wait.

Well, either you are submitting a single piece of IO (in which case you
just want to unplug or directly submit as part of the submit_bio()), or
you are submitting several IOS (in which case you just want to unplug at
the end of the IO submission, before waiting).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-06  8:16                       ` Jens Axboe
@ 2009-04-06 14:48                         ` Linus Torvalds
  2009-04-06 15:09                           ` Jens Axboe
  0 siblings, 1 reply; 69+ messages in thread
From: Linus Torvalds @ 2009-04-06 14:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Arjan van de Ven, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List



On Mon, 6 Apr 2009, Jens Axboe wrote:
> 
> Well, either you are submitting a single piece of IO (in which case you
> just want to unplug or directly submit as part of the submit_bio()), or
> you are submitting several IOS (in which case you just want to unplug at
> the end of the IO submission, before waiting).

That's not true.

The plugging is often across multiple threads. It didn't _use_ to be (we 
always unplugged at wait), but it is now. Nothing else explains why that 
patch by Ted makes such a big throughput thing, because the code did

      ret = submit_bh(WRITE_SYNC, bh);
      wait_on_buffer(bh);

ie it very much submits a _single_ IO, and waits on it. If plugging made a 
difference, that means that unplugging was delayed so long that somebody 
else does IO too - ie it gets delayed past a wait event.

So according to your own rules, that submit_bh() _should_ use WRITE_SYNC, 
but something bad happens if it does. I'm not quite seeing _what_, though, 
unless there are multiple processes trying to dirty the _same_ buffer, and 
they win if they all can dirty it without doing IO on it in between (and 
then the write turns into just one write).

		Linus


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-06 14:48                         ` Linus Torvalds
@ 2009-04-06 15:09                           ` Jens Axboe
  0 siblings, 0 replies; 69+ messages in thread
From: Jens Axboe @ 2009-04-06 15:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

On Mon, Apr 06 2009, Linus Torvalds wrote:
> 
> 
> On Mon, 6 Apr 2009, Jens Axboe wrote:
> > 
> > Well, either you are submitting a single piece of IO (in which case you
> > just want to unplug or directly submit as part of the submit_bio()), or
> > you are submitting several IOS (in which case you just want to unplug at
> > the end of the IO submission, before waiting).
> 
> That's not true.
> 
> The plugging is often across multiple threads. It didn't _use_ to be (we 

It is completely true, you will very rarely see merges between
processes. The plug may be across the entire device and it'll get you
some inter-process sorting as well if you have more than one app busy,
but for merging it's usually effectively per-process.

> always unplugged at wait), but it is now. Nothing else explains why that 
> patch by Ted makes such a big throughput thing, because the code did
> 
>       ret = submit_bh(WRITE_SYNC, bh);
>       wait_on_buffer(bh);

Linus, the implementation is still basically the same. Yes it's true
that you used to do

        submit();
        unplug(;)
        wait();

and you better still be doing that or things will run at the timer
unplug speed - not very fast. The only difference is that we hide the
unplug in the wait_on_bit() callback, but it's definitely still very
much the same thing.

> ie it very much submits a _single_ IO, and waits on it. If plugging made a 
> difference, that means that unplugging was delayed so long that somebody 
> else does IO too - ie it gets delayed past a wait event.
> 
> So according to your own rules, that submit_bh() _should_ use WRITE_SYNC, 
> but something bad happens if it does. I'm not quite seeing _what_, though, 
> unless there are multiple processes trying to dirty the _same_ buffer, and 
> they win if they all can dirty it without doing IO on it in between (and 
> then the write turns into just one write).

sync_dirty_buffer() should use it, I agree, I even did the original
patch for it. And in the series posted today, it's also there and it
performs as expected now. For this particular case, it's not about
plugging. The performance penalty came from the 'sync' modifier, it'll
change how the IO schedulers look at the IO. Both AS and CFQ would have
considerably worse performance with this, as you would be mixing related
IO into both sync and async buckets. So it wasn't merging (as suspected)
or plugging.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 20:18               ` Ingo Molnar
@ 2009-04-06 21:50                 ` Lennart Sorensen
  2009-04-07 13:31                   ` Mark Lord
  0 siblings, 1 reply; 69+ messages in thread
From: Lennart Sorensen @ 2009-04-06 21:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Linus Torvalds, Theodore Tso,
	Linux Kernel Developers List, Ext4 Developers List

On Sat, Apr 04, 2009 at 10:18:20PM +0200, Ingo Molnar wrote:
> btw., just to insert some hard data: usability studies place the 
> acceptance threshold for delays to around 300 milliseconds.
> 
> If the computer does not 'respond' (in a real or a fake, visible or 
> audible way) to their input within 0.3 seconds they get annoyed 
> emotionally.

You mean like hitting 'skip commercial' on the remote, and 2 minutes later
when the commercials are almost over, then it happens.  Yeah users get a
bit peeved at that.  Fortunately that particularly behaviour is somewhat
rare.  5 to 10 seconds is more common and sometimes it works fine.

> It does not matter how complex it is for the kernel to solve this 
> problem, it does not matter how far away and difficult to access the 
> data is. If we are not absolutely latency centric in Linux, if we 
> spuriously delay apps that do supposedly simple-looking things the 
> user _will_ get annoyed and _will_ pick another OS.

I would have to get _very_ annoyed before that happened.  I would have
to find a less annoying OS to switch to as well.

> All things considered it is certainly a combined kernel and app 
> problem space to get there, but not being completely aware of its 
> importance in the kernel kills any chance of us ever having a 
> long-term solution.

Certainly my problems are likely to a large extent mythtv's fault.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-06 21:50                 ` Lennart Sorensen
@ 2009-04-07 13:31                   ` Mark Lord
  2009-04-07 14:48                     ` Lennart Sorensen
  0 siblings, 1 reply; 69+ messages in thread
From: Mark Lord @ 2009-04-07 13:31 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Ingo Molnar, Jens Axboe, Linus Torvalds, Theodore Tso,
	Linux Kernel Developers List, Ext4 Developers List

Lennart Sorensen wrote:
> On Sat, Apr 04, 2009 at 10:18:20PM +0200, Ingo Molnar wrote:
>> btw., just to insert some hard data: usability studies place the 
>> acceptance threshold for delays to around 300 milliseconds.
>>
>> If the computer does not 'respond' (in a real or a fake, visible or 
>> audible way) to their input within 0.3 seconds they get annoyed 
>> emotionally.
> 
> You mean like hitting 'skip commercial' on the remote, and 2 minutes later
> when the commercials are almost over, then it happens.  Yeah users get a
> bit peeved at that.  Fortunately that particularly behaviour is somewhat
> rare.  5 to 10 seconds is more common and sometimes it works fine.
..

I generally don't see any delay at all when doing that,
even when the Mythtv system is very busy on multiple recordings.
But then, I also don't use the huge LIRC stack for the simple R/C either.

cheers

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-07 13:31                   ` Mark Lord
@ 2009-04-07 14:48                     ` Lennart Sorensen
  2009-04-07 19:21                       ` Mark Lord
  0 siblings, 1 reply; 69+ messages in thread
From: Lennart Sorensen @ 2009-04-07 14:48 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Jens Axboe, Linus Torvalds, Theodore Tso,
	Linux Kernel Developers List, Ext4 Developers List

On Tue, Apr 07, 2009 at 09:31:37AM -0400, Mark Lord wrote:
> I generally don't see any delay at all when doing that,
> even when the Mythtv system is very busy on multiple recordings.
> But then, I also don't use the huge LIRC stack for the simple R/C either.

But it happens with the keyboard too. :)

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-04 15:57           ` Linus Torvalds
  2009-04-04 16:06             ` Linus Torvalds
  2009-04-04 17:34             ` Jens Axboe
@ 2009-04-07 15:39             ` Indan Zupancic
  2 siblings, 0 replies; 69+ messages in thread
From: Indan Zupancic @ 2009-04-07 15:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Theodore Tso, Linux Kernel Developers List,
	Ext4 Developers List

On Sat, April 4, 2009 17:57, Linus Torvalds wrote:
>
> On Sat, 4 Apr 2009, Jens Axboe wrote:
>>
>> Big nack on this patch. Ted, this is EXACTLY where I told you we saw big
>> write regressions (sqlite performance drops by a factor of 4-5). Do a
>> git log on fs/buffer.c and see the original patch (which does what your
>> patch does) and the later revert. No idea why you are now suggestion
>> making that exact change?!
>
> Jens, if I can re-create the 'fsync' times (I haven't yet), then the
> default scheduler _will_ be switched to AS.

http://bugzilla.kernel.org/show_bug.cgi?id=5900

I can try it with a different fs, kernel or machine as well.

To me it seems this stuff is way more tricky than it should be, with
all the interaction between fs, vm, io schedulers and hardware.
Perhaps there's a need for one sophisticated io scheduler that gets
enough information from the rest to make the right decisions. Sync
or not sync isn't enough info.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-06  5:47                                 ` Jeff Garzik
@ 2009-04-07 18:18                                   ` Linus Torvalds
  2009-04-07 18:22                                     ` Linus Torvalds
  2009-04-07 19:40                                     ` [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes) Jeff Garzik
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-07 18:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Arjan van de Ven, Mark Lord, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List



On Mon, 6 Apr 2009, Jeff Garzik wrote:

> Arjan van de Ven wrote:
> > On Sun, 05 Apr 2009 16:57:21 -0400
> > Jeff Garzik <jeff@garzik.org> wrote:
> > > We set it in libata-scsi.c:ata_scsi_dev_config() based on
> > > ata_id_is_ssd()
> > > 
> > > That hueristic probably assumes Intel SSDs or something :/
> > 
> > you mean the "rpm" set to '1' ?
> > I was pretty sure that that was industry standard...
> 
> A -new- industry standard.  You can certainly create a compliant SSD while
> only conforming to ATA-7, for example.  Some older IDE flash devices pretend
> they are normal hard drives in almost every respect, too.

Something like this might be a good idea. 

I've seen several SSD's that do _not_ do that whole RPM == 1 thing, but 
they have "SSD" in their names. 

I forget how the ID is stored (I have this memory of it being big-endian 
16-bit words or something crazy like that?), but aside from fixing up that 
kind of crazyness, maybe something like this is worth it?

And making it non-inline, of course. And maybe it should use 'strstr()' 
instead of checking whether the name ends in 'SSD'. You get the idea..

		Linus

---
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -769,7 +769,14 @@ static inline int ata_id_is_cfa(const u16 *id)
 
 static inline int ata_id_is_ssd(const u16 *id)
 {
-	return id[ATA_ID_ROT_SPEED] == 0x01;
+	int len;
+	const char *model;
+
+	if (id[ATA_ID_ROT_SPEED] == 0x01)
+		return 1;
+	model = (const char *)&id[ATA_ID_PROD];
+	len = strnlen(model, ATA_ID_PROD_LEN);
+	return len > 3 && !memcmp(model+len-3, "SSD", 3);
 }
 
 static inline int ata_drive_40wire(const u16 *dev_id)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-07 18:18                                   ` Linus Torvalds
@ 2009-04-07 18:22                                     ` Linus Torvalds
  2009-04-07 19:40                                     ` [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes) Jeff Garzik
  1 sibling, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-07 18:22 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Arjan van de Ven, Mark Lord, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Ext4 Developers List



On Tue, 7 Apr 2009, Linus Torvalds wrote:
> 
> Something like this might be a good idea. 

And it clearly is "something like".

> I've seen several SSD's that do _not_ do that whole RPM == 1 thing, but 
> they have "SSD" in their names. 
> 
> I forget how the ID is stored (I have this memory of it being big-endian 
> 16-bit words or something crazy like that?), but aside from fixing up that 
> kind of crazyness, maybe something like this is worth it?

Yeah, just checked. My memory wasn't wrong, and that code can not possibly 
work. Not only isn't that whole field generally NUL-terminated at all, my 
recollection of odd 16-bit byte ordering was right.

So that patch is crap. 

But the _concept_ may be worth pursuing. 

			Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-07 14:48                     ` Lennart Sorensen
@ 2009-04-07 19:21                       ` Mark Lord
  2009-04-07 19:57                         ` Lennart Sorensen
  0 siblings, 1 reply; 69+ messages in thread
From: Mark Lord @ 2009-04-07 19:21 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Ingo Molnar, Jens Axboe, Linus Torvalds, Theodore Tso,
	Linux Kernel Developers List, Ext4 Developers List

Lennart Sorensen wrote:
> On Tue, Apr 07, 2009 at 09:31:37AM -0400, Mark Lord wrote:
>> I generally don't see any delay at all when doing that,
>> even when the Mythtv system is very busy on multiple recordings.
>> But then, I also don't use the huge LIRC stack for the simple R/C either.
> 
> But it happens with the keyboard too. :)
..

Mmm.. now that could be problem.

I wonder if mythfrontend is simply getting paged out of memory,
until you press a button and then the relevant code has to all
get paged back in again before it can respond (?).


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes)
  2009-04-07 18:18                                   ` Linus Torvalds
  2009-04-07 18:22                                     ` Linus Torvalds
@ 2009-04-07 19:40                                     ` Jeff Garzik
  2009-04-09 18:21                                       ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Jeff Garzik @ 2009-04-07 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Mark Lord, Theodore Tso, Jens Axboe,
	Linux Kernel Developers List, Linux IDE mailing list

[-- Attachment #1: Type: text/plain, Size: 1751 bytes --]

Linus Torvalds wrote:
> 
> On Mon, 6 Apr 2009, Jeff Garzik wrote:
> 
>> Arjan van de Ven wrote:
>>> On Sun, 05 Apr 2009 16:57:21 -0400
>>> Jeff Garzik <jeff@garzik.org> wrote:
>>>> We set it in libata-scsi.c:ata_scsi_dev_config() based on
>>>> ata_id_is_ssd()
>>>>
>>>> That hueristic probably assumes Intel SSDs or something :/
>>> you mean the "rpm" set to '1' ?
>>> I was pretty sure that that was industry standard...
>> A -new- industry standard.  You can certainly create a compliant SSD while
>> only conforming to ATA-7, for example.  Some older IDE flash devices pretend
>> they are normal hard drives in almost every respect, too.
> 
> Something like this might be a good idea. 
> 
> I've seen several SSD's that do _not_ do that whole RPM == 1 thing, but 
> they have "SSD" in their names. 
> 
> I forget how the ID is stored (I have this memory of it being big-endian 
> 16-bit words or something crazy like that?), but aside from fixing up that 
> kind of crazyness, maybe something like this is worth it?
> 
> And making it non-inline, of course. And maybe it should use 'strstr()' 
> instead of checking whether the name ends in 'SSD'. You get the idea..

ata_id_string() or ata_id_c_string() is what you want.

But yeah, we see what you're trying to illustrate.

For internal reasons, it is better to detect and set up SSD details in 
ata_dev_configure(), where we detect and configure other ATA details.

I've attached an example patch, compiled-tested only.

If we wanted to get more fancy, we could extend the strn_pattern_cmp() 
function in libata to accept wildcard '*' prefixes, as well as suffixes. 
  That would make it easy to auto-configure future ATA devices based on 
the product id (such as "G.SKILL 128GB SSD").

	Jeff



[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 1841 bytes --]

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e7ea77c..3043a61 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2411,6 +2411,8 @@ int ata_dev_configure(struct ata_device *dev)
 
 	/* ATA-specific feature tests */
 	if (dev->class == ATA_DEV_ATA) {
+		char *model_suffix;
+
 		if (ata_id_is_cfa(id)) {
 			if (id[162] & 1) /* CPRM may make this media unusable */
 				ata_dev_printk(dev, KERN_WARNING,
@@ -2438,6 +2440,13 @@ int ata_dev_configure(struct ata_device *dev)
 					dev->multi_count = cnt;
 		}
 
+		if (strlen(modelbuf) <= 3)
+			model_suffix = modelbuf;
+		else
+			model_suffix = modelbuf + (strlen(modelbuf) - 3);
+		if (ata_id_is_ssd(id) || !strcmp(model_suffix, "SSD"))
+			dev->flags |= ATA_DFLAG_NONROT;
+
 		if (ata_id_has_lba(id)) {
 			const char *lba_desc;
 			char ncq_desc[20];
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b9747fa..e597a4f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1097,7 +1097,7 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
 
 		blk_queue_dma_drain(q, atapi_drain_needed, buf, ATAPI_MAX_DRAIN);
 	} else {
-		if (ata_id_is_ssd(dev->id))
+		if (dev->flags & ATA_DFLAG_NONROT)
 			queue_flag_set_unlocked(QUEUE_FLAG_NONROT,
 						sdev->request_queue);
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index b450a26..a0fdbf0 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -147,6 +147,7 @@ enum {
 	ATA_DFLAG_SLEEPING	= (1 << 15), /* device is sleeping */
 	ATA_DFLAG_DUBIOUS_XFER	= (1 << 16), /* data transfer not verified */
 	ATA_DFLAG_NO_UNLOAD	= (1 << 17), /* device doesn't support unload */
+	ATA_DFLAG_NONROT	= (1 << 18), /* is non-rotational media, SSD */
 	ATA_DFLAG_INIT_MASK	= (1 << 24) - 1,
 
 	ATA_DFLAG_DETACH	= (1 << 24),

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-07 19:21                       ` Mark Lord
@ 2009-04-07 19:57                         ` Lennart Sorensen
  0 siblings, 0 replies; 69+ messages in thread
From: Lennart Sorensen @ 2009-04-07 19:57 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Jens Axboe, Linus Torvalds, Theodore Tso,
	Linux Kernel Developers List, Ext4 Developers List

On Tue, Apr 07, 2009 at 03:21:15PM -0400, Mark Lord wrote:
> Mmm.. now that could be problem.
>
> I wonder if mythfrontend is simply getting paged out of memory,
> until you press a button and then the relevant code has to all
> get paged back in again before it can respond (?).

While the video is playing?  Well it is possible I guess.  I haven't had
it happen in a while, so perhaps that was a mythtv bug that got fixed
since the keyboard worked fine if I switch to vt1 for example.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes)
  2009-04-07 19:40                                     ` [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes) Jeff Garzik
@ 2009-04-09 18:21                                       ` Tejun Heo
  2009-04-18  3:02                                         ` George Spelvin
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2009-04-09 18:21 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Arjan van de Ven, Mark Lord, Theodore Tso,
	Jens Axboe, Linux Kernel Developers List, Linux IDE mailing list,
	George Spelvin

Jeff Garzik wrote:
> ata_id_string() or ata_id_c_string() is what you want.
> 
> But yeah, we see what you're trying to illustrate.
> 
> For internal reasons, it is better to detect and set up SSD details in
> ata_dev_configure(), where we detect and configure other ATA details.
> 
> I've attached an example patch, compiled-tested only.
> 
> If we wanted to get more fancy, we could extend the strn_pattern_cmp()
> function in libata to accept wildcard '*' prefixes, as well as suffixes.
>  That would make it easy to auto-configure future ATA devices based on
> the product id (such as "G.SKILL 128GB SSD").

There was an shell globbing patch floating around which would be
pretty nice to have for pattern matchings like this.  cc'ing the patch
author.  George, what happened to the patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes)
  2009-04-09 18:21                                       ` Tejun Heo
@ 2009-04-18  3:02                                         ` George Spelvin
  0 siblings, 0 replies; 69+ messages in thread
From: George Spelvin @ 2009-04-18  3:02 UTC (permalink / raw)
  To: jeff, tj
  Cc: arjan, jens.axboe, linux-ide, linux-kernel, linux, lkml, torvalds, tytso

Tejun Heo wrote:
> Jeff Garzik wrote:
>> If we wanted to get more fancy, we could extend the strn_pattern_cmp()
>> function in libata to accept wildcard '*' prefixes, as well as suffixes.
>>  That would make it easy to auto-configure future ATA devices based on
>> the product id (such as "G.SKILL 128GB SSD").
>
> There was an shell globbing patch floating around which would be
> pretty nice to have for pattern matchings like this.  cc'ing the patch
> author.  George, what happened to the patch?

It wasn't a patch per se, but a simple pattern-match implementation
that could be dropped in wherever seems appropriate.  Appended below
(with test harness.)

Released to the public domain; have fun.



#include <stdbool.h>
#include <string.h>	/* For strchr */
#include <assert.h>

#define WARN_ON(x) assert(!(x))

#ifndef __pure
/* Kernel compatibility #define */
#define __pure __attribute__((pure))
#endif

/**
 * is_in_class - globmatch() helper, returns true if character is in class.
 * @c: Character to be tested.
 * @pat: Character class to test for inclusion in, terminated by ']'.
 *
 * Evaluate the "body" of a character class.  Given the class "[!a-z]",
 * this expects @pat to point to the a, and proceeds until the closing ].
 * The caller must skip the opening bracket and the complement character.
 *
 * The caller must verify that a terminating ] exists; this will NOT
 * stop at a trailing nul byte.  Also, the first character may be ];
 * that is not taken as a terminator.
 *
 * Comparison is done using unsigned bytes, 0...255.
 */
static bool __pure
is_in_class(unsigned char c, char const *pat)
{
	/*
	 * Iterate over each span in the character class.
	 * a span is either a single character x, or a
	 * range x-y.
	 */
	do {
		unsigned char c1 = *pat++, c2 = c1;

		if (pat[0] == '-' && pat[1] != ']') {
			c2 = pat[1];
			pat += 2;
		}
		/* Any special action if c1 > c2? */
		if (c1 <= c && c <= c2)
			return true;
	} while (*pat != ']');
	return false;
}

/**
 * globmatch - Shell-style pattern matching, like !fnmatch(pat, str, 0)
 * @pat: Pattern to match.  Metacharacters are ?, *, [ and \.
 * @str: String to match.  the pattern must match the entire string.
 * @maxdepth: Maximum number of (non-trailing) * permitted in pattern.
 *
 * Perform shell-style glob matching, returning true (1) if the match
 * succeeds, false (0) if it fails, and -1 if the recursion depth is
 * exceeded.
 * 
 * Like !fnmatch(@pat, @str, 0) and unlike the shell, this does NOT
 * treat / or leading . specially; it isn't actually used for pathnames.
 *
 * This is small and simple implementation intended for device blacklists
 * where a string is matched against a number of patterns.  Thus,
 * it does not proprocess the patterns.  Run-time is at most quadratic
 * in strlen(@str).
 */
static bool __pure
globmatch(char const *pat, char const *str)
{
	/*
	 * Backtrack to previous * on mismatch and retry starting one
	 * character later in the string.  It can be proved that there's
	 * never a need to backtrack multiple levels.
	 */
	char const *back_pat = 0, *back_str = back_str;
	/*
	 * Loop over each token (character or class) in pat, matching
	 * it against the remaining unmatched tail of str.  Return false
	 * on mismatch, or true after matching the trailing nul bytes.
	 */
	do {
		/*
		 * (To handle * properly, which may match zero bytes, each
		 * case is required to increment the str pointer itself.)
		 */
		switch (pat[0]) {
			char c;
		case '?':	/* Wildcard: anything but nul */
			if (!*str++)
				return false;
			break;
		case '*':	/* Any-length wildcard */
			/* This doesn't loop... */
			if (pat[1] == '\0')	/* Optimize trailing * case */
				return true;
			back_pat = pat;
			back_str = str;
			break;
		case '[': {	/* Character class */
			/* Is it an inverted character class? */
			bool inv = (pat[1] == '^' || pat[1] == '!');
			/* If no terminating ], interpret as plain text. */
			char const *end = strchr(pat+2+inv, ']');
			if (!end)
				goto def;	/* Or return -1 for malformed pattern? */
			pat += 1 + inv;
			c = *str++;
			if (is_in_class(c, pat) == inv)
				goto backtrack;
			pat = end;
			}
			break;
		case '\\':
			pat++;
			/* FALLLTHROUGH */
		default:	/* Literal character */
def:
			c = *str++;
			if (*pat == c)
				break;
backtrack:
			if (c == '\0' || !back_pat)
				return false;	/* No point continuing */
			/* Try again from last *, one character later in str. */
			pat = back_pat;
			str = ++back_str;
			break;
		}
	} while (*pat++);
	return true;
}

/* Test code */
#include <stdio.h>
#include <stdlib.h>

static void
test(char const *pat, char const *str, bool expected)
{
	bool match = globmatch(pat, str);
	printf("\"%s\" vs. \"%s\": %s %s\n", pat, str, match ? "match" : "mismatch",
		match == expected ? "OK" : "*** ERROR ***");
	if (match != expected)
		exit(1);
}

int
main(void)
{
	test("a", "a", true);
	test("a", "b", false);
	test("a", "aa", false);
	test("a", "", false);
	test("", "", true);
	test("", "a", false);
	test("?", "a", true);
	test("?", "aa", false);
	test("??", "a", false);
	test("?x?", "axb", true);
	test("?x?", "abx", false);
	test("?x?", "xab", false);
	test("*??", "a", false);
	test("*??", "ab", true);
	test("*??", "abc", true);
	test("??*", "a", false);
	test("??*", "ab", true);
	test("??*", "abc", true);
	test("?*?", "a", false);
	test("?*?", "ab", true);
	test("?*?", "abc", true);
	test("*b", "b", true);
	test("*b", "ab", true);
	test("*b", "aab", true);
	test("*bc", "abbc", true);
	test("*bc", "bc", true);
	test("*bc", "bbc", true);
	test("*ac*", "abacadaeafag", true);
	test("*ac*ae*ag*", "abacadaeafag", true);
	test("*a*b*[bc]*[ef]*g*", "abacadaeafag", true);
	test("*a*b*[ef]*[cd]*g*", "abacadaeafag", false);
	test("*abcd*", "abcabcabcabcdefg", true);
	test("*ab*cd*", "abcabcabcabcdefg", true);
	test("*abcd*abcdef*", "abcabcdabcdeabcdefg", true);
	test("*abcd*", "abcabcabcabcefg", false);
	test("*ab*cd*", "abcabcabcabcefg", false);
	test("[a]", "a", true);
	test("[^a]", "a", false);
	test("[!a]", "a", false);
	test("[!a]", "b", true);
	test("[ab]", "a", true);
	test("[ab]", "b", true);
	test("[ab]", "c", false);
	test("[^ab]", "c", true);
	test("[a-c]", "b", true);
	test("[a-c]", "d", false);
	test("[]a-ceg-ik[]", "a", true);
	test("[]a-ceg-ik[]", "]", true);
	test("[]a-ceg-ik[]", "[", true);
	test("[]a-ceg-ik[]", "h", true);
	test("[]a-ceg-ik[]", "f", false);
	test("[!]a-ceg-ik[]", "h", false);
	test("[!]a-ceg-ik[]", "]", false);
	test("[!]a-ceg-ik[]", "f", true);
	test("[!]a-ceg-ik[]", "f", true);

	return 0;
}

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-09 18:10       ` Chris Mason
@ 2009-04-09 19:04         ` Jan Kara
  0 siblings, 0 replies; 69+ messages in thread
From: Jan Kara @ 2009-04-09 19:04 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List,
	Ext4 Developers List

On Thu 09-04-09 14:10:03, Chris Mason wrote:
> On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > > 
> > > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > > 
> > > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > > which unplugged the write queue after every page write.  Now that Jens
> > > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > > to further improbve ext3 latency, especially during the "sync" command
> > > > > in Linus's write-big-file-and-sync workload.
> > > > 
> > > > So here's a question and a untested _conceptual_ patch. 
> > > > 
> > > > The kind of writeback mode I'd personally prefer would be more of a 
> > > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > > something of the best of both worlds. I'd like the data writeback to get 
> > > > _started_ when the journal is written to disk, but I'd like it to not 
> > > > block journal updates.
> > > > 
> > > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > > be totally unordered either.
> > > > 
> > > 
> > > I started working on the xfs style i_size updates last night, and here's
> > > my current (most definitely broken) proof of concept.  I call it
> > > data=guarded.
> > > 
> > > In guarded mode the on disk i_size is not updated until after the data
> > > writes are complete.  I've got a per FS work queue and I'm abusing
> > > bh->b_private as a list pointer.  So, what happens is:
> > > 
> > > * writepage sets up the buffer with the guarded end_io handler
> > > 
> > > * The end_io handler puts the buffer onto the per-sb list of guarded
> > > buffers and then it kicks the work queue
> > > 
> > > * The workqueue updates the on disk i_size to the min of the end of the
> > > buffer or the in-memory i_size, and then it logs the inode.
> > > 
> > > * Then the regular async bh end_io handler is called to end writeback on
> > > the page.
> > > 
> > > One big gotcha is that we starting a transaction while a page is
> > > writeback.  It means that anyone who waits for writeback to finish on
> > > the datapage with a transaction running could deadlock against the work
> > > queue func trying to start a transaction.
> >   For ext3 I don't think anyone waits for PageWriteback with a
> > transaction open. We definitely don't do it from ext3 code and generic
> > code does usually sequence like:
> >   lock_page(page);
> >   ...
> >   wait_on_page_writeback(page)
> > 
> >   and because lock ordering is page_lock < transaction start, we
> > shouldn't have transaction open at that point.
> >   But with ext4 it may be different - there, the lock ordering is
> > transaction start > page_lock and so above code could well have
> > transaction started.
> >   Wouldn't it actually be better to update i_size when the page is
> > fully written out after we clear PG_writeback as you write below?
> 
> It would, but then we have to take a ref on the inode and risk iput
> leading to inode deletion in the handler that is supposed to be doing IO
> completion.  It's icky either way ;)
  Yeah, I though something like this could happen... I had a similar
problem in JBD2 where kjournald could be the last to drop inode reference.
In the end I've solved it by waiting in clear_inode() until the commit code
is done with the inode (and the commit code does not hold any reference
against the inode).

> The nice part with doing it before writeback is that we know that when
> we wait for page writeback, we don't also have to wait for i_size update
> to be finished.
> 
> If we go this route and it gets copied to ext4, we can weigh our options
> I guess.
> 
> >   One thing which does not seem to be handled is that your code can
> > happily race with truncate. So IO completion could reset i_size which
> > has been just set by truncate. And I'm not sure how to handle this
> > effectively. Generally I'm not sure how much this is going to cost...
> > 
> 
> Yeah, I was worried about that.  What ends up happening is the setattr
> call sets the disk i_size and then calls inode_setattr, who calls
> vmtruncate who actually waits on the writeback.
> 
> Then, we wander into the ext3 truncate who resets disk i_size down
> again.  It's a pretty strange window of updates, but I was thinking it
> would work to cut down i_size, wait on IO, then cut it down again in
> setattr.
> 
> Once we wait on all IO past the new in-memory i_size, writepage won't
> send any more down.  So updating disk i_size after the wait should be
> enough.
  Yes, this should work. But it's a bit nasty...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-09 17:49     ` Jan Kara
@ 2009-04-09 18:10       ` Chris Mason
  2009-04-09 19:04         ` Jan Kara
  0 siblings, 1 reply; 69+ messages in thread
From: Chris Mason @ 2009-04-09 18:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List,
	Ext4 Developers List

On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > 
> > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > 
> > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > which unplugged the write queue after every page write.  Now that Jens
> > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > to further improbve ext3 latency, especially during the "sync" command
> > > > in Linus's write-big-file-and-sync workload.
> > > 
> > > So here's a question and a untested _conceptual_ patch. 
> > > 
> > > The kind of writeback mode I'd personally prefer would be more of a 
> > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > something of the best of both worlds. I'd like the data writeback to get 
> > > _started_ when the journal is written to disk, but I'd like it to not 
> > > block journal updates.
> > > 
> > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > be totally unordered either.
> > > 
> > 
> > I started working on the xfs style i_size updates last night, and here's
> > my current (most definitely broken) proof of concept.  I call it
> > data=guarded.
> > 
> > In guarded mode the on disk i_size is not updated until after the data
> > writes are complete.  I've got a per FS work queue and I'm abusing
> > bh->b_private as a list pointer.  So, what happens is:
> > 
> > * writepage sets up the buffer with the guarded end_io handler
> > 
> > * The end_io handler puts the buffer onto the per-sb list of guarded
> > buffers and then it kicks the work queue
> > 
> > * The workqueue updates the on disk i_size to the min of the end of the
> > buffer or the in-memory i_size, and then it logs the inode.
> > 
> > * Then the regular async bh end_io handler is called to end writeback on
> > the page.
> > 
> > One big gotcha is that we starting a transaction while a page is
> > writeback.  It means that anyone who waits for writeback to finish on
> > the datapage with a transaction running could deadlock against the work
> > queue func trying to start a transaction.
>   For ext3 I don't think anyone waits for PageWriteback with a
> transaction open. We definitely don't do it from ext3 code and generic
> code does usually sequence like:
>   lock_page(page);
>   ...
>   wait_on_page_writeback(page)
> 
>   and because lock ordering is page_lock < transaction start, we
> shouldn't have transaction open at that point.
>   But with ext4 it may be different - there, the lock ordering is
> transaction start > page_lock and so above code could well have
> transaction started.
>   Wouldn't it actually be better to update i_size when the page is
> fully written out after we clear PG_writeback as you write below?

It would, but then we have to take a ref on the inode and risk iput
leading to inode deletion in the handler that is supposed to be doing IO
completion.  It's icky either way ;)

The nice part with doing it before writeback is that we know that when
we wait for page writeback, we don't also have to wait for i_size update
to be finished.

If we go this route and it gets copied to ext4, we can weigh our options
I guess.

>   One thing which does not seem to be handled is that your code can
> happily race with truncate. So IO completion could reset i_size which
> has been just set by truncate. And I'm not sure how to handle this
> effectively. Generally I'm not sure how much this is going to cost...
> 

Yeah, I was worried about that.  What ends up happening is the setattr
call sets the disk i_size and then calls inode_setattr, who calls
vmtruncate who actually waits on the writeback.

Then, we wander into the ext3 truncate who resets disk i_size down
again.  It's a pretty strange window of updates, but I was thinking it
would work to cut down i_size, wait on IO, then cut it down again in
setattr.

Once we wait on all IO past the new in-memory i_size, writepage won't
send any more down.  So updating disk i_size after the wait should be
enough.

-chris



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-09 16:23   ` Chris Mason
@ 2009-04-09 17:49     ` Jan Kara
  2009-04-09 18:10       ` Chris Mason
  0 siblings, 1 reply; 69+ messages in thread
From: Jan Kara @ 2009-04-09 17:49 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List,
	Ext4 Developers List

> On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > 
> > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > 
> > > One of these patches fixes a performance regression caused by a64c8610,
> > > which unplugged the write queue after every page write.  Now that Jens
> > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > to further improbve ext3 latency, especially during the "sync" command
> > > in Linus's write-big-file-and-sync workload.
> > 
> > So here's a question and a untested _conceptual_ patch. 
> > 
> > The kind of writeback mode I'd personally prefer would be more of a 
> > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > something of the best of both worlds. I'd like the data writeback to get 
> > _started_ when the journal is written to disk, but I'd like it to not 
> > block journal updates.
> > 
> > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > be totally unordered either.
> > 
> 
> I started working on the xfs style i_size updates last night, and here's
> my current (most definitely broken) proof of concept.  I call it
> data=guarded.
> 
> In guarded mode the on disk i_size is not updated until after the data
> writes are complete.  I've got a per FS work queue and I'm abusing
> bh->b_private as a list pointer.  So, what happens is:
> 
> * writepage sets up the buffer with the guarded end_io handler
> 
> * The end_io handler puts the buffer onto the per-sb list of guarded
> buffers and then it kicks the work queue
> 
> * The workqueue updates the on disk i_size to the min of the end of the
> buffer or the in-memory i_size, and then it logs the inode.
> 
> * Then the regular async bh end_io handler is called to end writeback on
> the page.
> 
> One big gotcha is that we starting a transaction while a page is
> writeback.  It means that anyone who waits for writeback to finish on
> the datapage with a transaction running could deadlock against the work
> queue func trying to start a transaction.
  For ext3 I don't think anyone waits for PageWriteback with a
transaction open. We definitely don't do it from ext3 code and generic
code does usually sequence like:
  lock_page(page);
  ...
  wait_on_page_writeback(page)

  and because lock ordering is page_lock < transaction start, we
shouldn't have transaction open at that point.
  But with ext4 it may be different - there, the lock ordering is
transaction start > page_lock and so above code could well have
transaction started.
  Wouldn't it actually be better to update i_size when the page is
fully written out after we clear PG_writeback as you write below?
  One thing which does not seem to be handled is that your code can
happily race with truncate. So IO completion could reset i_size which
has been just set by truncate. And I'm not sure how to handle this
effectively. Generally I'm not sure how much this is going to cost...

> I couldn't find anyone doing that, but if it matters, we can always just
> mark the inode dirty and let some other async func handle the logging.
> We could also play tricks with logging the inode after the real end_io
> handler clears PG_writeback.
> 
> This code doesn't:
> 
> * Deal with hole filling (plan is just to use the ordered code there)
> 
> * Make sure all the blocks are on disk between the new disk i_size and
> the old one.  For this, I'll add an rbtree to track BH_New buffers and
> delay updating the disk isize until the pending BH_New IO is on disk.
> Btrfs already does this, so I should have a handle on the spots I need
> to fiddle.
> 
> There's a ton of room for optimization like not doing async end_io if
> we're already inside disk i_size.

									Honza
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..c5e1ffd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -505,7 +505,7 @@ still_busy:
>   * Completion handler for block_write_full_page() - pages which are unlocked
>   * during I/O, and which have PageWriteback cleared upon I/O completion.
>   */
> -static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
> +void end_buffer_async_write(struct buffer_head *bh, int uptodate)
>  {
>  	char b[BDEVNAME_SIZE];
>  	unsigned long flags;
> @@ -583,11 +583,17 @@ static void mark_buffer_async_read(struct buffer_head *bh)
>  	set_buffer_async_read(bh);
>  }
>  
> -void mark_buffer_async_write(struct buffer_head *bh)
> +void mark_buffer_async_write_endio(struct buffer_head *bh,
> +				   bh_end_io_t *handler)
>  {
> -	bh->b_end_io = end_buffer_async_write;
> +	bh->b_end_io = handler;
>  	set_buffer_async_write(bh);
>  }
> +
> +void mark_buffer_async_write(struct buffer_head *bh)
> +{
> +	mark_buffer_async_write_endio(bh, end_buffer_async_write);
> +}
>  EXPORT_SYMBOL(mark_buffer_async_write);
>  
>  
> @@ -1706,7 +1712,8 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
>   * prevents this contention from occurring.
>   */
>  static int __block_write_full_page(struct inode *inode, struct page *page,
> -			get_block_t *get_block, struct writeback_control *wbc)
> +			get_block_t *get_block, struct writeback_control *wbc,
> +			bh_end_io_t *handler)
>  {
>  	int err;
>  	sector_t block;
> @@ -1789,7 +1796,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  			continue;
>  		}
>  		if (test_clear_buffer_dirty(bh)) {
> -			mark_buffer_async_write(bh);
> +			mark_buffer_async_write_endio(bh, handler);
>  		} else {
>  			unlock_buffer(bh);
>  		}
> @@ -1842,7 +1849,7 @@ recover:
>  		if (buffer_mapped(bh) && buffer_dirty(bh) &&
>  		    !buffer_delay(bh)) {
>  			lock_buffer(bh);
> -			mark_buffer_async_write(bh);
> +			mark_buffer_async_write_endio(bh, handler);
>  		} else {
>  			/*
>  			 * The buffer may have been set dirty during
> @@ -2760,7 +2767,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
>  out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
> -		ret = __block_write_full_page(inode, page, get_block, wbc);
> +		ret = __block_write_full_page(inode, page, get_block, wbc,
> +					      end_buffer_async_write);
>  	return ret;
>  }
>  EXPORT_SYMBOL(nobh_writepage);
> @@ -2918,9 +2926,10 @@ out:
>  
>  /*
>   * The generic ->writepage function for buffer-backed address_spaces
> + * this form passes in the end_io handler used to finish the IO.
>   */
> -int block_write_full_page(struct page *page, get_block_t *get_block,
> -			struct writeback_control *wbc)
> +int block_write_full_page_endio(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc, bh_end_io_t *handler)
>  {
>  	struct inode * const inode = page->mapping->host;
>  	loff_t i_size = i_size_read(inode);
> @@ -2929,7 +2938,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
>  
>  	/* Is the page fully inside i_size? */
>  	if (page->index < end_index)
> -		return __block_write_full_page(inode, page, get_block, wbc);
> +		return __block_write_full_page(inode, page, get_block, wbc,
> +					       handler);
>  
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> @@ -2952,9 +2962,20 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
>  	 * writes to that region are not written out to the file."
>  	 */
>  	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> -	return __block_write_full_page(inode, page, get_block, wbc);
> +	return __block_write_full_page(inode, page, get_block, wbc, handler);
>  }
>  
> +/*
> + * The generic ->writepage function for buffer-backed address_spaces
> + */
> +int block_write_full_page(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc)
> +{
> +	return block_write_full_page_endio(page, get_block, wbc,
> +					   end_buffer_async_write);
> +}
> +
> +
>  sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
>  			    get_block_t *get_block)
>  {
> @@ -3422,9 +3443,11 @@ EXPORT_SYMBOL(block_read_full_page);
>  EXPORT_SYMBOL(block_sync_page);
>  EXPORT_SYMBOL(block_truncate_page);
>  EXPORT_SYMBOL(block_write_full_page);
> +EXPORT_SYMBOL(block_write_full_page_endio);
>  EXPORT_SYMBOL(cont_write_begin);
>  EXPORT_SYMBOL(end_buffer_read_sync);
>  EXPORT_SYMBOL(end_buffer_write_sync);
> +EXPORT_SYMBOL_GPL(end_buffer_async_write);
>  EXPORT_SYMBOL(file_fsync);
>  EXPORT_SYMBOL(fsync_bdev);
>  EXPORT_SYMBOL(generic_block_bmap);
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 5fa453b..64995d0 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -38,6 +38,7 @@
>  #include <linux/bio.h>
>  #include <linux/fiemap.h>
>  #include <linux/namei.h>
> +#include <linux/workqueue.h>
>  #include "xattr.h"
>  #include "acl.h"
>  
> @@ -766,6 +767,21 @@ err_out:
>  	return err;
>  }
>  
> +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
> +{
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	/* FIXME add a lock in the inode */
> +	spin_lock_irqsave(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
> +	if (EXT3_I(inode)->i_disksize < new_size) {
> +		EXT3_I(inode)->i_disksize = new_size;
> +		ret = 1;
> +	}
> +	spin_unlock_irqrestore(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
> +	return ret;
> +}
> +
>  /*
>   * Allocation strategy is simple: if we have to allocate something, we will
>   * have to go the whole way to leaf. So let's do it before attaching anything
> @@ -915,9 +931,13 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
>  	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
>  	 * protect it if you're about to implement concurrent
>  	 * ext3_get_block() -bzzz
> +	 *
> +	 * FIXME, I think this only needs to extend the disk i_size when
> +	 * we're filling holes that came from using ftruncate to increase
> +	 * i_size.  Need to verify.
>  	*/
> -	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> -		ei->i_disksize = inode->i_size;
> +	if (!ext3_should_guard_data(inode) && !err && extend_disksize)
> +		maybe_update_disk_isize(inode, inode->i_size);
>  	mutex_unlock(&ei->truncate_mutex);
>  	if (err)
>  		goto cleanup;
> @@ -1079,6 +1099,50 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
>  	return NULL;
>  }
>  
> +void ext3_run_guarded_work(struct work_struct *work)
> +{
> +	struct ext3_sb_info *sbi =
> +		container_of(work, struct ext3_sb_info, guarded_work);
> +	struct buffer_head *bh;
> +	struct buffer_head *next;
> +	struct inode *inode;
> +	struct page *page;
> +	struct address_space *mapping;
> +	loff_t offset;
> +
> +	spin_lock_irq(&sbi->guarded_lock);
> +	while(sbi->guarded_buffers) {
> +		bh = sbi->guarded_buffers;
> +		next = bh->b_private;
> +		if (!next)
> +			sbi->guarded_tail = NULL;
> +		sbi->guarded_buffers = next;
> +		bh->b_private = NULL;
> +		spin_unlock_irq(&sbi->guarded_lock);
> +
> +		page = bh->b_page;
> +		mapping = page->mapping;
> +		if (!mapping)
> +			goto out;
> +
> +		/* set the offset to the end of this buffer */
> +		offset = page_offset(page) + bh_offset(bh) + bh->b_size;
> +		inode = mapping->host;
> +
> +		/*
> +		 * then chomp back to i_size if that is smaller than the
> +		 * offset
> +		 */
> +		offset = min(offset, inode->i_size);
> +		if (maybe_update_disk_isize(inode, offset))
> +			ext3_dirty_inode(inode);
> +out:
> +		end_buffer_async_write(bh, buffer_uptodate(bh));
> +		spin_lock_irq(&sbi->guarded_lock);
> +	}
> +	spin_unlock_irq(&sbi->guarded_lock);
> +}
> +
>  static int walk_page_buffers(	handle_t *handle,
>  				struct buffer_head *head,
>  				unsigned from,
> @@ -1275,8 +1339,7 @@ static int ext3_ordered_write_end(struct file *file,
>  		loff_t new_i_size;
>  
>  		new_i_size = pos + copied;
> -		if (new_i_size > EXT3_I(inode)->i_disksize)
> -			EXT3_I(inode)->i_disksize = new_i_size;
> +		maybe_update_disk_isize(inode, new_i_size);
>  		ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
>  							page, fsdata);
>  		copied = ret2;
> @@ -1303,8 +1366,30 @@ static int ext3_writeback_write_end(struct file *file,
>  	loff_t new_i_size;
>  
>  	new_i_size = pos + copied;
> -	if (new_i_size > EXT3_I(inode)->i_disksize)
> -		EXT3_I(inode)->i_disksize = new_i_size;
> +	maybe_update_disk_isize(inode, new_i_size);
> +
> +	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> +							page, fsdata);
> +	copied = ret2;
> +	if (ret2 < 0)
> +		ret = ret2;
> +
> +	ret2 = ext3_journal_stop(handle);
> +	if (!ret)
> +		ret = ret2;
> +	unlock_page(page);
> +	page_cache_release(page);
> +
> +	return ret ? ret : copied;
> +}
> +
> +static int ext3_guarded_write_end(struct file *file,
> +				struct address_space *mapping,
> +				loff_t pos, unsigned len, unsigned copied,
> +				struct page *page, void *fsdata)
> +{
> +	handle_t *handle = ext3_journal_current_handle();
> +	int ret = 0, ret2;
>  
>  	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
>  							page, fsdata);
> @@ -1553,6 +1638,74 @@ out_fail:
>  	return ret;
>  }
>  
> +/*
> + * Completion handler for block_write_full_page() - pages which are unlocked
> + * during I/O, and which have PageWriteback cleared upon I/O completion.
> + */
> +static void end_buffer_async_write_guarded(struct buffer_head *bh,
> +					   int uptodate)
> +{
> +	struct ext3_sb_info *sbi;
> +	struct address_space *mapping;
> +	unsigned long flags;
> +
> +	mapping = bh->b_page->mapping;
> +	if (!mapping || bh->b_private) {
> +		end_buffer_async_write(bh, uptodate);
> +		return;
> +	}
> +
> +	/*
> +	 * the end_io callback deals with IO errors later
> +	 */
> +	if (uptodate)
> +		set_buffer_uptodate(bh);
> +	else
> +		clear_buffer_uptodate(bh);
> +
> +	sbi = EXT3_SB(mapping->host->i_sb);
> +	spin_lock_irqsave(&sbi->guarded_lock, flags);
> +	if (sbi->guarded_tail) {
> +		struct buffer_head *last = sbi->guarded_tail;
> +		last->b_private = bh;
> +	} else
> +		sbi->guarded_buffers = bh;
> +	sbi->guarded_tail = bh;
> +	spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> +	queue_work(sbi->guarded_wq, &sbi->guarded_work);
> +}
> +
> +static int ext3_guarded_writepage(struct page *page,
> +				struct writeback_control *wbc)
> +{
> +	struct inode *inode = page->mapping->host;
> +	handle_t *handle = NULL;
> +	int ret = 0;
> +	int err;
> +
> +	if (ext3_journal_current_handle())
> +		goto out_fail;
> +
> +	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		goto out_fail;
> +	}
> +
> +	ret = block_write_full_page_endio(page, ext3_get_block, wbc,
> +					  end_buffer_async_write_guarded);
> +
> +	err = ext3_journal_stop(handle);
> +	if (!ret)
> +		ret = err;
> +	return ret;
> +
> +out_fail:
> +	redirty_page_for_writepage(wbc, page);
> +	unlock_page(page);
> +	return ret;
> +}
> +
>  static int ext3_writeback_writepage(struct page *page,
>  				struct writeback_control *wbc)
>  {
> @@ -1812,6 +1965,21 @@ static const struct address_space_operations ext3_writeback_aops = {
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  };
>  
> +static const struct address_space_operations ext3_guarded_aops = {
> +	.readpage		= ext3_readpage,
> +	.readpages		= ext3_readpages,
> +	.writepage		= ext3_guarded_writepage,
> +	.sync_page		= block_sync_page,
> +	.write_begin		= ext3_write_begin,
> +	.write_end		= ext3_guarded_write_end,
> +	.bmap			= ext3_bmap,
> +	.invalidatepage		= ext3_invalidatepage,
> +	.releasepage		= ext3_releasepage,
> +	.direct_IO		= ext3_direct_IO,
> +	.migratepage		= buffer_migrate_page,
> +	.is_partially_uptodate  = block_is_partially_uptodate,
> +};
> +
>  static const struct address_space_operations ext3_journalled_aops = {
>  	.readpage		= ext3_readpage,
>  	.readpages		= ext3_readpages,
> @@ -1830,6 +1998,8 @@ void ext3_set_aops(struct inode *inode)
>  {
>  	if (ext3_should_order_data(inode))
>  		inode->i_mapping->a_ops = &ext3_ordered_aops;
> +	else if (ext3_should_guard_data(inode))
> +		inode->i_mapping->a_ops = &ext3_guarded_aops;
>  	else if (ext3_should_writeback_data(inode))
>  		inode->i_mapping->a_ops = &ext3_writeback_aops;
>  	else
> @@ -3081,6 +3251,14 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
>  		}
>  
>  		error = ext3_orphan_add(handle, inode);
> +
> +		/*
> +		 * this is pretty confusing, but we don't need to worry
> +		 * about guarded i_size here because ext3 truncate fixes
> +		 * it to the correct i_size when the truncate is all done,
> +		 * and the ext3_orphan_add makes sure we'll have a sane
> +		 * i_size after a crash
> +		 */
>  		EXT3_I(inode)->i_disksize = attr->ia_size;
>  		rc = ext3_mark_inode_dirty(handle, inode);
>  		if (!error)
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 4a97041..0534a95 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -37,6 +37,7 @@
>  #include <linux/quotaops.h>
>  #include <linux/seq_file.h>
>  #include <linux/log2.h>
> +#include <linux/workqueue.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -393,6 +394,9 @@ static void ext3_put_super (struct super_block * sb)
>  	struct ext3_super_block *es = sbi->s_es;
>  	int i, err;
>  
> +	flush_workqueue(sbi->guarded_wq);
> +	destroy_workqueue(sbi->guarded_wq);
> +
>  	ext3_xattr_put_super(sb);
>  	err = journal_destroy(sbi->s_journal);
>  	sbi->s_journal = NULL;
> @@ -628,6 +632,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  		seq_puts(seq, ",data=journal");
>  	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
>  		seq_puts(seq, ",data=ordered");
> +	else if (test_opt(sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> +		seq_puts(seq, ",data=guarded");
>  	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
>  		seq_puts(seq, ",data=writeback");
>  
> @@ -786,7 +792,7 @@ enum {
>  	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
>  	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
>  	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> -	Opt_data_err_abort, Opt_data_err_ignore,
> +	Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
>  	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>  	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>  	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> @@ -828,6 +834,7 @@ static const match_table_t tokens = {
>  	{Opt_abort, "abort"},
>  	{Opt_data_journal, "data=journal"},
>  	{Opt_data_ordered, "data=ordered"},
> +	{Opt_data_guarded, "data=guarded"},
>  	{Opt_data_writeback, "data=writeback"},
>  	{Opt_data_err_abort, "data_err=abort"},
>  	{Opt_data_err_ignore, "data_err=ignore"},
> @@ -1030,6 +1037,9 @@ static int parse_options (char *options, struct super_block *sb,
>  		case Opt_data_ordered:
>  			data_opt = EXT3_MOUNT_ORDERED_DATA;
>  			goto datacheck;
> +		case Opt_data_guarded:
> +			data_opt = EXT3_MOUNT_GUARDED_DATA;
> +			goto datacheck;
>  		case Opt_data_writeback:
>  			data_opt = EXT3_MOUNT_WRITEBACK_DATA;
>  		datacheck:
> @@ -1945,11 +1955,24 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  			clear_opt(sbi->s_mount_opt, NOBH);
>  		}
>  	}
> +
> +	/*
> +	 * setup the guarded work list
> +	 */
> +	EXT3_SB(sb)->guarded_buffers = NULL;
> +	EXT3_SB(sb)->guarded_tail = NULL;
> +	INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
> +	spin_lock_init(&EXT3_SB(sb)->guarded_lock);
> +	EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
> +	if (!EXT3_SB(sb)->guarded_wq) {
> +		printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
> +		goto failed_mount_guard;
> +	}
> +
>  	/*
>  	 * The journal_load will have done any necessary log recovery,
>  	 * so we can safely mount the rest of the filesystem now.
>  	 */
> -
>  	root = ext3_iget(sb, EXT3_ROOT_INO);
>  	if (IS_ERR(root)) {
>  		printk(KERN_ERR "EXT3-fs: get root inode failed\n");
> @@ -1961,6 +1984,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
>  		goto failed_mount4;
>  	}
> +
>  	sb->s_root = d_alloc_root(root);
>  	if (!sb->s_root) {
>  		printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
> @@ -1970,6 +1994,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  	}
>  
>  	ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
> +
>  	/*
>  	 * akpm: core read_super() calls in here with the superblock locked.
>  	 * That deadlocks, because orphan cleanup needs to lock the superblock
> @@ -1985,9 +2010,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
>  		printk (KERN_INFO "EXT3-fs: recovery complete.\n");
>  	ext3_mark_recovery_complete(sb, es);
>  	printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
> -		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> -		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> -		"writeback");
> +	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> +	      test_opt(sb,GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA ? "guarded":
> +	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> +	      "writeback");
>  
>  	lock_kernel();
>  	return 0;
> @@ -1999,6 +2025,8 @@ cantfind_ext3:
>  	goto failed_mount;
>  
>  failed_mount4:
> +	destroy_workqueue(EXT3_SB(sb)->guarded_wq);
> +failed_mount_guard:
>  	journal_destroy(sbi->s_journal);
>  failed_mount3:
>  	percpu_counter_destroy(&sbi->s_freeblocks_counter);
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index bd7ac79..507b38d 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -155,6 +155,7 @@ void create_empty_buffers(struct page *, unsigned long,
>  			unsigned long b_state);
>  void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
>  void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
> +void end_buffer_async_write(struct buffer_head *bh, int uptodate);
>  
>  /* Things to do with buffers at mapping->private_list */
>  void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
> @@ -204,6 +205,8 @@ extern int buffer_heads_over_limit;
>  void block_invalidatepage(struct page *page, unsigned long offset);
>  int block_write_full_page(struct page *page, get_block_t *get_block,
>  				struct writeback_control *wbc);
> +int block_write_full_page_endio(struct page *page, get_block_t *get_block,
> +			struct writeback_control *wbc, bh_end_io_t *handler);
>  int block_read_full_page(struct page*, get_block_t*);
>  int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
>  				unsigned long from);
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index dd495b8..7966bdb 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -18,6 +18,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/magic.h>
> +#include <linux/workqueue.h>
>  
>  /*
>   * The second extended filesystem constants/structures
> @@ -397,7 +398,6 @@ struct ext3_inode {
>  #define EXT3_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
>  #define EXT3_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
>  #define EXT3_MOUNT_ABORT		0x00200	/* Fatal error detected */
> -#define EXT3_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
>  #define EXT3_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
>  #define EXT3_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
>  #define EXT3_MOUNT_WRITEBACK_DATA	0x00C00	/* No data ordering */
> @@ -413,6 +413,12 @@ struct ext3_inode {
>  #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
>  #define EXT3_MOUNT_DATA_ERR_ABORT	0x400000 /* Abort on file data write
>  						  * error in ordered mode */
> +#define EXT3_MOUNT_GUARDED_DATA		0x800000 /* guard new writes with
> +						    i_size */
> +#define EXT3_MOUNT_DATA_FLAGS		(EXT3_MOUNT_JOURNAL_DATA | \
> +					 EXT3_MOUNT_ORDERED_DATA | \
> +					 EXT3_MOUNT_WRITEBACK_DATA | \
> +					 EXT3_MOUNT_GUARDED_DATA)
>  
>  /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
>  #ifndef _LINUX_EXT2_FS_H
> @@ -891,6 +897,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +void ext3_run_guarded_work(struct work_struct *work);
>  
>  /* ioctl.c */
>  extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
> diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
> index f07f34d..868d2cd 100644
> --- a/include/linux/ext3_fs_sb.h
> +++ b/include/linux/ext3_fs_sb.h
> @@ -21,6 +21,7 @@
>  #include <linux/wait.h>
>  #include <linux/blockgroup_lock.h>
>  #include <linux/percpu_counter.h>
> +#include <linux/workqueue.h>
>  #endif
>  #include <linux/rbtree.h>
>  
> @@ -82,6 +83,12 @@ struct ext3_sb_info {
>  	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
>  	int s_jquota_fmt;			/* Format of quota to use */
>  #endif
> +
> +	struct workqueue_struct *guarded_wq;
> +	struct work_struct guarded_work;
> +	struct buffer_head *guarded_buffers;
> +	struct buffer_head *guarded_tail;
> +	spinlock_t guarded_lock;
>  };
>  
>  static inline spinlock_t *
> diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
> index cf82d51..45cb4aa 100644
> --- a/include/linux/ext3_jbd.h
> +++ b/include/linux/ext3_jbd.h
> @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
>  	return 0;
>  }
>  
> +static inline int ext3_should_guard_data(struct inode *inode)
> +{
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
> +		return 0;
> +	if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> +		return 1;
> +	return 0;
> +}
> +
>  static inline int ext3_should_writeback_data(struct inode *inode)
>  {
>  	if (!S_ISREG(inode->i_mode))
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-09 15:49 ` Linus Torvalds
  2009-04-09 16:23   ` Chris Mason
@ 2009-04-09 17:36   ` Jan Kara
  1 sibling, 0 replies; 69+ messages in thread
From: Jan Kara @ 2009-04-09 17:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List

> On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > 
> > One of these patches fixes a performance regression caused by a64c8610,
> > which unplugged the write queue after every page write.  Now that Jens
> > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > to further improbve ext3 latency, especially during the "sync" command
> > in Linus's write-big-file-and-sync workload.
> 
> So here's a question and a untested _conceptual_ patch. 
> 
> The kind of writeback mode I'd personally prefer would be more of a 
> mixture of the current "data=writeback" and "data=ordered" modes, with 
> something of the best of both worlds. I'd like the data writeback to get 
> _started_ when the journal is written to disk, but I'd like it to not 
> block journal updates.
> 
> IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> be totally unordered either.
> 
> For true sync operations (ie fsync()), the VFS layer then does the proper 
> "wait for data" part.
> 
> I dunno. I don't actually know the JBD internal constraints, but what I'm 
> talking about is something like the appended patch. It wouldn't help under 
> really heavy writeback IO (because even if we don't end up waiting for all 
> the random data to complete, we'd end up waiting when _submitting_ it), 
> but it might help under somewhat less extreme loads.
> 
> This is totally untested. It might well violate some serious internal jbd 
> rules and eat your filesystem, for all I know. I'm throwing the patch out 
> as a "would something _like_ this perhaps make sense as a half-way-point 
> between 'ordered' and 'writeback', nothing more.
  Yes, your patch should work - it'll change the meaning of ordered mode
as you describe above. It's kind of even safer version than Ted's tweaks
to submit IO after truncate / rename. But it has also higher cost not
only in terms of IO but also because we have to track data buffers in
journal lists in this mode. But if we decide this is a way to go, then I
can port my ordered mode changes from JBD2 so that we track inodes and
not all data buffers in journal lists and that would mitigate this
disadvantage.

									Honza
> ---
>  fs/jbd/commit.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index a8e8513..5bea3ed 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -184,6 +184,9 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  	}
>  }
>  
> +/* This would obviously be a real flag, set at mount time */
> +#define BACKGROUND_DATA(journal) (1)
> +
>  /*
>   *  Submit all the data buffers to disk
>   */
> @@ -198,6 +201,9 @@ static int journal_submit_data_buffers(journal_t *journal,
>  	struct buffer_head **wbuf = journal->j_wbuf;
>  	int err = 0;
>  
> +	if (BACKGROUND_DATA(journal))
> +		write_op = WRITE;
> +
>  	/*
>  	 * Whenever we unlock the journal and sleep, things can get added
>  	 * onto ->t_sync_datalist, so we have to keep looping back to
> @@ -254,7 +260,10 @@ write_out_data:
>  		if (locked && test_clear_buffer_dirty(bh)) {
>  			BUFFER_TRACE(bh, "needs writeout, adding to array");
>  			wbuf[bufs++] = bh;
> -			__journal_file_buffer(jh, commit_transaction,
> +			if (BACKGROUND_DATA(journal))
> +				__journal_unfile_buffer(jh);
> +			else
> +				__journal_file_buffer(jh, commit_transaction,
>  						BJ_Locked);
>  			jbd_unlock_bh_state(bh);
>  			if (bufs == journal->j_wbufsize) {
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-09 15:49 ` Linus Torvalds
@ 2009-04-09 16:23   ` Chris Mason
  2009-04-09 17:49     ` Jan Kara
  2009-04-09 17:36   ` Jan Kara
  1 sibling, 1 reply; 69+ messages in thread
From: Chris Mason @ 2009-04-09 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List

On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> 
> On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > 
> > One of these patches fixes a performance regression caused by a64c8610,
> > which unplugged the write queue after every page write.  Now that Jens
> > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > to further improbve ext3 latency, especially during the "sync" command
> > in Linus's write-big-file-and-sync workload.
> 
> So here's a question and a untested _conceptual_ patch. 
> 
> The kind of writeback mode I'd personally prefer would be more of a 
> mixture of the current "data=writeback" and "data=ordered" modes, with 
> something of the best of both worlds. I'd like the data writeback to get 
> _started_ when the journal is written to disk, but I'd like it to not 
> block journal updates.
> 
> IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> be totally unordered either.
> 

I started working on the xfs style i_size updates last night, and here's
my current (most definitely broken) proof of concept.  I call it
data=guarded.

In guarded mode the on disk i_size is not updated until after the data
writes are complete.  I've got a per FS work queue and I'm abusing
bh->b_private as a list pointer.  So, what happens is:

* writepage sets up the buffer with the guarded end_io handler

* The end_io handler puts the buffer onto the per-sb list of guarded
buffers and then it kicks the work queue

* The workqueue updates the on disk i_size to the min of the end of the
buffer or the in-memory i_size, and then it logs the inode.

* Then the regular async bh end_io handler is called to end writeback on
the page.

One big gotcha is that we starting a transaction while a page is
writeback.  It means that anyone who waits for writeback to finish on
the datapage with a transaction running could deadlock against the work
queue func trying to start a transaction.

I couldn't find anyone doing that, but if it matters, we can always just
mark the inode dirty and let some other async func handle the logging.
We could also play tricks with logging the inode after the real end_io
handler clears PG_writeback.

This code doesn't:

* Deal with hole filling (plan is just to use the ordered code there)

* Make sure all the blocks are on disk between the new disk i_size and
the old one.  For this, I'll add an rbtree to track BH_New buffers and
delay updating the disk isize until the pending BH_New IO is on disk.
Btrfs already does this, so I should have a handle on the spots I need
to fiddle.

There's a ton of room for optimization like not doing async end_io if
we're already inside disk i_size.

-chris

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..c5e1ffd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -505,7 +505,7 @@ still_busy:
  * Completion handler for block_write_full_page() - pages which are unlocked
  * during I/O, and which have PageWriteback cleared upon I/O completion.
  */
-static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 {
 	char b[BDEVNAME_SIZE];
 	unsigned long flags;
@@ -583,11 +583,17 @@ static void mark_buffer_async_read(struct buffer_head *bh)
 	set_buffer_async_read(bh);
 }
 
-void mark_buffer_async_write(struct buffer_head *bh)
+void mark_buffer_async_write_endio(struct buffer_head *bh,
+				   bh_end_io_t *handler)
 {
-	bh->b_end_io = end_buffer_async_write;
+	bh->b_end_io = handler;
 	set_buffer_async_write(bh);
 }
+
+void mark_buffer_async_write(struct buffer_head *bh)
+{
+	mark_buffer_async_write_endio(bh, end_buffer_async_write);
+}
 EXPORT_SYMBOL(mark_buffer_async_write);
 
 
@@ -1706,7 +1712,8 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
  * prevents this contention from occurring.
  */
 static int __block_write_full_page(struct inode *inode, struct page *page,
-			get_block_t *get_block, struct writeback_control *wbc)
+			get_block_t *get_block, struct writeback_control *wbc,
+			bh_end_io_t *handler)
 {
 	int err;
 	sector_t block;
@@ -1789,7 +1796,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 			continue;
 		}
 		if (test_clear_buffer_dirty(bh)) {
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			unlock_buffer(bh);
 		}
@@ -1842,7 +1849,7 @@ recover:
 		if (buffer_mapped(bh) && buffer_dirty(bh) &&
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			/*
 			 * The buffer may have been set dirty during
@@ -2760,7 +2767,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
-		ret = __block_write_full_page(inode, page, get_block, wbc);
+		ret = __block_write_full_page(inode, page, get_block, wbc,
+					      end_buffer_async_write);
 	return ret;
 }
 EXPORT_SYMBOL(nobh_writepage);
@@ -2918,9 +2926,10 @@ out:
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
+ * this form passes in the end_io handler used to finish the IO.
  */
-int block_write_full_page(struct page *page, get_block_t *get_block,
-			struct writeback_control *wbc)
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler)
 {
 	struct inode * const inode = page->mapping->host;
 	loff_t i_size = i_size_read(inode);
@@ -2929,7 +2938,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 
 	/* Is the page fully inside i_size? */
 	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc);
+		return __block_write_full_page(inode, page, get_block, wbc,
+					       handler);
 
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
@@ -2952,9 +2962,20 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-	return __block_write_full_page(inode, page, get_block, wbc);
+	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
+/*
+ * The generic ->writepage function for buffer-backed address_spaces
+ */
+int block_write_full_page(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc)
+{
+	return block_write_full_page_endio(page, get_block, wbc,
+					   end_buffer_async_write);
+}
+
+
 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
 			    get_block_t *get_block)
 {
@@ -3422,9 +3443,11 @@ EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
+EXPORT_SYMBOL(block_write_full_page_endio);
 EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
+EXPORT_SYMBOL_GPL(end_buffer_async_write);
 EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(fsync_bdev);
 EXPORT_SYMBOL(generic_block_bmap);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..64995d0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
 #include <linux/bio.h>
 #include <linux/fiemap.h>
 #include <linux/namei.h>
+#include <linux/workqueue.h>
 #include "xattr.h"
 #include "acl.h"
 
@@ -766,6 +767,21 @@ err_out:
 	return err;
 }
 
+static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	/* FIXME add a lock in the inode */
+	spin_lock_irqsave(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
+	if (EXT3_I(inode)->i_disksize < new_size) {
+		EXT3_I(inode)->i_disksize = new_size;
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&EXT3_SB(inode->i_sb)->guarded_lock, flags);
+	return ret;
+}
+
 /*
  * Allocation strategy is simple: if we have to allocate something, we will
  * have to go the whole way to leaf. So let's do it before attaching anything
@@ -915,9 +931,13 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
 	 * protect it if you're about to implement concurrent
 	 * ext3_get_block() -bzzz
+	 *
+	 * FIXME, I think this only needs to extend the disk i_size when
+	 * we're filling holes that came from using ftruncate to increase
+	 * i_size.  Need to verify.
 	*/
-	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
-		ei->i_disksize = inode->i_size;
+	if (!ext3_should_guard_data(inode) && !err && extend_disksize)
+		maybe_update_disk_isize(inode, inode->i_size);
 	mutex_unlock(&ei->truncate_mutex);
 	if (err)
 		goto cleanup;
@@ -1079,6 +1099,50 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
 	return NULL;
 }
 
+void ext3_run_guarded_work(struct work_struct *work)
+{
+	struct ext3_sb_info *sbi =
+		container_of(work, struct ext3_sb_info, guarded_work);
+	struct buffer_head *bh;
+	struct buffer_head *next;
+	struct inode *inode;
+	struct page *page;
+	struct address_space *mapping;
+	loff_t offset;
+
+	spin_lock_irq(&sbi->guarded_lock);
+	while(sbi->guarded_buffers) {
+		bh = sbi->guarded_buffers;
+		next = bh->b_private;
+		if (!next)
+			sbi->guarded_tail = NULL;
+		sbi->guarded_buffers = next;
+		bh->b_private = NULL;
+		spin_unlock_irq(&sbi->guarded_lock);
+
+		page = bh->b_page;
+		mapping = page->mapping;
+		if (!mapping)
+			goto out;
+
+		/* set the offset to the end of this buffer */
+		offset = page_offset(page) + bh_offset(bh) + bh->b_size;
+		inode = mapping->host;
+
+		/*
+		 * then chomp back to i_size if that is smaller than the
+		 * offset
+		 */
+		offset = min(offset, inode->i_size);
+		if (maybe_update_disk_isize(inode, offset))
+			ext3_dirty_inode(inode);
+out:
+		end_buffer_async_write(bh, buffer_uptodate(bh));
+		spin_lock_irq(&sbi->guarded_lock);
+	}
+	spin_unlock_irq(&sbi->guarded_lock);
+}
+
 static int walk_page_buffers(	handle_t *handle,
 				struct buffer_head *head,
 				unsigned from,
@@ -1275,8 +1339,7 @@ static int ext3_ordered_write_end(struct file *file,
 		loff_t new_i_size;
 
 		new_i_size = pos + copied;
-		if (new_i_size > EXT3_I(inode)->i_disksize)
-			EXT3_I(inode)->i_disksize = new_i_size;
+		maybe_update_disk_isize(inode, new_i_size);
 		ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
 							page, fsdata);
 		copied = ret2;
@@ -1303,8 +1366,30 @@ static int ext3_writeback_write_end(struct file *file,
 	loff_t new_i_size;
 
 	new_i_size = pos + copied;
-	if (new_i_size > EXT3_I(inode)->i_disksize)
-		EXT3_I(inode)->i_disksize = new_i_size;
+	maybe_update_disk_isize(inode, new_i_size);
+
+	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
+							page, fsdata);
+	copied = ret2;
+	if (ret2 < 0)
+		ret = ret2;
+
+	ret2 = ext3_journal_stop(handle);
+	if (!ret)
+		ret = ret2;
+	unlock_page(page);
+	page_cache_release(page);
+
+	return ret ? ret : copied;
+}
+
+static int ext3_guarded_write_end(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned copied,
+				struct page *page, void *fsdata)
+{
+	handle_t *handle = ext3_journal_current_handle();
+	int ret = 0, ret2;
 
 	ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
 							page, fsdata);
@@ -1553,6 +1638,74 @@ out_fail:
 	return ret;
 }
 
+/*
+ * Completion handler for block_write_full_page() - pages which are unlocked
+ * during I/O, and which have PageWriteback cleared upon I/O completion.
+ */
+static void end_buffer_async_write_guarded(struct buffer_head *bh,
+					   int uptodate)
+{
+	struct ext3_sb_info *sbi;
+	struct address_space *mapping;
+	unsigned long flags;
+
+	mapping = bh->b_page->mapping;
+	if (!mapping || bh->b_private) {
+		end_buffer_async_write(bh, uptodate);
+		return;
+	}
+
+	/*
+	 * the end_io callback deals with IO errors later
+	 */
+	if (uptodate)
+		set_buffer_uptodate(bh);
+	else
+		clear_buffer_uptodate(bh);
+
+	sbi = EXT3_SB(mapping->host->i_sb);
+	spin_lock_irqsave(&sbi->guarded_lock, flags);
+	if (sbi->guarded_tail) {
+		struct buffer_head *last = sbi->guarded_tail;
+		last->b_private = bh;
+	} else
+		sbi->guarded_buffers = bh;
+	sbi->guarded_tail = bh;
+	spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+	queue_work(sbi->guarded_wq, &sbi->guarded_work);
+}
+
+static int ext3_guarded_writepage(struct page *page,
+				struct writeback_control *wbc)
+{
+	struct inode *inode = page->mapping->host;
+	handle_t *handle = NULL;
+	int ret = 0;
+	int err;
+
+	if (ext3_journal_current_handle())
+		goto out_fail;
+
+	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out_fail;
+	}
+
+	ret = block_write_full_page_endio(page, ext3_get_block, wbc,
+					  end_buffer_async_write_guarded);
+
+	err = ext3_journal_stop(handle);
+	if (!ret)
+		ret = err;
+	return ret;
+
+out_fail:
+	redirty_page_for_writepage(wbc, page);
+	unlock_page(page);
+	return ret;
+}
+
 static int ext3_writeback_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
@@ -1812,6 +1965,21 @@ static const struct address_space_operations ext3_writeback_aops = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
+static const struct address_space_operations ext3_guarded_aops = {
+	.readpage		= ext3_readpage,
+	.readpages		= ext3_readpages,
+	.writepage		= ext3_guarded_writepage,
+	.sync_page		= block_sync_page,
+	.write_begin		= ext3_write_begin,
+	.write_end		= ext3_guarded_write_end,
+	.bmap			= ext3_bmap,
+	.invalidatepage		= ext3_invalidatepage,
+	.releasepage		= ext3_releasepage,
+	.direct_IO		= ext3_direct_IO,
+	.migratepage		= buffer_migrate_page,
+	.is_partially_uptodate  = block_is_partially_uptodate,
+};
+
 static const struct address_space_operations ext3_journalled_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
@@ -1830,6 +1998,8 @@ void ext3_set_aops(struct inode *inode)
 {
 	if (ext3_should_order_data(inode))
 		inode->i_mapping->a_ops = &ext3_ordered_aops;
+	else if (ext3_should_guard_data(inode))
+		inode->i_mapping->a_ops = &ext3_guarded_aops;
 	else if (ext3_should_writeback_data(inode))
 		inode->i_mapping->a_ops = &ext3_writeback_aops;
 	else
@@ -3081,6 +3251,14 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 		}
 
 		error = ext3_orphan_add(handle, inode);
+
+		/*
+		 * this is pretty confusing, but we don't need to worry
+		 * about guarded i_size here because ext3 truncate fixes
+		 * it to the correct i_size when the truncate is all done,
+		 * and the ext3_orphan_add makes sure we'll have a sane
+		 * i_size after a crash
+		 */
 		EXT3_I(inode)->i_disksize = attr->ia_size;
 		rc = ext3_mark_inode_dirty(handle, inode);
 		if (!error)
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 4a97041..0534a95 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -37,6 +37,7 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/workqueue.h>
 
 #include <asm/uaccess.h>
 
@@ -393,6 +394,9 @@ static void ext3_put_super (struct super_block * sb)
 	struct ext3_super_block *es = sbi->s_es;
 	int i, err;
 
+	flush_workqueue(sbi->guarded_wq);
+	destroy_workqueue(sbi->guarded_wq);
+
 	ext3_xattr_put_super(sb);
 	err = journal_destroy(sbi->s_journal);
 	sbi->s_journal = NULL;
@@ -628,6 +632,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",data=journal");
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
 		seq_puts(seq, ",data=ordered");
+	else if (test_opt(sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+		seq_puts(seq, ",data=guarded");
 	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
 		seq_puts(seq, ",data=writeback");
 
@@ -786,7 +792,7 @@ enum {
 	Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
 	Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
-	Opt_data_err_abort, Opt_data_err_ignore,
+	Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -828,6 +834,7 @@ static const match_table_t tokens = {
 	{Opt_abort, "abort"},
 	{Opt_data_journal, "data=journal"},
 	{Opt_data_ordered, "data=ordered"},
+	{Opt_data_guarded, "data=guarded"},
 	{Opt_data_writeback, "data=writeback"},
 	{Opt_data_err_abort, "data_err=abort"},
 	{Opt_data_err_ignore, "data_err=ignore"},
@@ -1030,6 +1037,9 @@ static int parse_options (char *options, struct super_block *sb,
 		case Opt_data_ordered:
 			data_opt = EXT3_MOUNT_ORDERED_DATA;
 			goto datacheck;
+		case Opt_data_guarded:
+			data_opt = EXT3_MOUNT_GUARDED_DATA;
+			goto datacheck;
 		case Opt_data_writeback:
 			data_opt = EXT3_MOUNT_WRITEBACK_DATA;
 		datacheck:
@@ -1945,11 +1955,24 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 			clear_opt(sbi->s_mount_opt, NOBH);
 		}
 	}
+
+	/*
+	 * setup the guarded work list
+	 */
+	EXT3_SB(sb)->guarded_buffers = NULL;
+	EXT3_SB(sb)->guarded_tail = NULL;
+	INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
+	spin_lock_init(&EXT3_SB(sb)->guarded_lock);
+	EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
+	if (!EXT3_SB(sb)->guarded_wq) {
+		printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
+		goto failed_mount_guard;
+	}
+
 	/*
 	 * The journal_load will have done any necessary log recovery,
 	 * so we can safely mount the rest of the filesystem now.
 	 */
-
 	root = ext3_iget(sb, EXT3_ROOT_INO);
 	if (IS_ERR(root)) {
 		printk(KERN_ERR "EXT3-fs: get root inode failed\n");
@@ -1961,6 +1984,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
 		goto failed_mount4;
 	}
+
 	sb->s_root = d_alloc_root(root);
 	if (!sb->s_root) {
 		printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
@@ -1970,6 +1994,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 	}
 
 	ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
+
 	/*
 	 * akpm: core read_super() calls in here with the superblock locked.
 	 * That deadlocks, because orphan cleanup needs to lock the superblock
@@ -1985,9 +2010,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		printk (KERN_INFO "EXT3-fs: recovery complete.\n");
 	ext3_mark_recovery_complete(sb, es);
 	printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
-		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
-		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
-		"writeback");
+	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
+	      test_opt(sb,GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA ? "guarded":
+	      test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
+	      "writeback");
 
 	lock_kernel();
 	return 0;
@@ -1999,6 +2025,8 @@ cantfind_ext3:
 	goto failed_mount;
 
 failed_mount4:
+	destroy_workqueue(EXT3_SB(sb)->guarded_wq);
+failed_mount_guard:
 	journal_destroy(sbi->s_journal);
 failed_mount3:
 	percpu_counter_destroy(&sbi->s_freeblocks_counter);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd7ac79..507b38d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -155,6 +155,7 @@ void create_empty_buffers(struct page *, unsigned long,
 			unsigned long b_state);
 void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
 void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
+void end_buffer_async_write(struct buffer_head *bh, int uptodate);
 
 /* Things to do with buffers at mapping->private_list */
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
@@ -204,6 +205,8 @@ extern int buffer_heads_over_limit;
 void block_invalidatepage(struct page *page, unsigned long offset);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 				struct writeback_control *wbc);
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler);
 int block_read_full_page(struct page*, get_block_t*);
 int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
 				unsigned long from);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index dd495b8..7966bdb 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -18,6 +18,7 @@
 
 #include <linux/types.h>
 #include <linux/magic.h>
+#include <linux/workqueue.h>
 
 /*
  * The second extended filesystem constants/structures
@@ -397,7 +398,6 @@ struct ext3_inode {
 #define EXT3_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT3_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
 #define EXT3_MOUNT_ABORT		0x00200	/* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT3_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT3_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
 #define EXT3_MOUNT_WRITEBACK_DATA	0x00C00	/* No data ordering */
@@ -413,6 +413,12 @@ struct ext3_inode {
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
 #define EXT3_MOUNT_DATA_ERR_ABORT	0x400000 /* Abort on file data write
 						  * error in ordered mode */
+#define EXT3_MOUNT_GUARDED_DATA		0x800000 /* guard new writes with
+						    i_size */
+#define EXT3_MOUNT_DATA_FLAGS		(EXT3_MOUNT_JOURNAL_DATA | \
+					 EXT3_MOUNT_ORDERED_DATA | \
+					 EXT3_MOUNT_WRITEBACK_DATA | \
+					 EXT3_MOUNT_GUARDED_DATA)
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -891,6 +897,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+void ext3_run_guarded_work(struct work_struct *work);
 
 /* ioctl.c */
 extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..868d2cd 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
 #include <linux/wait.h>
 #include <linux/blockgroup_lock.h>
 #include <linux/percpu_counter.h>
+#include <linux/workqueue.h>
 #endif
 #include <linux/rbtree.h>
 
@@ -82,6 +83,12 @@ struct ext3_sb_info {
 	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
 	int s_jquota_fmt;			/* Format of quota to use */
 #endif
+
+	struct workqueue_struct *guarded_wq;
+	struct work_struct guarded_work;
+	struct buffer_head *guarded_buffers;
+	struct buffer_head *guarded_tail;
+	spinlock_t guarded_lock;
 };
 
 static inline spinlock_t *
diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
index cf82d51..45cb4aa 100644
--- a/include/linux/ext3_jbd.h
+++ b/include/linux/ext3_jbd.h
@@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
 	return 0;
 }
 
+static inline int ext3_should_guard_data(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
+		return 0;
+	if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+		return 1;
+	return 0;
+}
+
 static inline int ext3_should_writeback_data(struct inode *inode)
 {
 	if (!S_ISREG(inode->i_mode))



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [GIT PULL] Ext3 latency fixes
  2009-04-08 23:40 Theodore Ts'o
@ 2009-04-09 15:49 ` Linus Torvalds
  2009-04-09 16:23   ` Chris Mason
  2009-04-09 17:36   ` Jan Kara
  0 siblings, 2 replies; 69+ messages in thread
From: Linus Torvalds @ 2009-04-09 15:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux Kernel Developers List, Ext4 Developers List



On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> 
> One of these patches fixes a performance regression caused by a64c8610,
> which unplugged the write queue after every page write.  Now that Jens
> added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> to further improbve ext3 latency, especially during the "sync" command
> in Linus's write-big-file-and-sync workload.

So here's a question and a untested _conceptual_ patch. 

The kind of writeback mode I'd personally prefer would be more of a 
mixture of the current "data=writeback" and "data=ordered" modes, with 
something of the best of both worlds. I'd like the data writeback to get 
_started_ when the journal is written to disk, but I'd like it to not 
block journal updates.

IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
be totally unordered either.

For true sync operations (ie fsync()), the VFS layer then does the proper 
"wait for data" part.

I dunno. I don't actually know the JBD internal constraints, but what I'm 
talking about is something like the appended patch. It wouldn't help under 
really heavy writeback IO (because even if we don't end up waiting for all 
the random data to complete, we'd end up waiting when _submitting_ it), 
but it might help under somewhat less extreme loads.

This is totally untested. It might well violate some serious internal jbd 
rules and eat your filesystem, for all I know. I'm throwing the patch out 
as a "would something _like_ this perhaps make sense as a half-way-point 
between 'ordered' and 'writeback', nothing more.

Hmm?

		Linus
---
 fs/jbd/commit.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a8e8513..5bea3ed 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -184,6 +184,9 @@ static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 	}
 }
 
+/* This would obviously be a real flag, set at mount time */
+#define BACKGROUND_DATA(journal) (1)
+
 /*
  *  Submit all the data buffers to disk
  */
@@ -198,6 +201,9 @@ static int journal_submit_data_buffers(journal_t *journal,
 	struct buffer_head **wbuf = journal->j_wbuf;
 	int err = 0;
 
+	if (BACKGROUND_DATA(journal))
+		write_op = WRITE;
+
 	/*
 	 * Whenever we unlock the journal and sleep, things can get added
 	 * onto ->t_sync_datalist, so we have to keep looping back to
@@ -254,7 +260,10 @@ write_out_data:
 		if (locked && test_clear_buffer_dirty(bh)) {
 			BUFFER_TRACE(bh, "needs writeout, adding to array");
 			wbuf[bufs++] = bh;
-			__journal_file_buffer(jh, commit_transaction,
+			if (BACKGROUND_DATA(journal))
+				__journal_unfile_buffer(jh);
+			else
+				__journal_file_buffer(jh, commit_transaction,
 						BJ_Locked);
 			jbd_unlock_bh_state(bh);
 			if (bufs == journal->j_wbufsize) {

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [GIT PULL] Ext3 latency fixes
@ 2009-04-08 23:40 Theodore Ts'o
  2009-04-09 15:49 ` Linus Torvalds
  0 siblings, 1 reply; 69+ messages in thread
From: Theodore Ts'o @ 2009-04-08 23:40 UTC (permalink / raw)
  To: torvalds; +Cc: Linux Kernel Developers List, Ext4 Developers List

Hi Linus,

Please pull from: 

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ext3-latency-fixes

One of these patches fixes a performance regression caused by a64c8610,
which unplugged the write queue after every page write.  Now that Jens
added WRITE_SYNC_PLUG.the patch causes us to use it instead of
WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
to further improbve ext3 latency, especially during the "sync" command
in Linus's write-big-file-and-sync workload.

In addition, I have a patch from Jan that avoids starting a transaction
unnecessarily for the data=writepage case.

						- Ted

Jan Kara (1):
      ext3: Try to avoid starting a transaction in writepage for data=writepage

Theodore Ts'o (1):
      block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG

 fs/buffer.c     |   13 ++++++++++++-
 fs/ext3/inode.c |   23 ++++++++++++++++++-----
 2 files changed, 30 insertions(+), 6 deletions(-)


commit 430db323fae7665da721768949ade6304811c648
Author: Jan Kara <jack@suse.cz>
Date:   Tue Apr 7 18:25:01 2009 -0400

    ext3: Try to avoid starting a transaction in writepage for data=writepage
    
    This does the same as commit 9e80d407736161d9b8b0c5a0d44f786e44c322ea
    (avoid starting a transaction when no block allocation is needed)
    but for data=writeback mode of ext3. We also cleanup the data=ordered
    case a bit to stick to coding style...
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 466a332..fcfa243 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1521,12 +1521,16 @@ static int ext3_ordered_writepage(struct page *page,
 	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, inode->i_sb->s_blocksize,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
-	} else if (!walk_page_buffers(NULL, page_buffers(page), 0, PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
-		/* Provide NULL instead of get_block so that we catch bugs if buffers weren't really mapped */
-		return block_write_full_page(page, NULL, wbc);
+		page_bufs = page_buffers(page);
+	} else {
+		page_bufs = page_buffers(page);
+		if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
+				       NULL, buffer_unmapped)) {
+			/* Provide NULL get_block() to catch bugs if buffers
+			 * weren't really mapped */
+			return block_write_full_page(page, NULL, wbc);
+		}
 	}
-	page_bufs = page_buffers(page);
-
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
 
 	if (IS_ERR(handle)) {
@@ -1581,6 +1585,15 @@ static int ext3_writeback_writepage(struct page *page,
 	if (ext3_journal_current_handle())
 		goto out_fail;
 
+	if (page_has_buffers(page)) {
+		if (!walk_page_buffers(NULL, page_buffers(page), 0,
+				      PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
+			/* Provide NULL get_block() to catch bugs if buffers
+			 * weren't really mapped */
+			return block_write_full_page(page, NULL, wbc);
+		}
+	}
+
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);

commit 6e34eeddf7deec1444bbddab533f03f520d8458c
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Tue Apr 7 18:12:43 2009 -0400

    block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG
    
    Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
    use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
    the block device I/O queue between each page that gets flushed out.
    
    Otherwise, when we run sync() or fsync() and we need to write out a
    large number of pages, the block device queue will get unplugged
    between for every page that is flushed out, which will be a pretty
    serious performance regression caused by commit a64c8610.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/buffer.c b/fs/buffer.c
index 6e35762..13edf7a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1596,6 +1596,16 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
  * locked buffer.   This only can happen if someone has written the buffer
  * directly, with submit_bh().  At the address_space level PageWriteback
  * prevents this contention from occurring.
+ *
+ * If block_write_full_page() is called with wbc->sync_mode ==
+ * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
+ * causes the writes to be flagged as synchronous writes, but the
+ * block device queue will NOT be unplugged, since usually many pages
+ * will be pushed to the out before the higher-level caller actually
+ * waits for the writes to be completed.  The various wait functions,
+ * such as wait_on_writeback_range() will ultimately call sync_page()
+ * which will ultimately call blk_run_backing_dev(), which will end up
+ * unplugging the device queue.
  */
 static int __block_write_full_page(struct inode *inode, struct page *page,
 			get_block_t *get_block, struct writeback_control *wbc)
@@ -1606,7 +1616,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	const unsigned blocksize = 1 << inode->i_blkbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
+			WRITE_SYNC_PLUG : WRITE);
 
 	BUG_ON(!PageLocked(page));
 

^ permalink raw reply related	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2009-04-18  3:09 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-03  7:01 [GIT PULL] Ext3 latency fixes Theodore Ts'o
2009-04-03  7:01 ` [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks Theodore Ts'o
2009-04-03  7:01   ` [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync() Theodore Ts'o
2009-04-03  7:01     ` [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode Theodore Ts'o
2009-04-03  7:01       ` [PATCH 4/4] ext3: Add replace-on-rename " Theodore Ts'o
2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
2009-04-03 18:47   ` Jens Axboe
2009-04-03 19:13     ` Theodore Tso
2009-04-03 21:01     ` Chris Mason
2009-04-03 19:02   ` Linus Torvalds
2009-04-03 20:41     ` Linus Torvalds
2009-04-04 13:57       ` Theodore Tso
2009-04-04 15:16         ` Jens Axboe
2009-04-04 15:57           ` Linus Torvalds
2009-04-04 16:06             ` Linus Torvalds
2009-04-04 17:36               ` Jens Axboe
2009-04-04 17:34             ` Jens Axboe
2009-04-04 17:44               ` Linus Torvalds
2009-04-04 18:00                 ` Trenton D. Adams
2009-04-04 18:01                 ` Jens Axboe
2009-04-04 18:10                   ` Linus Torvalds
2009-04-04 23:22                   ` Theodore Tso
2009-04-04 23:33                     ` Arjan van de Ven
2009-04-05  0:10                       ` Theodore Tso
2009-04-05 15:05                         ` Arjan van de Ven
2009-04-05 17:01                         ` Linus Torvalds
2009-04-05 17:15                           ` Mark Lord
2009-04-05 20:57                             ` Jeff Garzik
2009-04-05 23:48                               ` Arjan van de Ven
2009-04-06  2:32                                 ` Mark Lord
2009-04-06  5:47                                 ` Jeff Garzik
2009-04-07 18:18                                   ` Linus Torvalds
2009-04-07 18:22                                     ` Linus Torvalds
2009-04-07 19:40                                     ` [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes) Jeff Garzik
2009-04-09 18:21                                       ` Tejun Heo
2009-04-18  3:02                                         ` George Spelvin
2009-04-06  8:13                             ` [GIT PULL] Ext3 latency fixes Jens Axboe
2009-04-05 18:56                           ` Arjan van de Ven
2009-04-05 19:34                             ` Linus Torvalds
2009-04-05 20:06                               ` Arjan van de Ven
2009-04-06  6:25                               ` Jens Axboe
2009-04-06  6:05                           ` Theodore Tso
2009-04-06  6:23                           ` Jens Axboe
2009-04-06  8:16                       ` Jens Axboe
2009-04-06 14:48                         ` Linus Torvalds
2009-04-06 15:09                           ` Jens Axboe
2009-04-06  6:15                     ` Jens Axboe
2009-04-04 20:18               ` Ingo Molnar
2009-04-06 21:50                 ` Lennart Sorensen
2009-04-07 13:31                   ` Mark Lord
2009-04-07 14:48                     ` Lennart Sorensen
2009-04-07 19:21                       ` Mark Lord
2009-04-07 19:57                         ` Lennart Sorensen
2009-04-04 20:56               ` Arjan van de Ven
2009-04-06  7:06                 ` Jens Axboe
2009-04-07 15:39             ` Indan Zupancic
2009-04-04 19:18           ` Theodore Tso
2009-04-06  8:12             ` Jens Axboe
2009-04-04 22:13         ` Linus Torvalds
2009-04-04 22:19           ` Linus Torvalds
2009-04-05  0:20           ` Theodore Tso
2009-04-03 19:54   ` Theodore Tso
2009-04-08 23:40 Theodore Ts'o
2009-04-09 15:49 ` Linus Torvalds
2009-04-09 16:23   ` Chris Mason
2009-04-09 17:49     ` Jan Kara
2009-04-09 18:10       ` Chris Mason
2009-04-09 19:04         ` Jan Kara
2009-04-09 17:36   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).