linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chris Mason <chris.mason@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [GIT PULL] Ext3 latency fixes
Date: Thu, 09 Apr 2009 14:10:03 -0400	[thread overview]
Message-ID: <1239300603.31257.26.camel@think.oraclecorp.com> (raw)
In-Reply-To: <20090409174932.GD26552@atrey.karlin.mff.cuni.cz>

On Thu, 2009-04-09 at 19:49 +0200, Jan Kara wrote:
> > On Thu, 2009-04-09 at 08:49 -0700, Linus Torvalds wrote:
> > > 
> > > On Wed, 8 Apr 2009, Theodore Ts'o wrote:
> > > > 
> > > > One of these patches fixes a performance regression caused by a64c8610,
> > > > which unplugged the write queue after every page write.  Now that Jens
> > > > added WRITE_SYNC_PLUG.the patch causes us to use it instead of
> > > > WRITE_SYNC, to avoid the implicit unplugging.  These patches also seem
> > > > to further improbve ext3 latency, especially during the "sync" command
> > > > in Linus's write-big-file-and-sync workload.
> > > 
> > > So here's a question and a untested _conceptual_ patch. 
> > > 
> > > The kind of writeback mode I'd personally prefer would be more of a 
> > > mixture of the current "data=writeback" and "data=ordered" modes, with 
> > > something of the best of both worlds. I'd like the data writeback to get 
> > > _started_ when the journal is written to disk, but I'd like it to not 
> > > block journal updates.
> > > 
> > > IOW, it wouldn't be "strictly ordered", but at the same time it wouldn't 
> > > be totally unordered either.
> > > 
> > 
> > I started working on the xfs style i_size updates last night, and here's
> > my current (most definitely broken) proof of concept.  I call it
> > data=guarded.
> > 
> > In guarded mode the on disk i_size is not updated until after the data
> > writes are complete.  I've got a per FS work queue and I'm abusing
> > bh->b_private as a list pointer.  So, what happens is:
> > 
> > * writepage sets up the buffer with the guarded end_io handler
> > 
> > * The end_io handler puts the buffer onto the per-sb list of guarded
> > buffers and then it kicks the work queue
> > 
> > * The workqueue updates the on disk i_size to the min of the end of the
> > buffer or the in-memory i_size, and then it logs the inode.
> > 
> > * Then the regular async bh end_io handler is called to end writeback on
> > the page.
> > 
> > One big gotcha is that we starting a transaction while a page is
> > writeback.  It means that anyone who waits for writeback to finish on
> > the datapage with a transaction running could deadlock against the work
> > queue func trying to start a transaction.
>   For ext3 I don't think anyone waits for PageWriteback with a
> transaction open. We definitely don't do it from ext3 code and generic
> code does usually sequence like:
>   lock_page(page);
>   ...
>   wait_on_page_writeback(page)
> 
>   and because lock ordering is page_lock < transaction start, we
> shouldn't have transaction open at that point.
>   But with ext4 it may be different - there, the lock ordering is
> transaction start > page_lock and so above code could well have
> transaction started.
>   Wouldn't it actually be better to update i_size when the page is
> fully written out after we clear PG_writeback as you write below?

It would, but then we have to take a ref on the inode and risk iput
leading to inode deletion in the handler that is supposed to be doing IO
completion.  It's icky either way ;)

The nice part with doing it before writeback is that we know that when
we wait for page writeback, we don't also have to wait for i_size update
to be finished.

If we go this route and it gets copied to ext4, we can weigh our options
I guess.

>   One thing which does not seem to be handled is that your code can
> happily race with truncate. So IO completion could reset i_size which
> has been just set by truncate. And I'm not sure how to handle this
> effectively. Generally I'm not sure how much this is going to cost...
> 

Yeah, I was worried about that.  What ends up happening is the setattr
call sets the disk i_size and then calls inode_setattr, who calls
vmtruncate who actually waits on the writeback.

Then, we wander into the ext3 truncate who resets disk i_size down
again.  It's a pretty strange window of updates, but I was thinking it
would work to cut down i_size, wait on IO, then cut it down again in
setattr.

Once we wait on all IO past the new in-memory i_size, writepage won't
send any more down.  So updating disk i_size after the wait should be
enough.

-chris



  reply	other threads:[~2009-04-09 18:11 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-08 23:40 [GIT PULL] Ext3 latency fixes Theodore Ts'o
2009-04-09 15:49 ` Linus Torvalds
2009-04-09 16:23   ` Chris Mason
2009-04-09 17:49     ` Jan Kara
2009-04-09 18:10       ` Chris Mason [this message]
2009-04-09 19:04         ` Jan Kara
2009-04-14  2:29           ` [RFC] ext3 data=guarded was " Chris Mason
2009-04-09 17:36   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2009-04-03  7:01 Theodore Ts'o
2009-04-03 18:24 ` Linus Torvalds
2009-04-03 18:47   ` Jens Axboe
2009-04-03 19:13     ` Theodore Tso
2009-04-03 21:01     ` Chris Mason
2009-04-03 19:02   ` Linus Torvalds
2009-04-03 20:41     ` Linus Torvalds
2009-04-04 13:57       ` Theodore Tso
2009-04-04 15:16         ` Jens Axboe
2009-04-04 15:57           ` Linus Torvalds
2009-04-04 16:06             ` Linus Torvalds
2009-04-04 17:36               ` Jens Axboe
2009-04-04 17:34             ` Jens Axboe
2009-04-04 17:44               ` Linus Torvalds
2009-04-04 18:00                 ` Trenton D. Adams
2009-04-04 18:01                 ` Jens Axboe
2009-04-04 18:10                   ` Linus Torvalds
2009-04-04 23:22                   ` Theodore Tso
2009-04-04 23:33                     ` Arjan van de Ven
2009-04-05  0:10                       ` Theodore Tso
2009-04-05 15:05                         ` Arjan van de Ven
2009-04-05 17:01                         ` Linus Torvalds
2009-04-05 17:15                           ` Mark Lord
2009-04-05 20:57                             ` Jeff Garzik
2009-04-05 23:48                               ` Arjan van de Ven
2009-04-06  2:32                                 ` Mark Lord
2009-04-06  5:47                                 ` Jeff Garzik
2009-04-07 18:18                                   ` Linus Torvalds
2009-04-07 18:22                                     ` Linus Torvalds
2009-04-06  8:13                             ` Jens Axboe
2009-04-05 18:56                           ` Arjan van de Ven
2009-04-05 19:34                             ` Linus Torvalds
2009-04-05 20:06                               ` Arjan van de Ven
2009-04-06  6:25                               ` Jens Axboe
2009-04-06  6:05                           ` Theodore Tso
2009-04-06  6:23                           ` Jens Axboe
2009-04-06  8:16                       ` Jens Axboe
2009-04-06 14:48                         ` Linus Torvalds
2009-04-06 15:09                           ` Jens Axboe
2009-04-06  6:15                     ` Jens Axboe
2009-04-04 20:18               ` Ingo Molnar
2009-04-06 21:50                 ` Lennart Sorensen
2009-04-07 13:31                   ` Mark Lord
2009-04-07 14:48                     ` Lennart Sorensen
2009-04-07 19:21                       ` Mark Lord
2009-04-07 19:57                         ` Lennart Sorensen
2009-04-04 20:56               ` Arjan van de Ven
2009-04-06  7:06                 ` Jens Axboe
2009-04-07 15:39             ` Indan Zupancic
2009-04-04 19:18           ` Theodore Tso
2009-04-06  8:12             ` Jens Axboe
2009-04-04 22:13         ` Linus Torvalds
2009-04-04 22:19           ` Linus Torvalds
2009-04-05  0:20           ` Theodore Tso
2009-04-03 19:54   ` Theodore Tso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1239300603.31257.26.camel@think.oraclecorp.com \
    --to=chris.mason@oracle.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).