From: Vladislav Bolkhovitin <vst@vlnb.net>
To: Christoph Hellwig <hch@lst.de>
Cc: Ted Ts'o <tytso@mit.edu>, Andreas Dilger <adilger@dilger.ca>,
Ric Wheeler <rwheeler@redhat.com>, Tejun Heo <tj@kernel.org>,
Vivek Goyal <vgoyal@redhat.com>, Jan Kara <jack@suse.cz>,
jaxboe@fusionio.com, James.Bottomley@suse.de,
linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org,
chris.mason@oracle.com, swhiteho@redhat.com,
konishi.ryusuke@lab.ntt.co.jp
Subject: Re: [RFC] relaxed barrier semantics
Date: Mon, 02 Aug 2010 23:01:53 +0400
Message-ID: <4C571621.8070007@vlnb.net>
In-Reply-To: <20100730142025.GA29341@lst.de>
Christoph Hellwig, on 07/30/2010 06:20 PM wrote:
> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote:
>> Yes, but why not go a step further and completely eliminate the
>> waiting/draining by using ORDERED requests? Current advanced storage
>> hardware allows that.
>
> There are a few cases where we could do that - the fsync without metadata
> changes above would be the prime example. But there's a lot lower
> hanging fruit until we get to the point where it's worth trying.

Yes, but since interface and file system updates are coming anyway,
why not design the interface now and then gradually fill in the
implementation?

Barrier discussions are always very heated. That strongly suggests the
current approach fails to satisfy a great many people, from FS
developers to storage vendors and users. I believe this is because the
whole barrier concept is not natural, hence there is so much trouble
fitting it to real life. Apparently, the approach needs some redesign
to reach a more acceptable form.

IMHO, all that is needed is:

1. Allow requests to be optionally combined into groups, and allow
optional properties to be set on each group: caching and ordering modes
(see below). Each group would reflect a higher-level operation.

2. Allow request groups to be chained. Each chain would reflect an order
dependency between groups, i.e. between higher-level operations.

This interface is a natural extension of the current interface, and
natural for storage too. In the extreme case, an empty group could be
implemented as a barrier, although, since there would be no dependencies
between unchained groups, they could be freely reordered relative to
each other.

We would need request grouping sooner or later anyway, because without
it, it is impossible to implement selective cache flushing instead of
flushing the cache for the whole device, as is done currently. This is
a highly demanded feature, especially for shared and distributed
devices.
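
The grouping and chaining rules above can be sketched as a small
Python model. This is only an illustration under assumptions: the names
RequestGroup, after and may_reorder are hypothetical and do not
correspond to any existing block layer interface. A chain is modelled
as a link from a group to the group it must follow, and two groups may
be reordered only if neither is chained after the other.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class RequestGroup:
    """Hypothetical model: requests forming one higher-level operation."""
    name: str
    requests: List[str] = field(default_factory=list)
    # Predecessor in the chain; None means the group is not chained.
    after: Optional["RequestGroup"] = None

    def ancestors(self) -> Iterator["RequestGroup"]:
        """Yield every group that must complete before this one."""
        group = self.after
        while group is not None:
            yield group
            group = group.after


def may_reorder(a: RequestGroup, b: RequestGroup) -> bool:
    """Groups may be freely reordered unless one is chained after the other."""
    return a is not b and a not in b.ancestors() and b not in a.ancestors()


# Usage: "log" is chained after "data"; "other" is independent of both.
data = RequestGroup("data", ["write A", "write B"])
log = RequestGroup("log", ["write log buffer"], after=data)
other = RequestGroup("other", ["write C"])
print(may_reorder(data, log), may_reorder(data, other))
```

In this model an empty group placed in a chain carries no requests of
its own and only expresses the dependency, which is the barrier-like
extreme case mentioned above.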

The caching properties would be:

- None (default) - no cache flushing needed.

- "Flush after each request". It would be translated to FUA on write
back devices with FUA, to a (write, sync_cache) sequence on write back
devices without FUA, and to nothing on write through devices.

- "Flush at once after all finished". It would be translated to one or
more SYNC_CACHE commands, executed after all requests in the group are
done and syncing _only_ what was modified in the group, not the whole
device as now.
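
As a rough illustration of the translation rules above, here is a
minimal Python sketch. All names (CacheMode, Device, translate) are
hypothetical. Two details are assumptions not stated above: on a write
through device "flush at once" is assumed to need no flush either, and
"one or more SYNC_CACHE commands" is modelled as one SYNC_CACHE per
written range.

```python
from enum import Enum


class CacheMode(Enum):
    NONE = "none"
    FLUSH_EACH = "flush after each request"
    FLUSH_AT_END = "flush at once after all finished"


class Device(Enum):
    WRITE_THROUGH = "write through"
    WRITE_BACK = "write back without FUA"
    WRITE_BACK_FUA = "write back with FUA"


def translate(writes, mode, device):
    """Return the (command, range) list a group of writes expands to."""
    plain = [("WRITE", w) for w in writes]
    # Assumption: a write through device never needs an explicit flush.
    if mode is CacheMode.NONE or device is Device.WRITE_THROUGH:
        return plain
    if mode is CacheMode.FLUSH_EACH:
        if device is Device.WRITE_BACK_FUA:
            return [("WRITE_FUA", w) for w in writes]
        commands = []
        for w in writes:
            commands += [("WRITE", w), ("SYNC_CACHE", w)]
        return commands
    # FLUSH_AT_END: flush only the ranges this group modified, after all
    # writes. Assumption: one SYNC_CACHE per written range.
    return plain + [("SYNC_CACHE", w) for w in writes]


# Usage: two writes on a write back device without FUA.
for command in translate(["blk 0-7", "blk 64-71"],
                         CacheMode.FLUSH_EACH, Device.WRITE_BACK):
    print(command)
```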

The order properties would be:

- None (default) - there is no order dependency between requests in
the group.

- ORDERED - all requests in the group must be executed in order.

Additionally, if the backend device supports ORDERED commands, this
facility would be used to eliminate extra queue draining. For instance,
"flush after each request" on WB devices without FUA would be a sequence
of ORDERED commands: [(write, sync_cache) ... (write, sync_cache) wait].
Compare that to [(write, wait, sync_cache, wait) ... (write, wait,
sync_cache, wait)], which is needed to achieve the same without ORDERED
command support.
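
The difference between the two sequences can be made concrete with a
short Python sketch that builds both and counts the waits. The function
name flush_each_sequence is hypothetical, and the model covers only the
"flush after each request" case on a write back device without FUA.

```python
def flush_each_sequence(n_writes, ordered_supported):
    """Build the submitter-side sequence for n_writes flushed writes."""
    sequence = []
    for _ in range(n_writes):
        if ordered_supported:
            # The device keeps ORDERED commands in order, so the
            # submitter does not wait between them.
            sequence += ["write", "sync_cache"]
        else:
            # Without ORDERED support the submitter has to wait after
            # every command to enforce the order itself.
            sequence += ["write", "wait", "sync_cache", "wait"]
    if ordered_supported:
        # A single wait at the very end for the whole sequence.
        sequence.append("wait")
    return sequence


# Usage: count the waits for three writes in each case.
print(flush_each_sequence(3, True).count("wait"))
print(flush_each_sequence(3, False).count("wait"))
```

For n writes this model needs one wait with ORDERED commands and 2n
waits without them.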

For instance, your example of the fsync in XFS would be:

1) Write out all the data blocks as a group with no caching and ordering
properties.

2) Wait for that group to finish

3) Propagate any I/O error to the inode so we can pick them up

4) Update the inode size in the shadow in-memory structure

5) Start a transaction to log the inode size in a new group with
properties "Flush at once after all finished" and no ordering (or
ORDERED, if necessary; it isn't clear from your text).

6) Write out a log buffer containing the inode and btree updates in a
new group, chained after the group from (5), with the necessary cache
flushing and ordering properties.

I believe it can be implemented reasonably simply and efficiently,
including at the I/O scheduler level, and I have some ideas for that.

Just my 5c from the storage vendors' side.

> But in most cases we don't just drain an imaginary queue but actually
> need to modify software state before finishing one class of I/O and
> submitting the next.
>
> Again, take the example of fsync, but this time we have actually
> extended the file and need to log an inode size update, as well
> as a modification to the btree blocks.
>
> Now the fsync in XFS looks like this:
>
> 1) write out all the data blocks using WRITE
> 2) wait for these to finish
> 3) propagate any I/O error to the inode so we can pick them up
> 4) update the inode size in the shadow in-memory structure
> 5) start a transaction to log the inode size
> 6) flush the write cache to make sure the data really is on disk

There should be a "6.1) wait for it to finish" here, which can be
eliminated if the requests are sent ordered, correct?

> 7) write out a log buffer containing the inode and btree updates
> 8) if the FUA bit is not supported flush the cache again
>
> and yes, the flush in 6) is important so that we don't happen
> to log the inode size update before all data has made it to disk
> in case the cache flush in 8) is interrupted