All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jamie Lokier <jamie@shareable.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <jens.axboe@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: Re: O_DIRECT and barriers
Date: Fri, 21 Aug 2009 16:24:59 +0100	[thread overview]
Message-ID: <20090821152459.GC6929@shareable.org> (raw)
In-Reply-To: <20090821142635.GB30617@infradead.org>

Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 02:54:03PM +0100, Jamie Lokier wrote:
> > I've been thinking about this too, and for optimal performance with
> > VMs and also with databases, I think FUA is too strong.  (It's also
> > too weak, on drives which don't have FUA).
> 
> Why is FUA too strong?

In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers.  Others report
similar results.  This is with disks that don't have NCQ; good NCQ may
be better.

Using FUA for all writes should be equivalent to writing with write
cache disabled.

A journalling filesystem or database tends to write like this:

   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE

When a guest does that, for integrity it can be mapped to this on the
host with FUA:

   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA

or

   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) CACHE FLUSH
   (host) WRITE
   (host) CACHE FLUSH 
   (host) WRITE
   (host) WRITE
   (host) WRITE

We know from measurements that disabling the disk write cache is much
slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.

It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from.  Integrity should not be a surprise when the
application requests it.

> > Fortunately there's already a sensible API for both: fdatasync (and
> > aio_fsync) to mean flush, and O_DSYNC (or inferred from
> > flush-after-one-write) to mean FUA.
> 
> I thought about this alot .  It would be sensible to only require
> the FUA semantics if O_SYNC is specified.  But from looking around at
> users of O_DIRECT no one seems to actually specify O_SYNC with it.

O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).

I notice that all applications I looked at which use O_DIRECT don't
attempt to determine when O_DIRECT will definitely result in direct
writes; they simpy assume it can be used as a substituted for O_SYNC
or O_DSYNC, as long as you follow the alignment rules.  Generally they
leave it to the user to configure what they want, and often don't
explain the drive integrity issue, except to say "depends on the OS,
your mileage may vary, we can do nothing about it".

Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.

On a related note,
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
says:

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable code.

> And on Linux where O_SYNC really means O_DYSNC that's pretty sensible -
> if O_DIRECT bypasses the filesystem cache there is nothing else
> left to sync for a non-extending write.

Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
Ugh, how messy.

> That is until those pesky disk
> write back caches come into play that no application writer wants or
> should have to understand.

As far as I can tell, they generally go out of their way to avoid
understanding it, except as a vaguely uncomfortable awareness and pass
the problem on to the application's user.

Unfortunately just disabling the disk cache for O_DIRECT would make
it's performance drop significantly, otherwise I'd say go for it.

> > It turns out that applications needing integrity must use fdatasync or
> > O_DSYNC (or O_SYNfC) *already* with O_DIRECT, because the kernel may
> > choose to use buffered writes at any time, with no signal to the
> > application.
> 
> The fallback was a relatively recent addition to the O_DIRECT semantics
> for broken filesystems that can't handle holes very well.  Fortunately
> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> semantics for that already.

Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
O_DIRECT either? :-)

-- Jamie

  reply	other threads:[~2009-08-21 15:25 UTC|newest]

Thread overview: 139+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-08-19 16:10   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:11   ` Christoph Hellwig
2009-08-19 16:11     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-20 12:04     ` Jan Kara
2009-08-20 12:04       ` [Ocfs2-devel] " Jan Kara
2009-08-19 20:22   ` Evgeniy Polyakov
2009-08-19 20:22     ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-20 12:31     ` Jan Kara
2009-08-20 12:31       ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:30       ` Evgeniy Polyakov
2009-08-20 13:30         ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-20 13:52         ` Jan Kara
2009-08-20 13:52           ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:58           ` Evgeniy Polyakov
2009-08-20 13:58             ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:18   ` Christoph Hellwig
2009-08-19 16:18     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-19 16:18     ` Christoph Hellwig
2009-08-20 13:31     ` Jan Kara
2009-08-20 13:31       ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:31       ` Jan Kara
2009-08-19 16:04 ` [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-08-19 16:04 ` [PATCH 05/17] ocfs2: " Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
2009-08-19 16:21   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:26   ` Christoph Hellwig
2009-08-19 16:26     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-19 16:26     ` Christoph Hellwig
2009-08-20 12:15     ` Jan Kara
2009-08-20 12:15       ` [Ocfs2-devel] " Jan Kara
2009-08-20 12:15       ` Jan Kara
2009-08-20 16:27       ` Christoph Hellwig
2009-08-20 16:27         ` [Ocfs2-devel] " Christoph Hellwig
2009-08-20 16:27         ` Christoph Hellwig
2009-08-21 15:23         ` Jan Kara
2009-08-21 15:23           ` [Ocfs2-devel] " Jan Kara
2009-08-21 15:23           ` Jan Kara
2009-08-21 15:32           ` Christoph Hellwig
2009-08-21 15:32             ` [Ocfs2-devel] " Christoph Hellwig
2009-08-21 15:32             ` Christoph Hellwig
2009-08-21 15:48             ` Jan Kara
2009-08-21 15:48               ` [Ocfs2-devel] " Jan Kara
2009-08-21 15:48               ` Jan Kara
2009-08-26 18:22         ` Christoph Hellwig
2009-08-26 18:22           ` [Ocfs2-devel] " Christoph Hellwig
2009-08-26 18:22           ` Christoph Hellwig
2009-08-27  0:04           ` Christoph Hellwig
2009-08-27  0:04             ` [Ocfs2-devel] " Christoph Hellwig
2009-08-27  0:04             ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
2009-08-19 16:04 ` [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-21  1:36   ` Joel Becker
2009-08-21  1:36     ` Joel Becker
2009-08-21 14:30     ` Jan Kara
2009-08-21 14:30       ` Jan Kara
2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:33   ` Christoph Hellwig
2009-08-19 16:33     ` Christoph Hellwig
2009-08-20 12:22     ` Jan Kara
2009-08-20 12:22       ` Jan Kara
2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
2009-08-21 11:40   ` Jens Axboe
2009-08-21 13:54     ` Jamie Lokier
2009-08-21 14:26       ` Christoph Hellwig
2009-08-21 15:24         ` Jamie Lokier [this message]
2009-08-21 17:45           ` Christoph Hellwig
2009-08-21 19:18             ` Ric Wheeler
2009-08-22  0:50             ` Jamie Lokier
2009-08-22  2:19               ` Theodore Tso
2009-08-22  2:31                 ` Theodore Tso
2009-08-24  2:34               ` Christoph Hellwig
2009-08-27 14:34                 ` Jamie Lokier
2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
2009-08-27 17:24                     ` Ulrich Drepper
2009-08-27 17:24                       ` Ulrich Drepper
2009-08-28 15:46                       ` Christoph Hellwig
2009-08-28 16:06                         ` Ulrich Drepper
2009-08-28 16:06                           ` Ulrich Drepper
2009-08-28 16:17                           ` Christoph Hellwig
2009-08-28 16:33                             ` Ulrich Drepper
2009-08-28 16:33                               ` Ulrich Drepper
2009-08-28 16:41                               ` Christoph Hellwig
2009-08-28 20:51                                 ` Ulrich Drepper
2009-08-28 20:51                                   ` Ulrich Drepper
2009-08-28 21:08                                   ` Christoph Hellwig
2009-08-28 21:16                                     ` Trond Myklebust
2009-08-28 21:29                                       ` Christoph Hellwig
2009-08-28 21:43                                         ` Trond Myklebust
2009-08-28 22:39                                           ` Christoph Hellwig
2009-08-30 16:44                                     ` Jamie Lokier
2009-08-28 16:46                               ` Jamie Lokier
2009-08-29  0:59                                 ` Jamie Lokier
2009-08-28 16:44                         ` Jamie Lokier
2009-08-28 16:50                           ` Jamie Lokier
2009-08-28 21:08                           ` Ulrich Drepper
2009-08-28 21:08                             ` Ulrich Drepper
2009-08-30 16:58                             ` Jamie Lokier
2009-08-30 17:48                             ` Jamie Lokier
2009-08-28 23:06                         ` Jamie Lokier
2009-08-28 23:46                           ` Christoph Hellwig
2009-08-21 22:08         ` Theodore Tso
2009-08-21 22:38           ` Joel Becker
2009-08-21 22:45           ` Joel Becker
2009-08-22  2:11             ` Theodore Tso
2009-08-24  2:42               ` Christoph Hellwig
2009-08-24  2:37             ` Christoph Hellwig
2009-08-24  2:37             ` Christoph Hellwig
2009-08-21 22:45           ` Joel Becker
2009-08-22  0:56           ` Jamie Lokier
2009-08-22  2:06             ` Theodore Tso
2009-08-26  6:34           ` Dave Chinner
2009-08-26  6:34           ` Dave Chinner
2009-08-26 15:01             ` Jamie Lokier
2009-08-26 18:47               ` Theodore Tso
2009-08-27 14:50                 ` Jamie Lokier
2009-08-21 14:20     ` Christoph Hellwig
2009-08-21 15:06       ` James Bottomley
2009-08-21 15:23         ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090821152459.GC6929@shareable.org \
    --to=jamie@shareable.org \
    --cc=hch@infradead.org \
    --cc=jens.axboe@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.