Linux-Fsdevel Archive on
 help / color / Atom feed
From: Dave Chinner <>
To: Dan Williams <>
Cc: Christoph Hellwig <>,
	linux-fsdevel <>,
	linux-nvdimm <>,
	XFS Developers <>
Subject: Re: xfs: untangle the direct I/O and DAX path, fix DAX locking
Date: Fri, 24 Jun 2016 17:13:18 +1000
Message-ID: <20160624071318.GE12670@dastard> (raw)
In-Reply-To: <>

On Thu, Jun 23, 2016 at 06:14:47PM -0700, Dan Williams wrote:
> On Thu, Jun 23, 2016 at 4:24 PM, Dave Chinner <> wrote:
> > On Wed, Jun 22, 2016 at 05:27:08PM +0200, Christoph Hellwig wrote:
> >> The last patch is what started the series:  XFS currently uses the
> >> direct I/O locking strategy for DAX because DAX was overloaded onto
> >> the direct I/O path.  For XFS this means that we only take a shared
> >> inode lock instead of the normal exclusive one for writes IFF they
> >> are properly aligned.  While this is fine for O_DIRECT which requires
> >> explicit opt-in from the application it's not fine for DAX where we'll
> >> suddenly lose expected and required synchronization of the file system
> >> happens to use DAX undeneath.
> >
> > Except we did that *intentionally* - by definition there is no
> > cache to bypass with DAX and so all IO is "direct". That, combined
> > with the fact that all Linux filesystems except XFS break the POSIX
> > exclusive writer rule you are quoting to begin with, it seemed
> > pointless to enforce it for DAX....
> If we're going to be strict about POSIX fsync() semantics we should be
> strict about this exclusive write semantic.  In other words why is it
> ok to loosen one and not the other, if application compatibility is
> the concern?

This is a POSIX compliant fsync() implementation:

int fsync(int fd)
	return 0;

That's not what we require from Linux filesystems and storage
subsystems.  Our data integrity requirements are not actually
defined by POSIX - we go way beyond what POSIX actually requires us
to implement. If all we cared about is POSIX, then the above is how
we'd implement fsync() simply because it's fast. Everyone implements
fsync differently, so portable applications can't actually rely on
the POSIX standard fsync() implementation to keep their data safe...

IOWs, we don't give a shit about what POSIX says about fsync
because, in practice, it's useless. Instead, we implement something
that *works* and provides users with real data integrity guarantees.

If you like the POSIX specs for data integrity, go use
sync_file_range() - it doesn't guarantee data integrity, just like
posix compliant fsync(). And yes, applications that use
sync_file_range() are known to lose data when systems crash...

The POSIX exclusive write requirement is a different case. No linux
filesystem except XFS has ever met that requirement (in 20 something
years), yet I don't see applications falling over with corrupt data
from non-exclusive writes all the time, nor do I see application
developers shouting at us to provide it. i.e. reality tells us this
isn't a POSIX behaviour that applications rely on because everyone
implements it differently.

So, like fsync(), if everyone implements it differently,
applications don't rely on posix smeantics to serialise access to
overlapping ranges of a file. And if that's the case, then
why even bother exclusive write locking in the filesystem when there
is no need for serialisation of page cache contents?

We don't do it because POSIX says so, because we already ignore what
POSIX says about this topic for technical reasons. So why should we
make DAX conform to POSIX exclusive writer behaviour when DAX is
being specifically aimed at high performance, highly concurrent
applications where exclusive writer behaviour will cause major
performance issues?


Dave Chinner

  reply index

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-22 15:27 Christoph Hellwig
2016-06-22 15:27 ` [PATCH 1/8] xfs: don't pass ioflags around in the ioctl path Christoph Hellwig
2016-06-22 15:27 ` [PATCH 2/8] xfs: kill ioflags Christoph Hellwig
2016-06-22 15:27 ` [PATCH 3/8] xfs: remove s_maxbytes enforcement in xfs_file_read_iter Christoph Hellwig
2016-06-22 15:27 ` [PATCH 4/8] xfs: split xfs_file_read_iter into buffered and direct I/O helpers Christoph Hellwig
2016-06-22 15:27 ` [PATCH 5/8] xfs: stop using generic_file_read_iter for direct I/O Christoph Hellwig
2016-06-22 15:27 ` [PATCH 6/8] xfs: direct calls in the direct I/O path Christoph Hellwig
2016-06-22 15:27 ` [PATCH 7/8] xfs: split direct I/O and DAX path Christoph Hellwig
2016-09-29  2:53   ` Darrick J. Wong
2016-09-29  8:38     ` aio completions vs file_accessed race, was: " Christoph Hellwig
2016-09-29 20:18       ` Christoph Hellwig
2016-09-29 20:18         ` Christoph Hellwig
2016-09-29 20:33           ` Darrick J. Wong
2016-06-22 15:27 ` [PATCH 8/8] xfs: fix locking for DAX writes Christoph Hellwig
2016-06-23 14:22   ` Boaz Harrosh
2016-06-23 23:24 ` xfs: untangle the direct I/O and DAX path, fix DAX locking Dave Chinner
2016-06-24  1:14   ` Dan Williams
2016-06-24  7:13     ` Dave Chinner [this message]
2016-06-24  7:31       ` Christoph Hellwig
2016-06-24  7:26   ` Christoph Hellwig
2016-06-24 23:00     ` Dave Chinner
2016-06-28 13:10       ` Christoph Hellwig
2016-06-28 13:27         ` Boaz Harrosh
2016-06-28 13:39           ` Christoph Hellwig
2016-06-28 13:56             ` Boaz Harrosh
2016-06-28 15:39               ` Christoph Hellwig
2016-06-29 12:23                 ` Boaz Harrosh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160624071318.GE12670@dastard \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on

Archives are clonable:
	git clone --mirror linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ \
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone