linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
	Trond Myklebust <trondmy@hammerspace.com>,
	"zohar@linux.ibm.com" <zohar@linux.ibm.com>,
	"djwong@kernel.org" <djwong@kernel.org>,
	"xiubli@redhat.com" <xiubli@redhat.com>,
	"brauner@kernel.org" <brauner@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"fweimer@redhat.com" <fweimer@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	"linux-man@vger.kernel.org" <linux-man@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"tytso@mit.edu" <tytso@mit.edu>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"jack@suse.cz" <jack@suse.cz>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"lczerner@redhat.com" <lczerner@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field
Date: Tue, 13 Sep 2022 11:15:18 +1000	[thread overview]
Message-ID: <20220913011518.GE3600936@dread.disaster.area> (raw)
In-Reply-To: <166302538820.30452.7783524836504548113@noble.neil.brown.name>

On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote:
> On Mon, 12 Sep 2022, Jeff Layton wrote:
> > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > > This could be expensive.
> > > 
> > > There is not currently any locking around O_DIRECT writes.  You cannot
> > > synchronise with them.
> > > 
> > 
> > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> > inode_lock_shared across the I/O. That was why patch #8 takes the
> > inode_lock (exclusive) across the getattr.
> 
> Looking at ext4_dio_write_iter() it certain does take
> inode_lock_shared() before starting the write and in some cases it
> requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for
> the write to complete.  But not in all cases.
> So I don't think it always holds the shared lock across all direct IO.

To serialise against dio writes, one must:

	// Lock the inode exclusively to block new DIO submissions
	inode_lock(inode);

	// Wait for all in flight DIO reads and writes to complete
	inode_dio_wait(inode);

This is how truncate, fallocate, etc serialise against AIO+DIO which
do not hold the inode lock across the entire IO. These have to
serialise aginst DIO reads, too, because we can't have IO in
progress over a range of the file that we are about to free....

> > Taking inode_lock_shared is sufficient to block out buffered and DAX
> > writes. DIO writes sometimes only take the shared lock (e.g. when the
> > data is already properly aligned). If we want to ensure the getattr
> > doesn't run while _any_ writes are running, we'd need the exclusive
> > lock.
> 
> But the exclusive lock is bad for scalability.

Serilisation against DIO is -expensive- and -slow-. It's not a
solution for what is supposed to be a fast unlocked read-only
operation like statx().

> > Maybe that's overkill, though it seems like we could have a race like
> > this without taking inode_lock across the getattr:
> > 
> > reader				writer
> > -----------------------------------------------------------------
> > 				i_version++
> > getattr
> > read
> > 				DIO write to backing store
> > 
> 
> This is why I keep saying that the i_version increment must be after the
> write, not before it.

Sure, but that ignores the reason why we actually need to bump
i_version *before* we submit a DIO write. DIO write invalidates the
page cache over the range of the write, so any racing read will
re-populate the page cache during the DIO write.

Hence buffered reads can return before the DIO write has completed,
and the contents of the read can contain, none, some or all of the
contents of the DIO write. Hence i_version has to be incremented
before the DIO write is submitted so that racing getattrs will
indicate that the local caches have been invalidated and that data
needs to be refetched.

But, yes, to actually be safe here, we *also* should be bumping
i_version on DIO write on DIO write completion so that racing
i_version and data reads that occur *after* the initial i_version
bump are invalidated immediately.

IOWs, to avoid getattr/read races missing stale data invalidations
during DIO writes, we really need to bump i_version both _before and
after_ DIO write submission.

It's corner cases like this where "i_version should only be bumped
when ctime changes" fails completely. i.e. there are concurrent IO
situations which can only really be handled correctly by bumping
i_version whenever either in-memory and/or on-disk persistent data/
metadata state changes occur.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2022-09-13  1:15 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-07 11:16 [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field Jeff Layton
2022-09-07 11:37 ` NeilBrown
2022-09-07 12:20   ` J. Bruce Fields
2022-09-07 12:58     ` Jeff Layton
2022-09-07 12:47   ` Jeff Layton
2022-09-07 12:52     ` J. Bruce Fields
2022-09-07 13:12       ` Jeff Layton
2022-09-07 13:51         ` Jan Kara
2022-09-07 14:43           ` Jeff Layton
2022-09-08  0:44           ` NeilBrown
2022-09-08  8:33             ` Jan Kara
2022-09-08 15:21               ` Theodore Ts'o
2022-09-08 15:44                 ` J. Bruce Fields
2022-09-08 15:44                 ` Jeff Layton
2022-09-08 15:56                   ` J. Bruce Fields
2022-09-08 16:15                     ` Chuck Lever III
2022-09-08 17:40                     ` Jeff Layton
2022-09-08 18:22                       ` J. Bruce Fields
2022-09-08 19:07                         ` Jeff Layton
2022-09-08 23:01                           ` NeilBrown
2022-09-08 23:23                             ` Jeff Layton
2022-09-08 23:45                               ` NeilBrown
2022-09-09 15:45                           ` J. Bruce Fields
2022-09-09 16:36                             ` Jeff Layton
2022-09-10 14:56                               ` J. Bruce Fields
2022-09-12 11:42                                 ` Jeff Layton
2022-09-12 12:13                                   ` Florian Weimer
2022-09-12 12:55                                     ` Jeff Layton
2022-09-12 13:20                                       ` Florian Weimer
2022-09-12 13:49                                         ` Jeff Layton
2022-09-12 13:51                                       ` J. Bruce Fields
2022-09-12 14:02                                         ` Jeff Layton
2022-09-12 14:47                                           ` J. Bruce Fields
2022-09-12 14:15                                         ` Trond Myklebust
2022-09-12 14:50                                           ` J. Bruce Fields
2022-09-12 14:56                                             ` Trond Myklebust
2022-09-12 15:32                                               ` Trond Myklebust
2022-09-12 15:49                                                 ` Jeff Layton
2022-09-12 12:54                                   ` J. Bruce Fields
2022-09-12 12:59                                     ` Jeff Layton
2022-09-13  0:29                                   ` John Stoffel
2022-09-13  0:41                                   ` Dave Chinner
2022-09-13  1:49                                     ` NeilBrown
2022-09-13  2:41                                       ` Dave Chinner
2022-09-13  3:30                                         ` NeilBrown
2022-09-13  9:38                                           ` Theodore Ts'o
2022-09-13 19:02                                       ` J. Bruce Fields
2022-09-13 23:19                                         ` NeilBrown
2022-09-14  0:08                                           ` J. Bruce Fields
2022-09-09 20:34                           ` John Stoffel
2022-09-10 22:13                           ` NeilBrown
2022-09-12 10:43                             ` Jeff Layton
2022-09-12 13:42                             ` J. Bruce Fields
2022-09-12 23:14                               ` NeilBrown
2022-09-15 14:06                                 ` J. Bruce Fields
2022-09-15 15:08                                   ` Trond Myklebust
2022-09-15 16:45                                     ` Jeff Layton
2022-09-15 17:49                                       ` Trond Myklebust
2022-09-15 18:11                                         ` Jeff Layton
2022-09-15 19:03                                           ` Trond Myklebust
2022-09-15 19:25                                             ` Jeff Layton
2022-09-15 22:23                                               ` NeilBrown
2022-09-16  6:54                                                 ` Theodore Ts'o
2022-09-16 11:36                                                   ` Jeff Layton
2022-09-16 15:11                                                     ` Jeff Layton
2022-09-18 23:53                                                       ` Dave Chinner
2022-09-19 13:13                                                         ` Jeff Layton
2022-09-20  0:16                                                           ` Dave Chinner
2022-09-20 10:26                                                             ` Jeff Layton
2022-09-21  0:00                                                               ` Dave Chinner
2022-09-21 10:33                                                                 ` Jeff Layton
2022-09-21 21:41                                                                   ` Dave Chinner
2022-09-22 10:18                                                                     ` Jeff Layton
2022-09-22 20:18                                                                       ` Jeff Layton
2022-09-23  9:56                                                                         ` Jan Kara
2022-09-23 10:19                                                                           ` Jeff Layton
2022-09-23 13:44                                                                           ` Trond Myklebust
2022-09-23 13:50                                                                             ` Jeff Layton
2022-09-23 14:58                                                                               ` Frank Filz
2022-09-26 22:43                                                                               ` NeilBrown
2022-09-27 11:14                                                                                 ` Jeff Layton
2022-09-27 13:18                                                                                 ` Jeff Layton
2022-09-15 15:41                                   ` Jeff Layton
2022-09-15 22:42                                     ` NeilBrown
2022-09-16 11:32                                       ` Jeff Layton
2022-09-09 12:11                       ` Theodore Ts'o
2022-09-09 12:47                         ` Jeff Layton
2022-09-09 13:48                           ` Theodore Ts'o
2022-09-09 14:43                             ` Jeff Layton
2022-09-09 14:58                               ` Theodore Ts'o
2022-09-08 22:55                   ` NeilBrown
2022-09-08 23:59                     ` Trond Myklebust
2022-09-09  0:51                       ` NeilBrown
2022-09-09  1:05                         ` Trond Myklebust
2022-09-09  1:07                         ` NeilBrown
2022-09-09  1:10                           ` Trond Myklebust
2022-09-09  2:14                             ` Trond Myklebust
2022-09-09  6:41                               ` NeilBrown
2022-09-10 12:39                                 ` Jeff Layton
2022-09-10 22:53                                   ` NeilBrown
2022-09-12 10:25                                     ` Jeff Layton
2022-09-12 23:29                                       ` NeilBrown
2022-09-13  1:15                                         ` Dave Chinner [this message]
2022-09-13  1:41                                           ` NeilBrown
2022-09-13 19:01                                           ` Jeff Layton
2022-09-13 23:24                                             ` NeilBrown
2022-09-14 11:51                                               ` Jeff Layton
2022-09-14 22:45                                                 ` NeilBrown
2022-09-14 23:02                                                   ` NeilBrown
2022-09-08 22:40                 ` NeilBrown
2022-09-07 13:55         ` Trond Myklebust
2022-09-07 14:05           ` Jeff Layton
2022-09-07 15:04             ` Trond Myklebust
2022-09-07 15:11               ` Jeff Layton
2022-09-08  0:40             ` NeilBrown
2022-09-08 11:34               ` Jeff Layton
2022-09-08 22:29                 ` NeilBrown
2022-09-09 11:53                   ` Jeff Layton
2022-09-10 22:58                     ` NeilBrown
2022-09-10 19:46               ` Al Viro
2022-09-10 23:00                 ` NeilBrown
2022-09-08  0:31           ` NeilBrown
2022-09-08  0:41             ` Trond Myklebust
2022-09-08  0:53               ` NeilBrown
2022-09-08 11:37               ` Jeff Layton
2022-09-08 12:40                 ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220913011518.GE3600936@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=bfields@fieldses.org \
    --cc=brauner@kernel.org \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=djwong@kernel.org \
    --cc=fweimer@redhat.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=lczerner@redhat.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=trondmy@hammerspace.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xiubli@redhat.com \
    --cc=zohar@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).