All of lore.kernel.org
 help / color / mirror / Atom feed
From: "NeilBrown" <neilb@suse.de>
To: "Jeff Layton" <jlayton@kernel.org>
Cc: "Trond Myklebust" <trondmy@hammerspace.com>,
	"zohar@linux.ibm.com" <zohar@linux.ibm.com>,
	"djwong@kernel.org" <djwong@kernel.org>,
	"xiubli@redhat.com" <xiubli@redhat.com>,
	"brauner@kernel.org" <brauner@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"fweimer@redhat.com" <fweimer@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	"linux-man@vger.kernel.org" <linux-man@vger.kernel.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"tytso@mit.edu" <tytso@mit.edu>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"jack@suse.cz" <jack@suse.cz>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"lczerner@redhat.com" <lczerner@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field
Date: Sun, 11 Sep 2022 08:53:06 +1000	[thread overview]
Message-ID: <166285038617.30452.11636397081493278357@noble.neil.brown.name> (raw)
In-Reply-To: <33d058be862ccc0ccaf959f2841a7e506e51fd1f.camel@kernel.org>

On Sat, 10 Sep 2022, Jeff Layton wrote:
> On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases
> > > > > > > > > > > > > below,
> > > > > > > > > > > > > the
> > > > > > > > > > > > > application reads 'state B' as having occurred if any data was
> > > > > > > > > > > > > committed to disk before the crash.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Application                             Filesystem
> > > > > > > > > > > > > ===========                             =========
> > > > > > > > > > > > > read change attr <- 'state A'
> > > > > > > > > > > > > read data <- 'state A'
> > > > > > > > > > > > >                                         write data -> 'state B'
> > > > > > > > > > > > >                                         <crash>+<reboot>
> > > > > > > > > > > > > read change attr <- 'state B'
> > > > > > > > > > > 
> > > > > > > > > > > The important thing here is to not see 'state A'.  Seeing 'state
> > > > > > > > > > > C'
> > > > > > > > > > > should be acceptable.  Worst case we could merge in wall-clock
> > > > > > > > > > > time
> > > > > > > > > > > of
> > > > > > > > > > > system boot, but the filesystem should be able to be more helpful
> > > > > > > > > > > than
> > > > > > > > > > > that.
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Actually, without the crash+reboot it would still be acceptable to
> > > > > > > > > see
> > > > > > > > > "state A" at the end there - but preferably not for long.
> > > > > > > > > From the NFS perspective, the changeid needs to update by the time
> > > > > > > > > of
> > > > > > > > > a
> > > > > > > > > close or unlock (so it is visible to open or lock), but before that
> > > > > > > > > it
> > > > > > > > > is just best-effort.
> > > > > > > 
> > > > > > > Nope. That will inevitably lead to data corruption, since the
> > > > > > > application might decide to use the data from state A instead of
> > > > > > > revalidating it.
> > > > > > > 
> > > > > 
> > > > > The point is, NFS is not the only potential use case for change
> > > > > attributes. We wouldn't be bothering to discuss statx() if it was.
> > > 
> > > My understanding is that it was primarily a desire to add fstests to
> > > exercise the i_version which motivated the statx extension.
> > > Obviously we should prepare for other uses though.
> > > 
> 
> Mainly. Also, userland nfs servers might also like this for obvious
> reasons. For now though, in the v5 set, I've backed off on trying to
> expose this to userland in favor of trying to just clean up the internal
> implementation.
> 
> I'd still like to expose this via statx if possible, but I don't want to
> get too bogged down in interface design just now as we have Real Bugs to
> fix. That patchset should make it simple to expose it later though.
> 
> > > > > 
> > > > > I could be using O_DIRECT, and all the tricks in order to ensure
> > > > > that
> > > > > my stock broker application (to choose one example) has access
> > > > > to the
> > > > > absolute very latest prices when I'm trying to execute a trade.
> > > > > When the filesystem then says 'the prices haven't changed since
> > > > > your
> > > > > last read because the change attribute on the database file is
> > > > > the
> > > > > same' in response to a statx() request with the
> > > > > AT_STATX_FORCE_SYNC
> > > > > flag set, then why shouldn't my application be able to assume it
> > > > > can
> > > > > serve those prices right out of memory instead of having to go
> > > > > to disk?
> > > 
> > > I would think that such an application would be using inotify rather
> > > than having to poll.  But certainly we should have a clear statement
> > > of
> > > quality-of-service parameters in the documentation.
> > > If we agree that perfect atomicity is what we want to promise, and
> > > that
> > > the cost to the filesystem and the statx call is acceptable, then so
> > > be it.
> > > 
> > > My point wasn't to say that atomicity is bad.  It was that:
> > >  - if the i_version change is visible before the change itself is
> > >    visible, then that is a correctness problem.
> > >  - if the i_version change is only visible some time after the
> > > change
> > >    itself is visible, then that is a quality-of-service issue.
> > > I cannot see any room for debating the first.  I do see some room to
> > > debate the second.
> > > 
> > > Cached writes, directory ops, and attribute changes are, I think,
> > > easy
> > > enough to provide truly atomic i_version updates with the change
> > > being
> > > visible.
> > > 
> > > Changes to a shared memory-mapped files is probably the hardest to
> > > provide timely i_version updates for.  We might want to document an
> > > explicit exception for those.  Alternately each request for
> > > i_version
> > > would need to find all pages that are writable, remap them read-only
> > > to
> > > catch future writes, then update i_version if any were writable
> > > (i.e.
> > > ->mkwrite had been called).  That is the only way I can think of to
> > > provide atomicity.
> > > 
> 
> I don't think we really want to make i_version bumps that expensive.
> Documenting that you can't expect perfect consistency vs. mmap with NFS
> seems like the best thing to do. We do our best, but that sort of
> synchronization requires real locking.
> 
> > > O_DIRECT writes are a little easier than mmapped files.  I suspect we
> > > should update the i_version once the device reports that the write is
> > > complete, but a parallel reader could have seem some of the write before
> > > that moment.  True atomicity could only be provided by taking some
> > > exclusive lock that blocked all O_DIRECT writes.  Jeff seems to be
> > > suggesting this, but I doubt the stock broker application would be
> > > willing to make the call in that case.  I don't think I would either.
> 
> Well, only blocked for long enough to run the getattr. Granted, with a
> slow underlying filesystem that can take a while.

Maybe I misunderstand, but this doesn't seem to make much sense.

If you want i_version updates to appear to be atomic w.r.t O_DIRECT
writes, then you need to prevent accessing the i_version while any write
is on-going. At that time there is no meaningful value for i_version.
So you need a lock (At least shared) around the actual write, and you
need an exclusive lock around the get_i_version().
So accessing the i_version would have to wait for all pending O_DIRECT
writes to complete, and would block any new O_DIRECT writes from
starting.

This could be expensive.

There is not currently any locking around O_DIRECT writes.  You cannot
synchronise with them.

The best you can do is update the i_version immediately after all the
O_DIRECT writes in a single request complete.

> 
> To summarize, there are two main uses for the change attr in NFSv4:
> 
> 1/ to provide change_info4 for directory morphing operations (CREATE,
> LINK, OPEN, REMOVE, and RENAME). It turns out that this is already
> atomic in the current nfsd code (AFAICT) by virtue of the fact that we
> hold the i_rwsem exclusively over these operations. The change attr is
> also queried pre and post while the lock is held, so that should ensure
> that we get true atomicity for this.

Yes, directory ops are relatively easy.

> 
> 2/ as an adjunct for the ctime when fetching attributes to validate
> caches. We don't expect perfect consistency between read (and readlike)
> operations and GETATTR, even when they're in the same compound.
> 
> IOW, a READ+GETATTR compound can legally give you a short (or zero-
> length) read, and then the getattr indicates a size that is larger than
> where the READ data stops, due to a write or truncate racing in after
> the read.

I agree that atomicity is neither necessary nor practical.  Ordering is
important though.  I don't think a truncate(0) racing with a READ can
credibly result in a non-zero size AFTER a zero-length read.  A truncate
that extends the size could have that effect though.

> 
> Ideally, the attributes in the GETATTR reply should be consistent
> between themselves though. IOW, all of the attrs should accurately
> represent the state of the file at a single point in time.
> change+size+times+etc. should all be consistent with one another.
> 
> I think we get all of this by taking the inode_lock around the
> vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant
> solution, but it should give us the atomicity we need, and it doesn't
> require adding extra operations or locking to the write codepaths.

Explicit attribute changes (chown/chmod/utimes/truncate etc) are always
done under the inode lock.  Implicit changes via inode_update_time() are
not (though xfs does take the lock, ext4 doesn't, haven't checked
others).  So taking the inode lock won't ensure those are internally
consistent.

I think using inode_lock_shared() is acceptable.  It doesn't promise
perfect atomicity, but it is probably good enough.

We'd need a good reason to want perfect atomicity to go further, and I
cannot think of one.

NeilBrown


> 
> We could also consider less invasive ways to achieve this (maybe some
> sort of seqretry loop around the vfs_getattr call?), but I'd rather not
> do extra work in the write codepaths if we can get away with it.
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
> 

  reply	other threads:[~2022-09-10 22:53 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-07 11:16 [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field Jeff Layton
2022-09-07 11:37 ` NeilBrown
2022-09-07 12:20   ` J. Bruce Fields
2022-09-07 12:58     ` Jeff Layton
2022-09-07 12:47   ` Jeff Layton
2022-09-07 12:52     ` J. Bruce Fields
2022-09-07 13:12       ` Jeff Layton
2022-09-07 13:51         ` Jan Kara
2022-09-07 14:43           ` Jeff Layton
2022-09-08  0:44           ` NeilBrown
2022-09-08  8:33             ` Jan Kara
2022-09-08 15:21               ` Theodore Ts'o
2022-09-08 15:44                 ` J. Bruce Fields
2022-09-08 15:44                 ` Jeff Layton
2022-09-08 15:56                   ` J. Bruce Fields
2022-09-08 16:15                     ` Chuck Lever III
2022-09-08 17:40                     ` Jeff Layton
2022-09-08 18:22                       ` J. Bruce Fields
2022-09-08 19:07                         ` Jeff Layton
2022-09-08 23:01                           ` NeilBrown
2022-09-08 23:23                             ` Jeff Layton
2022-09-08 23:45                               ` NeilBrown
2022-09-09 15:45                           ` J. Bruce Fields
2022-09-09 16:36                             ` Jeff Layton
2022-09-10 14:56                               ` J. Bruce Fields
2022-09-12 11:42                                 ` Jeff Layton
2022-09-12 12:13                                   ` Florian Weimer
2022-09-12 12:55                                     ` Jeff Layton
2022-09-12 13:20                                       ` Florian Weimer
2022-09-12 13:49                                         ` Jeff Layton
2022-09-12 13:51                                       ` J. Bruce Fields
2022-09-12 14:02                                         ` Jeff Layton
2022-09-12 14:47                                           ` J. Bruce Fields
2022-09-12 14:15                                         ` Trond Myklebust
2022-09-12 14:50                                           ` J. Bruce Fields
2022-09-12 14:56                                             ` Trond Myklebust
2022-09-12 15:32                                               ` Trond Myklebust
2022-09-12 15:49                                                 ` Jeff Layton
2022-09-12 12:54                                   ` J. Bruce Fields
2022-09-12 12:59                                     ` Jeff Layton
2022-09-13  0:29                                   ` John Stoffel
2022-09-13  0:41                                   ` Dave Chinner
2022-09-13  1:49                                     ` NeilBrown
2022-09-13  2:41                                       ` Dave Chinner
2022-09-13  3:30                                         ` NeilBrown
2022-09-13  9:38                                           ` Theodore Ts'o
2022-09-13 19:02                                       ` J. Bruce Fields
2022-09-13 23:19                                         ` NeilBrown
2022-09-14  0:08                                           ` J. Bruce Fields
2022-09-09 20:34                           ` John Stoffel
2022-09-10 22:13                           ` NeilBrown
2022-09-12 10:43                             ` Jeff Layton
2022-09-12 13:42                             ` J. Bruce Fields
2022-09-12 23:14                               ` NeilBrown
2022-09-15 14:06                                 ` J. Bruce Fields
2022-09-15 15:08                                   ` Trond Myklebust
2022-09-15 16:45                                     ` Jeff Layton
2022-09-15 17:49                                       ` Trond Myklebust
2022-09-15 18:11                                         ` Jeff Layton
2022-09-15 19:03                                           ` Trond Myklebust
2022-09-15 19:25                                             ` Jeff Layton
2022-09-15 22:23                                               ` NeilBrown
2022-09-16  6:54                                                 ` Theodore Ts'o
2022-09-16 11:36                                                   ` Jeff Layton
2022-09-16 15:11                                                     ` Jeff Layton
2022-09-18 23:53                                                       ` Dave Chinner
2022-09-19 13:13                                                         ` Jeff Layton
2022-09-20  0:16                                                           ` Dave Chinner
2022-09-20 10:26                                                             ` Jeff Layton
2022-09-21  0:00                                                               ` Dave Chinner
2022-09-21 10:33                                                                 ` Jeff Layton
2022-09-21 21:41                                                                   ` Dave Chinner
2022-09-22 10:18                                                                     ` Jeff Layton
2022-09-22 20:18                                                                       ` Jeff Layton
2022-09-23  9:56                                                                         ` Jan Kara
2022-09-23 10:19                                                                           ` Jeff Layton
2022-09-23 13:44                                                                           ` Trond Myklebust
2022-09-23 13:50                                                                             ` Jeff Layton
2022-09-23 14:58                                                                               ` Frank Filz
2022-09-26 22:43                                                                               ` NeilBrown
2022-09-27 11:14                                                                                 ` Jeff Layton
2022-09-27 13:18                                                                                 ` Jeff Layton
2022-09-15 15:41                                   ` Jeff Layton
2022-09-15 22:42                                     ` NeilBrown
2022-09-16 11:32                                       ` Jeff Layton
2022-09-09 12:11                       ` Theodore Ts'o
2022-09-09 12:47                         ` Jeff Layton
2022-09-09 13:48                           ` Theodore Ts'o
2022-09-09 14:43                             ` Jeff Layton
2022-09-09 14:58                               ` Theodore Ts'o
2022-09-08 22:55                   ` NeilBrown
2022-09-08 23:59                     ` Trond Myklebust
2022-09-09  0:51                       ` NeilBrown
2022-09-09  1:05                         ` Trond Myklebust
2022-09-09  1:07                         ` NeilBrown
2022-09-09  1:10                           ` Trond Myklebust
2022-09-09  2:14                             ` Trond Myklebust
2022-09-09  6:41                               ` NeilBrown
2022-09-10 12:39                                 ` Jeff Layton
2022-09-10 22:53                                   ` NeilBrown [this message]
2022-09-12 10:25                                     ` Jeff Layton
2022-09-12 23:29                                       ` NeilBrown
2022-09-13  1:15                                         ` Dave Chinner
2022-09-13  1:41                                           ` NeilBrown
2022-09-13 19:01                                           ` Jeff Layton
2022-09-13 23:24                                             ` NeilBrown
2022-09-14 11:51                                               ` Jeff Layton
2022-09-14 22:45                                                 ` NeilBrown
2022-09-14 23:02                                                   ` NeilBrown
2022-09-08 22:40                 ` NeilBrown
2022-09-07 13:55         ` Trond Myklebust
2022-09-07 14:05           ` Jeff Layton
2022-09-07 15:04             ` Trond Myklebust
2022-09-07 15:11               ` Jeff Layton
2022-09-08  0:40             ` NeilBrown
2022-09-08 11:34               ` Jeff Layton
2022-09-08 22:29                 ` NeilBrown
2022-09-09 11:53                   ` Jeff Layton
2022-09-10 22:58                     ` NeilBrown
2022-09-10 19:46               ` Al Viro
2022-09-10 23:00                 ` NeilBrown
2022-09-08  0:31           ` NeilBrown
2022-09-08  0:41             ` Trond Myklebust
2022-09-08  0:53               ` NeilBrown
2022-09-08 11:37               ` Jeff Layton
2022-09-08 12:40                 ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=166285038617.30452.11636397081493278357@noble.neil.brown.name \
    --to=neilb@suse.de \
    --cc=adilger.kernel@dilger.ca \
    --cc=bfields@fieldses.org \
    --cc=brauner@kernel.org \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=fweimer@redhat.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=lczerner@redhat.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xiubli@redhat.com \
    --cc=zohar@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.