From: "NeilBrown" <neilb@suse.de>
To: "Amir Goldstein" <amir73il@gmail.com>
Cc: "Jeff Layton" <jlayton@kernel.org>,
tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org,
david@fromorbit.com, trondmy@hammerspace.com,
viro@zeniv.linux.org.uk, zohar@linux.ibm.com, xiubli@redhat.com,
chuck.lever@oracle.com, lczerner@redhat.com, jack@suse.cz,
bfields@fieldses.org, brauner@kernel.org, fweimer@redhat.com,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, ceph-devel@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
Date: Tue, 04 Oct 2022 09:56:42 +1100
Message-ID: <166483780286.14457.1388505585556274283@noble.neil.brown.name>
In-Reply-To: <CAOQ4uxi6pPDexF7Z1wshnpV0kbSKsHUeawaUkhjq4FNGbqWU+A@mail.gmail.com>
On Tue, 04 Oct 2022, Amir Goldstein wrote:
> On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > The c/mtime and i_version currently get updated before the data is
> > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > >
> > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > to make the client associate the state of the file with the wrong change
> > > > attribute. That association can persist indefinitely if the file sees no
> > > > further changes.
> > > >
> > > > Move the setting of times to the bottom of the function in
> > > > __generic_file_write_iter and only update it if something was
> > > > successfully written.
> > > >
> > >
> > > This solution is wrong for several reasons:
> > >
> > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > solved the problem completely
> >
> > Right. I don't think there is a way to solve the problem vs. mmap.
> > Userland can write to a writeable mmap'ed page at any time and we'd
> > never know. We have to specifically carve out mmap as an exception here.
> > I'll plan to add something to the manpage patch for this.
> >
> > > 2. The other side of the coin is that post crash state is more likely to end
> > > up with data changes without mtime/ctime change
> > >
> >
> > Is this really something filesystems rely on? I suppose the danger is
> > that some cached data gets written to disk before the write returns and
> > the inode on disk never gets updated.
> >
> > But...isn't that a danger now? Some of the cached data could get written
> > out and the updated inode just never makes it to disk before a crash
> > (AFAIU). I'm not sure that this increases our exposure to that problem.
> >
> >
>
> You are correct that that danger exists, but it only exists for overwriting
> to allocated blocks.
>
> For writing to new blocks, mtime change is recorded in transaction
> before the block mapping is recorded in transaction so there is no
> danger in this case (before your patch).
>
> Also, observing size change without observing mtime change
> after crash seems like a very bad outcome that may be possible
> after your change.
>
> These are just a few cases that I could think of, they may be filesystem
> dependent, but my gut feeling is that if you remove the time update before
> the operation, that has been like that forever, a lot of s#!t is going to float
> for various filesystems and applications.
>
> And it is not one of those things that are discovered during rc or even
> stable kernel testing - they are discovered much later when users start to
> realize their applications got bogged up after crash, so it feels to me
> like playing with fire.
>
> > > If I read the problem description correctly, then a solution that invalidates
> > > the NFS cache before AND after the write would be acceptable. Right?
> > > Would an extra i_version bump after the write solve the race?
> > >
> >
> > I based this patch on Neil's assertion that updating the time before an
> > operation was pointless if we were going to do it afterward. The NFS
> > client only really cares about seeing it change after a write.
> >
>
> Pointless to NFS client maybe.
> Whether or not this changes user behavior for other applications
> is up to you to prove and I doubt that you can prove it because I doubt
> that it is true.
>
> > Doing both would be fine from a correctness standpoint, and in most
> > cases, the second would be a no-op anyway since a query would have to
> > race in between the two for that to happen.
> >
> > FWIW, I think we should update the m/ctime and version at the same time.
> > If the version changes, then there is always the potential that a timer
> > tick has occurred. So, that would translate to a second call to
> > file_update_time in here.
> >
> > The downside of bumping the times/version both before and after is that
> > these are hot codepaths, and we'd be adding extra operations there. Even
> > in the case where nothing has changed, we'd have to call
> > inode_needs_update_time a second time for every write. Is that worth the
> > cost?
>
> Is there a practical cost for iversion bump AFTER write as I suggested?
> If you NEED m/ctime update AFTER write and iversion update is not enough
> then I did not understand from your commit message why that is.
>
> Thanks,
> Amir.
>

Maybe we should split i_version updates from ctime updates.

While it isn't true that ctime updates have happened before the write
"forever", it has been true since 2.3.43[1], which is close to forever.
For ctime there doesn't appear to be a strong specification of when the
change happens, so history provides a good case for leaving it before.

For i_version we want to provide clear and unambiguous semantics.
Performing two updates makes the specification muddy.
So I would prefer a single update for i_version, performed after the
change becomes visible. If that means it has to be separate from ctime,
then so be it.

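
To make the ordering argument concrete, here is a minimal userspace sketch. It is a toy model only, not kernel code: `struct toy_inode`, `struct toy_snapshot`, `enum toy_order` and every `toy_*` function are hypothetical names invented for illustration, and the "timestamp" is just a counter. It assumes a write can be split into three steps (before the copy, the copy itself, after the copy) and that a reader may run between any two of them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for an inode; NOT the kernel's struct inode. */
struct toy_inode {
	char     data[16];
	uint64_t ctime;      /* fake timestamp, a plain counter */
	uint64_t i_version;  /* change attribute */
};

/* What one READ+GETATTR pair would hand back to a client. */
struct toy_snapshot {
	char     data[16];
	uint64_t i_version;
};

/* When the single i_version bump happens relative to the data copy. */
enum toy_order { TOY_BUMP_BEFORE, TOY_BUMP_AFTER };

static struct toy_snapshot toy_read(const struct toy_inode *ino)
{
	struct toy_snapshot s;

	memcpy(s.data, ino->data, sizeof(s.data));
	s.i_version = ino->i_version;
	return s;
}

/* Step 1: runs before the data is copied. ctime is updated here in both orders. */
static void toy_write_begin(struct toy_inode *ino, enum toy_order order)
{
	ino->ctime++;
	if (order == TOY_BUMP_BEFORE)
		ino->i_version++;
}

/* Step 2: the new data becomes visible to readers. */
static void toy_write_copy(struct toy_inode *ino, const char *buf)
{
	assert(strlen(buf) < sizeof(ino->data));
	memset(ino->data, 0, sizeof(ino->data));
	memcpy(ino->data, buf, strlen(buf));
}

/* Step 3: runs after the copy; the only i_version bump in the AFTER order. */
static void toy_write_end(struct toy_inode *ino, enum toy_order order)
{
	if (order == TOY_BUMP_AFTER)
		ino->i_version++;
}

/*
 * A cached snapshot is stuck if it carries the file's current i_version
 * but different data: the client has no reason to ever revalidate it.
 */
static int toy_cache_is_stale(const struct toy_snapshot *cached,
			      const struct toy_inode *ino)
{
	return cached->i_version == ino->i_version &&
	       strcmp(cached->data, ino->data) != 0;
}
```

In this model, calling `toy_read()` between `toy_write_begin()` and `toy_write_copy()` with `TOY_BUMP_BEFORE` yields a snapshot for which `toy_cache_is_stale()` is true once the write finishes. With `TOY_BUMP_AFTER` the same interleaving leaves the snapshot tagged with the old i_version, so the next query sees a different value and the cache is dropped, while ctime is still updated before the copy as it is today.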
NeilBrown
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29