ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "NeilBrown" <neilb@suse.de>
To: "Amir Goldstein" <amir73il@gmail.com>
Cc: "Jeff Layton" <jlayton@kernel.org>,
	tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org,
	david@fromorbit.com, trondmy@hammerspace.com,
	viro@zeniv.linux.org.uk, zohar@linux.ibm.com, xiubli@redhat.com,
	chuck.lever@oracle.com, lczerner@redhat.com, jack@suse.cz,
	bfields@fieldses.org, brauner@kernel.org, fweimer@redhat.com,
	linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, ceph-devel@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
Date: Tue, 04 Oct 2022 09:56:42 +1100	[thread overview]
Message-ID: <166483780286.14457.1388505585556274283@noble.neil.brown.name> (raw)
In-Reply-To: <CAOQ4uxi6pPDexF7Z1wshnpV0kbSKsHUeawaUkhjq4FNGbqWU+A@mail.gmail.com>

On Tue, 04 Oct 2022, Amir Goldstein wrote:
> On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > The c/mtime and i_version currently get updated before the data is
> > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > >
> > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > to make the client associate the state of the file with the wrong change
> > > > attribute. That association can persist indefinitely if the file sees no
> > > > further changes.
> > > >
> > > > Move the setting of times to the bottom of the function in
> > > > __generic_file_write_iter and only update it if something was
> > > > successfully written.
> > > >
> > >
> > > This solution is wrong for several reasons:
> > >
> > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > >     solved the problem completely
> >
> > Right. I don't think there is a way to solve the problem vs. mmap.
> > Userland can write to a writeable mmap'ed page at any time and we'd
> > never know. We have to specifically carve out mmap as an exception here.
> > I'll plan to add something to the manpage patch for this.
> >
> > > 2. The other side of the coin is that post crash state is more likely to end
> > >     up data changes without mtime/ctime change
> > >
> >
> > Is this really something filesystems rely on? I suppose the danger is
> > that some cached data gets written to disk before the write returns and
> > the inode on disk never gets updated.
> >
> > But...isn't that a danger now? Some of the cached data could get written
> > out and the updated inode just never makes it to disk before a crash
> > (AFAIU). I'm not sure that this increases our exposure to that problem.
> >
> >
> 
> You are correct that that danger exists, but it only exists for overwriting
> to allocated blocks.
> 
> For writing to new blocks, mtime change is recorded in transaction
> before the block mapping is recorded in transaction so there is no
> danger in this case (before your patch).
> 
> Also, observing size change without observing mtime change
> after crash seems like a very bad outcome that may be possible
> after your change.
> 
> These are just a few cases that I could think of, they may be filesystem
> dependent, but my gut feeling is that if you remove the time update before
> the operation, that has been like that forever, a lot of s#!t is going to float
> for various filesystems and applications.
> 
> And it is not one of those things that are discovered  during rc or even
> stable kernel testing - they are discovered much later when users start to
> realize their applications got bogged up after crash, so it feels like to me
> like playing with fire.
> 
> > > If I read the problem description correctly, then a solution that invalidates
> > > the NFS cache before AND after the write would be acceptable. Right?
> > > Would an extra i_version bump after the write solve the race?
> > >
> >
> > I based this patch on Neil's assertion that updating the time before an
> > operation was pointless if we were going to do it afterward. The NFS
> > client only really cares about seeing it change after a write.
> >
> 
> Pointless to NFS client maybe.
> Whether or not this is not changing user behavior for other applications
> is up to you to prove and I doubt that you can prove it because I doubt
> that it is true.
> 
> > Doing both would be fine from a correctness standpoint, and in most
> > cases, the second would be a no-op anyway since a query would have to
> > race in between the two for that to happen.
> >
> > FWIW, I think we should update the m/ctime and version at the same time.
> > If the version changes, then there is always the potential that a timer
> > tick has occurred. So, that would translate to a second call to
> > file_update_time in here.
> >
> > The downside of bumping the times/version both before and after is that
> > these are hot codepaths, and we'd be adding extra operations there. Even
> > in the case where nothing has changed, we'd have to call
> > inode_needs_update_time a second time for every write. Is that worth the
> > cost?
> 
> Is there a practical cost for iversion bump AFTER write as I suggested?
> If you NEED m/ctime update AFTER write and iversion update is not enough
> then I did not understand from your commit message why that is.
> 
> Thanks,
> Amir.
> 

Maybe we should split i_version updates from ctime updates.

While it isn't true that ctime updates have happened before the write
"forever" it has been true since 2.3.43[1] which is close to forever.

For ctime there doesn't appear to be a strong specification of when the
change happens, so history provides a good case for leaving it before.
For i_version we want to provide clear and unambiguous semantics.
Performing 2 updates makes the specification muddy.

So I would prefer a single update for i_version, performed after the
change becomes visible.  If that means it has to be separate from ctime,
then so be it.

NeilBrown


[1]:  https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29

  reply	other threads:[~2022-10-03 22:57 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
2022-10-03 23:05   ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated Jeff Layton
2022-10-03 23:10   ` NeilBrown
2022-10-04  9:53     ` Jeff Layton
2022-09-30 11:18 ` [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
2022-10-03 23:14   ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 4/9] nfs: report the inode version in getattr if requested Jeff Layton
2022-10-03 23:29   ` NeilBrown
2022-10-04  9:43     ` Jeff Layton
2022-10-04 22:27       ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 5/9] ceph: " Jeff Layton
2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
2022-09-30 14:34   ` Chuck Lever III
2022-09-30 22:32     ` Dave Chinner
2022-10-03 23:39   ` NeilBrown
2022-10-05 10:06     ` Jeff Layton
2022-10-05 13:33       ` Chuck Lever III
2022-10-05 13:34       ` Trond Myklebust
2022-10-05 13:57         ` Jeff Layton
2022-10-05 21:14       ` NeilBrown
2022-10-06 11:15         ` Jeff Layton
2022-10-06 21:17     ` Dave Chinner
2022-09-30 11:18 ` [PATCH v6 7/9] vfs: expose STATX_VERSION to userland Jeff Layton
2022-10-03 23:42   ` NeilBrown
2022-10-05 10:08     ` Jeff Layton
2022-09-30 11:18 ` [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter Jeff Layton
2022-10-02  7:08   ` Amir Goldstein
2022-10-03 13:01     ` Jeff Layton
2022-10-03 13:52       ` Amir Goldstein
2022-10-03 22:56         ` NeilBrown [this message]
2022-10-05 16:40           ` Jeff Layton
2022-10-05 21:40             ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 9/9] ext4: update times after I/O in write codepaths Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=166483780286.14457.1388505585556274283@noble.neil.brown.name \
    --to=neilb@suse.de \
    --cc=adilger.kernel@dilger.ca \
    --cc=amir73il@gmail.com \
    --cc=bfields@fieldses.org \
    --cc=brauner@kernel.org \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=fweimer@redhat.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=lczerner@redhat.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xiubli@redhat.com \
    --cc=zohar@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).