All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
	tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org,
	trondmy@hammerspace.com, neilb@suse.de, viro@zeniv.linux.org.uk,
	zohar@linux.ibm.com, xiubli@redhat.com, chuck.lever@oracle.com,
	lczerner@redhat.com, bfields@fieldses.org, brauner@kernel.org,
	fweimer@redhat.com, linux-btrfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	ceph-devel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
Date: Tue, 18 Oct 2022 13:04:34 -0400	[thread overview]
Message-ID: <fcd35c14353ae859d778a23f80249047819bc4b0.camel@kernel.org> (raw)
In-Reply-To: <20221018151721.cl6dbupqjkkivxyf@quack3>

On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > for this is for backup applications, and for those a counter that could
> > > > > > go backward is worse than useless.
> > > > > 
> > > > > From the perspective of a backup program doing incremental backups,
> > > > > an inode with a change counter that has a different value to the
> > > > > current backup inventory means the file contains different
> > > > > information than what the current backup inventory holds. Again,
> > > > > snapshots, rollbacks, etc.
> > > > > 
> > > > > Therefore, regardless of whether the change counter has gone
> > > > > forwards or backwards, the backup program needs to back up this
> > > > > current version of the file in this backup because it is different
> > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > up, it will not be creating an exact backup of the user's data at
> > > > > the point in time the backup is run...
> > > > > 
> > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > programs - they really do have to be able to handle filesystems that
> > > > > have modifications that move backwards in time as well as forwards...
> > > > 
> > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > that after a crash, we can end up with a change attr seen before the
> > > > crash that is now associated with a completely different inode state.
> > > > 
> > > > The scenario is something like:
> > > > 
> > > > - Change attr for an empty file starts at 1
> > > > 
> > > > - Write "A" to file, change attr goes to 2
> > > > 
> > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > 
> > > > - Crash (before last change is logged to disk)
> > > > 
> > > > - Machine reboots, inode is empty, change attr back to 1
> > > > 
> > > > - Write "B" to file, change attr goes to 2
> > > > 
> > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > correct when it isn't (should be "B" not "A" now).
> > > > 
> > > > The real danger comes not from the thing going backward, but the fact
> > > > that it can march forward again after going backward, and then the
> > > > client can see two different inode states associated with the same
> > > > change attr value. Jumping all the change attributes forward by a
> > > > significant amount after a crash should avoid this issue.
> > > 
> > > As Dave pointed out, the problem with change attr having the same value for
> > > a different inode state (after going backwards) holds not only for the
> > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > common case but the edge cases are awkward and there's no easy way to
> > > detect you are in the edge case.
> > > 
> > 
> > This is true. In fact in the snapshot case you can't even rely on doing
> > anything at reboot since you won't necessarily need to reboot to make it
> > roll backward.
> > 
> > Whether that obviates the use of this value altogether, I'm not sure.
> > 
> > > So I think any implementation caring about data integrity would have to
> > > include something like ctime into the picture anyway. Or we could just
> > > completely give up any idea of monotonicity and on each mount select random
> > > prime P < 2^64 and instead of doing inc when advancing the change
> > > attribute, we'd advance it by P. That makes collisions after restore /
> > > crash fairly unlikely.
> > 
> > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > invalidations.
> > 
> > If we just increment it by a particular offset on every reboot, then
> > every time the server reboots, the clients will invalidate all of their
> > cached inodes, and proceed to hammer the server with READ calls just as
> > it's having to populate its own caches from disk.
> 
> Note that I didn't propose to increment by offset on every reboot or mount.
> I have proposed that inode_maybe_inc_iversion() would not increment
> i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> on filesystem mount.
> 
> This will not cause cache invalidation after a clean unmount + remount. It
> will cause cache invalidation after a crash, snapshot rollback etc., only for
> inodes where i_version changed. If P is suitably selected (e.g. as being a
> prime), then the chances of collisions (even after a snapshot rollback) are
> very low (on the order of 2^(-50) if my piece of envelope calculations are
> right).
>
> So this should nicely deal with all the problems we've spotted so far. But
> I may be missing something...


Got it! That makes a lot more sense. Thinking about this some more...

What sort of range for P would be suitable?

Every increment would need to be by (shifted) P, so we can't choose too
large a number. Queries are pretty rare vs. writes though, so that
mitigates the issue somewhat.

There are 31 primes between 1 and 127. Worst case, we'd still have 2^48
increments before the counter wraps.

Let me think about this some more, but maybe that's good enough to
ensure uniqueness.
-- 
Jeff Layton <jlayton@kernel.org>

  reply	other threads:[~2022-10-18 17:04 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
2022-10-17 10:57 ` [PATCH v7 2/9] fs: clarify when the i_version counter must be updated Jeff Layton
2022-10-17 10:57 ` [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
2022-10-17 10:57 ` [PATCH v7 4/9] nfs: report the inode version in getattr if requested Jeff Layton
2022-10-17 10:57 ` [PATCH v7 5/9] ceph: " Jeff Layton
2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
2022-11-03 15:42   ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
2022-11-03 15:42   ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
2022-11-03 15:43   ` Chuck Lever III
2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
2022-10-17 22:14   ` Dave Chinner
2022-10-18 10:35     ` Jeff Layton
2022-10-18 13:49       ` Jan Kara
2022-10-18 14:21         ` Jeff Layton
2022-10-18 15:17           ` Jan Kara
2022-10-18 17:04             ` Jeff Layton [this message]
2022-10-19 17:23               ` Jan Kara
2022-10-19 18:47                 ` Jeff Layton
2022-10-20 10:39                   ` Jan Kara
2022-10-21 10:08                     ` Jeff Layton
2022-10-18 14:56       ` Jeff Layton
2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
2022-10-19 12:18   ` Jeff Layton
2022-10-19 15:45     ` Darrick J. Wong
2022-10-19 20:36       ` Jeff Layton
2022-10-20  6:58         ` Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fcd35c14353ae859d778a23f80249047819bc4b0.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=bfields@fieldses.org \
    --cc=brauner@kernel.org \
    --cc=ceph-devel@vger.kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=fweimer@redhat.com \
    --cc=jack@suse.cz \
    --cc=lczerner@redhat.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=trondmy@hammerspace.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xiubli@redhat.com \
    --cc=zohar@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.