linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx
@ 2022-08-26 21:46 Jeff Layton
  2022-08-26 21:46 ` [PATCH v3 1/7] iversion: update comments with info about atime updates Jeff Layton
                   ` (6 more replies)
  0 siblings, 7 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:46 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs

The purpose of this patchset is two-fold:

1/ correct performance issues in the existing i_version counter
implementations and (hopefully) to bring them into behavioral
alignment.

2/ expose the i_version field to userland via statx. This is useful for
testing the various i_version implementations, but may also be useful for
userland applications that want a way to tell whether a file might
have changed.

The i_version field in Linux has been around since 1994, but its meaning
and behavior has subtly changed over time [1]. The first patch in this
series fleshes out the comments in iversion.h to try and give a clear
explanation of what's expected from the filesystem. My first ask is for
feedback on this -- does the proposed definition seem reasonable for
presenting to userland?

There are two main consumers of i_version in the kernel: nfsd and IMA.
They both only want to see a change to the i_version iff there was an
explicit change to the inode. They can cope with an implementation that
does more increments than that, but that measurably harms performance.

I'll argue that atime-only updates SHOULD be excluded from i_version
bumps since they don't represent a "real" change to the inode. Spurious
updates to the i_version have real, measurable performance impacts with
NFSv4, and possibly with IMA.

There are 3 kernel-managed i_version implementations in the kernel:
btrfs, ext4 and xfs.

btrfs seems to work as we'd expect. It doesn't bump the i_version
on atime-only updates and seems to bump it appropriately for other
activity.

ext4 currently bumps the i_version even when only the atime is being
updated. I have a patch to fix this that Jan and Christian have
Reviewed, but I haven't yet heard from Ted or Andreas.

xfs has the same issue as ext4 bumping i_version on atime updates.  He
has NACK'ed the patch I proposed since there are evidently tools that
depend on every log transaction being represented in i_version.

I've included the xfs patch in this series, but if it's not suitable I'm
open to fixing this other ways, but I'll need some feedback as to what
the xfs developers would like to do here. Should we add a new on-disk
field to the inode? Try to do something clever with "unused" parts of
the ctime? What would be best?

Lastly, there are patches to allow NFS and Ceph to report this value
as well. They should be fairly straightforward once the earlier pile is
resolved.

Note that I dropped the patch to make AFS report STATX_INO_VERSION since
its semantics don't match the proposed definition.

[1]: Now, for your entertainment...

A BRIEF HISTORY OF THE I_VERSION FIELD
======================================

PRE-GIT-HISTORY ERA:
--------------------
The i_version field first appears in v1.1.30 (summer 1994) with more increments
added to ext2 over next few v1.1.x versions. There were ioctls to set and fetch
the value in ext2. They're still there but they access the i_generation
counter now.

It was mostly used to catch races in lookup and readdir due to directory
changes, and a lot of filesystems still use it that way today. Non-directory
inodes would have this value set, but the kernel didn't do much with it.

GIT HISTORY ERA:
----------------
Then in 2.6.24, Jean Noel Cordenner from Bull increased the size to 64
bits, added the MS_I_VERSION flag, and started incrementing it in
file_update_time:

    commit 7a224228ed79d587ece2304869000aad1b8e97dd
    Author: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
    Date:   Mon Jan 28 23:58:27 2008 -0500

        vfs: Add 64 bit i_version support

        The i_version field of the inode is changed to be a 64-bit counter that
        is set on every inode creation and that is incremented every time the
        inode data is modified (similarly to the "ctime" time-stamp).
        The aim is to fulfill a NFSv4 requirement for rfc3530.
        This first part concerns the vfs, it converts the 32-bit i_version in
        the generic inode to a 64-bit, a flag is added in the super block in
        order to check if the feature is enabled and the i_version is
        incremented in the vfs.

        Signed-off-by: Mingming Cao <cmm@us.ibm.com>
        Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
        Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>

Then he added support to ext4, plus the mount option to enable it. The
problem with the i_version being incremented during atime updates
probably dates back to this patch. I imagine it was probably just an
oversight, though it could just have been due to unclear definition for
the change attr in the NFSv4.0 spec:

    commit 25ec56b518257a56d2ff41a941d288e4b5ff9488
    Author: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
    Date:   Mon Jan 28 23:58:27 2008 -0500

        ext4: Add inode version support in ext4

        This patch adds 64-bit inode version support to ext4. The lower 32 bits
        are stored in the osd1.linux1.l_i_version field while the high 32 bits
        are stored in the i_version_hi field newly created in the ext4_inode.
        This field is incremented in case the ext4_inode is large enough. A
        i_version mount option has been added to enable the feature.

        Signed-off-by: Mingming Cao <cmm@us.ibm.com>
        Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
        Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
        Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
        Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>

Bruce then added support for it to nfsd:

    commit c654b8a9cba6002aad1c01919e4928a79a4a6dcf
    Author: J. Bruce Fields <bfields@citi.umich.edu>
    Date:   Thu Apr 16 17:33:25 2009 -0400

        nfsd: support ext4 i_version

        ext4 supports a real NFSv4 change attribute, which is bumped whenever
        the ctime would be updated, including times when two updates arrive
        within a jiffy of each other.  (Note that although ext4 has space for
        nanosecond-precision ctime, the real resolution is lower: it actually
        uses jiffies as the time-source.)  This ensures clients will invalidate
        their caches when they need to.

        There is some fear that keeping the i_version up-to-date could have
        performance drawbacks, so for now it's turned on only by a mount option.
        We hope to do something better eventually.

        Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
        Cc: Theodore Tso <tytso@mit.edu>

Josef converted btrfs to use it instead of their own internal counter.
It looks like the btrfs implementation has probably avoided the issue
with atime updates causing i_version bumps.

    commit 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a
    Author: Josef Bacik <josef@redhat.com>
    Date:   Thu Apr 5 15:03:02 2012 -0400

        Btrfs: use i_version instead of our own sequence

        We've been keeping around the inode sequence number in hopes that somebody
        would use it, but nobody uses it and people actually use i_version which
        serves the same purpose, so use i_version where we used the incore inode's
        sequence number and that way the sequence is updated properly across the
        board, and not just in file write.  Thanks,

        Signed-off-by: Josef Bacik <josef@redhat.com>

Then, in 2013 Dave added support for xfs with v3 superblocks. There were
some later changes of how it was stored, but its behavior has largely
been the same on xfs since then. Note that at the time, the stated
reason for adding this was to provide NFSv4 semantics:

    commit dc037ad7d24f3711e431a45c053b5d425995e9e4
    Author: Dave Chinner <dchinner@redhat.com>
    Date:   Thu Jun 27 16:04:59 2013 +1000

        xfs: implement inode change count

        For CRC enabled filesystems, add support for the monotonic inode
        version change counter that is needed by protocols like NFSv4 for
        determining if the inode has changed in any way at all between two
        unrelated operations on the inode.

        This bumps the change count the first time an inode is dirtied in a
        transaction. Since all modifications to the inode are logged, this
        will catch all changes that are made to the inode, including
        timestamp updates that occur during data writes.

        Signed-off-by: Dave Chinner <dchinner@redhat.com>
        Reviewed-by: Mark Tinguely <tinguely@sgi.com>
        Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
        Signed-off-by: Ben Myers <bpm@sgi.com>

Jeff Layton (7):
  iversion: update comments with info about atime updates
  ext4: fix i_version handling in ext4
  ext4: unconditionally enable the i_version counter
  xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  vfs: report an inode version in statx for IS_I_VERSION inodes
  nfs: report the inode version in statx if requested
  ceph: fill in the change attribute in statx requests

 fs/ceph/inode.c                 | 14 +++++++++-----
 fs/ext4/inode.c                 | 10 +++++-----
 fs/ext4/ioctl.c                 |  4 ++++
 fs/ext4/move_extent.c           |  6 ++++++
 fs/ext4/super.c                 | 13 ++++---------
 fs/ext4/xattr.c                 |  1 +
 fs/nfs/inode.c                  |  7 +++++--
 fs/stat.c                       |  7 +++++++
 fs/xfs/libxfs/xfs_log_format.h  |  2 +-
 fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
 fs/xfs/xfs_iops.c               | 11 +++++++++--
 include/linux/iversion.h        | 23 +++++++++++++++++++++--
 include/linux/stat.h            |  1 +
 include/uapi/linux/stat.h       |  3 ++-
 samples/vfs/test-statx.c        |  8 ++++++--
 15 files changed, 82 insertions(+), 30 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
@ 2022-08-26 21:46 ` Jeff Layton
  2022-08-29  7:56   ` Dave Chinner
  2022-08-26 21:46 ` [PATCH v3 2/7] ext4: fix i_version handling in ext4 Jeff Layton
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:46 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

The i_version field in the kernel has had different semantics over
the decades, but we're now proposing to expose it to userland via
statx. This means that we need a clear, consistent definition of
what it means and when it should change.

Update the comments in iversion.h to describe how a conformant
i_version implementation is expected to behave. This definition
suits the current users of i_version (NFSv4 and IMA), but is
loose enough to allow for a wide range of possible implementations.

Cc: Colin Walters <walters@verbum.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..45e93e1b4edc 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,19 @@
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear different to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
+ *
+ * An explicit change is one that would ordinarily result in a change to the
+ * inode status change time (aka ctime). The version must appear to change, even
+ * if the ctime does not (since the whole point is to avoid missing updates due
+ * to timestamp granularity). If POSIX mandates that the ctime must change due
+ * to an operation, then the i_version counter must be incremented as well.
+ *
+ * A conformant implementation is allowed to increment the counter in other
+ * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
+ * whether caches are up to date. Spurious increments can cause false cache
+ * invalidations.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
@@ -66,6 +77,14 @@
  * Storing the value to disk therefore does not count as a query, so those
  * filesystems should use inode_peek_iversion to grab the value to be stored.
  * There is no need to flag the value as having been queried in that case.
+ *
+ * Notes on atime updates
+ * ----------------------
+ * Access time (atime) updates due to reads or similar activity do not represent
+ * an explicit change to the inode data or metadata. If the only change to the
+ * inode is the atime, then i_version should not be incremented. If an observer
+ * cares about atime updates, it should plan to fetch and store the atime in
+ * conjunction with the i_version.
  */
 
 /*
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 2/7] ext4: fix i_version handling in ext4
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
  2022-08-26 21:46 ` [PATCH v3 1/7] iversion: update comments with info about atime updates Jeff Layton
@ 2022-08-26 21:46 ` Jeff Layton
  2022-08-26 21:46 ` [PATCH v3 3/7] ext4: unconditionally enable the i_version counter Jeff Layton
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:46 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs

ext4 currently updates the i_version counter when the atime is updated
during a read. This is less than ideal as it can cause unnecessary cache
invalidations with NFSv4 and unnecessary remeasurements for IMA.

The increment in ext4_mark_iloc_dirty is also problematic since it can
corrupt the i_version counter for ea_inodes. We aren't bumping the file
times in ext4_mark_iloc_dirty, so changing the i_version there seems
wrong, and is the cause of both problems.

Remove that callsite and add increments to the setattr, setxattr and
ioctl codepaths, at the same times that we update the ctime. The
i_version bump that already happens during timestamp updates should take
care of the rest.

In ext4_move_extents, increment the i_version on both inodes, and also
add in missing ctime updates.

Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ext4/inode.c       | 10 +++++-----
 fs/ext4/ioctl.c       |  8 ++++++++
 fs/ext4/move_extent.c |  8 ++++++++
 fs/ext4/xattr.c       |  2 ++
 4 files changed, 23 insertions(+), 5 deletions(-)

This may have some minor conflicts with one of Lukas' patches. It shouldn't
be too hard to resolve though.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..aa37bce4c541 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5342,6 +5342,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	int error, rc = 0;
 	int orphan = 0;
 	const unsigned int ia_valid = attr->ia_valid;
+	bool inc_ivers = IS_I_VERSION(inode);
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
@@ -5425,8 +5426,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 			return -EINVAL;
 		}
 
-		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
-			inode_inc_iversion(inode);
+		if (attr->ia_size == inode->i_size)
+			inc_ivers = false;
 
 		if (shrink) {
 			if (ext4_should_order_data(inode)) {
@@ -5528,6 +5529,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	}
 
 	if (!error) {
+		if (inc_ivers)
+			inode_inc_iversion(inode);
 		setattr_copy(mnt_userns, inode, attr);
 		mark_inode_dirty(inode);
 	}
@@ -5731,9 +5734,6 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 	}
 	ext4_fc_track_inode(handle, inode);
 
-	if (IS_I_VERSION(inode))
-		inode_inc_iversion(inode);
-
 	/* the do_update_inode consumes one bh->b_count */
 	get_bh(iloc->bh);
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 3cf3ec4b1c21..60e77ae9342d 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -452,6 +452,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 	swap_inode_data(inode, inode_bl);
 
 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
+	if (IS_I_VERSION(inode))
+		inode_inc_iversion(inode);
 
 	inode->i_generation = prandom_u32();
 	inode_bl->i_generation = prandom_u32();
@@ -665,6 +667,8 @@ static int ext4_ioctl_setflags(struct inode *inode,
 	ext4_set_inode_flags(inode, false);
 
 	inode->i_ctime = current_time(inode);
+	if (IS_I_VERSION(inode))
+		inode_inc_iversion(inode);
 
 	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
 flags_err:
@@ -775,6 +779,8 @@ static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 
 	EXT4_I(inode)->i_projid = kprojid;
 	inode->i_ctime = current_time(inode);
+	if (IS_I_VERSION(inode))
+		inode_inc_iversion(inode);
 out_dirty:
 	rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
 	if (!err)
@@ -1257,6 +1263,8 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		err = ext4_reserve_inode_write(handle, inode, &iloc);
 		if (err == 0) {
 			inode->i_ctime = current_time(inode);
+			if (IS_I_VERSION(inode))
+				inode_inc_iversion(inode);
 			inode->i_generation = generation;
 			err = ext4_mark_iloc_dirty(handle, inode, &iloc);
 		}
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 701f1d6a217f..d73ab3153218 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/iversion.h>
 #include <linux/quotaops.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
@@ -683,6 +684,13 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
 			break;
 		o_start += cur_len;
 		d_start += cur_len;
+
+		orig_inode->i_ctime = current_time(orig_inode);
+		donor_inode->i_ctime = current_time(donor_inode);
+		if (IS_I_VERSION(orig_inode))
+			inode_inc_iversion(orig_inode);
+		if (IS_I_VERSION(donor_inode))
+			inode_inc_iversion(donor_inode);
 	}
 	*moved_len = o_start - orig_blk;
 	if (*moved_len > len)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 533216e80fa2..e975442e4ab2 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2412,6 +2412,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	if (!error) {
 		ext4_xattr_update_super_block(handle, inode->i_sb);
 		inode->i_ctime = current_time(inode);
+		if (IS_I_VERSION(inode))
+			inode_inc_iversion(inode);
 		if (!value)
 			no_expand = 0;
 		error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 3/7] ext4: unconditionally enable the i_version counter
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
  2022-08-26 21:46 ` [PATCH v3 1/7] iversion: update comments with info about atime updates Jeff Layton
  2022-08-26 21:46 ` [PATCH v3 2/7] ext4: fix i_version handling in ext4 Jeff Layton
@ 2022-08-26 21:46 ` Jeff Layton
  2022-08-29 14:51   ` Jan Kara
  2022-08-26 21:47 ` [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time Jeff Layton
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:46 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Benjamin Coddington,
	Christoph Hellwig

The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).

Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, remove the handling of
Opt_i_version as well since it's due for deprecation anyway.

Ideally, we'd couple this change with a way to disable the i_version
counter (just in case), but the way the iversion mount option was
implemented makes that difficult to do. We'd need to add a new mount
option altogether or do something with tune2fs. That's probably best
left to later patches if it turns out to be needed.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ext4/inode.c       |  2 +-
 fs/ext4/ioctl.c       | 12 ++++--------
 fs/ext4/move_extent.c |  6 ++----
 fs/ext4/super.c       | 13 ++++---------
 fs/ext4/xattr.c       |  3 +--
 5 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aa37bce4c541..6ef37269e7c0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5342,7 +5342,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	int error, rc = 0;
 	int orphan = 0;
 	const unsigned int ia_valid = attr->ia_valid;
-	bool inc_ivers = IS_I_VERSION(inode);
+	bool inc_ivers = true;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 60e77ae9342d..ad3a294a88eb 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -452,8 +452,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 	swap_inode_data(inode, inode_bl);
 
 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
-	if (IS_I_VERSION(inode))
-		inode_inc_iversion(inode);
+	inode_inc_iversion(inode);
 
 	inode->i_generation = prandom_u32();
 	inode_bl->i_generation = prandom_u32();
@@ -667,8 +666,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
 	ext4_set_inode_flags(inode, false);
 
 	inode->i_ctime = current_time(inode);
-	if (IS_I_VERSION(inode))
-		inode_inc_iversion(inode);
+	inode_inc_iversion(inode);
 
 	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
 flags_err:
@@ -779,8 +777,7 @@ static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 
 	EXT4_I(inode)->i_projid = kprojid;
 	inode->i_ctime = current_time(inode);
-	if (IS_I_VERSION(inode))
-		inode_inc_iversion(inode);
+	inode_inc_iversion(inode);
 out_dirty:
 	rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
 	if (!err)
@@ -1263,8 +1260,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		err = ext4_reserve_inode_write(handle, inode, &iloc);
 		if (err == 0) {
 			inode->i_ctime = current_time(inode);
-			if (IS_I_VERSION(inode))
-				inode_inc_iversion(inode);
+			inode_inc_iversion(inode);
 			inode->i_generation = generation;
 			err = ext4_mark_iloc_dirty(handle, inode, &iloc);
 		}
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d73ab3153218..285700b00d38 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -687,10 +687,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
 
 		orig_inode->i_ctime = current_time(orig_inode);
 		donor_inode->i_ctime = current_time(donor_inode);
-		if (IS_I_VERSION(orig_inode))
-			inode_inc_iversion(orig_inode);
-		if (IS_I_VERSION(donor_inode))
-			inode_inc_iversion(donor_inode);
+		inode_inc_iversion(orig_inode);
+		inode_inc_iversion(donor_inode);
 	}
 	*moved_len = o_start - orig_blk;
 	if (*moved_len > len)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..e7cf5361245a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1585,7 +1585,7 @@ enum {
 	Opt_inlinecrypt,
 	Opt_usrjquota, Opt_grpjquota, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
 	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
@@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	fsparam_flag	("barrier",		Opt_barrier),
 	fsparam_u32	("barrier",		Opt_barrier),
 	fsparam_flag	("nobarrier",		Opt_nobarrier),
-	fsparam_flag	("i_version",		Opt_i_version),
 	fsparam_flag	("dax",			Opt_dax),
 	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
 	fsparam_u32	("stripe",		Opt_stripe),
@@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case Opt_abort:
 		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
 		return 0;
-	case Opt_i_version:
-		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
-		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
-		ctx_set_flags(ctx, SB_I_VERSION);
-		return 0;
 	case Opt_inlinecrypt:
 #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
 		ctx_set_flags(ctx, SB_INLINECRYPT);
@@ -2970,8 +2964,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
 	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
 		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
-	if (sb->s_flags & SB_I_VERSION)
-		SEQ_OPTS_PUTS("i_version");
 	if (nodefs || sbi->s_stripe)
 		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
 	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
@@ -4640,6 +4632,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
 
+	/* i_version is always enabled now */
+	sb->s_flags |= SB_I_VERSION;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
 	    (ext4_has_compat_features(sb) ||
 	     ext4_has_ro_compat_features(sb) ||
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e975442e4ab2..36d6ba7190b6 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2412,8 +2412,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	if (!error) {
 		ext4_xattr_update_super_block(handle, inode->i_sb);
 		inode->i_ctime = current_time(inode);
-		if (IS_I_VERSION(inode))
-			inode_inc_iversion(inode);
+		inode_inc_iversion(inode);
 		if (!value)
 			no_expand = 0;
 		error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
                   ` (2 preceding siblings ...)
  2022-08-26 21:46 ` [PATCH v3 3/7] ext4: unconditionally enable the i_version counter Jeff Layton
@ 2022-08-26 21:47 ` Jeff Layton
  2022-08-27  7:26   ` Amir Goldstein
  2022-08-26 21:47 ` [PATCH v3 5/7] vfs: report an inode version in statx for IS_I_VERSION inodes Jeff Layton
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:47 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, David Wysochanski

xfs will update the i_version when updating only the atime value, which
is not desirable for any of the current consumers of i_version. Doing so
leads to unnecessary cache invalidations on NFS and extra measurement
activity in IMA.

Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
transaction should not update the i_version. Set that value in
xfs_vn_update_time if we're only updating the atime.

Cc: Dave Chinner <david@fromorbit.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/xfs/libxfs/xfs_log_format.h  |  2 +-
 fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
 fs/xfs/xfs_iops.c               | 11 +++++++++--
 3 files changed, 11 insertions(+), 4 deletions(-)

Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
the problem. I still think this approach should at least fix the worst
problems with atime updates being counted. We can look to carve out
other "spurious" i_version updates as we identify them.

If however there are offline analysis tools that require atime updates
to be counted, then we won't be able to do this. If that's the case, how
can we fix this such that serving xfs via NFSv4 doesn't suck?

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b351b9dc6561..866a4c5cf70c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -323,7 +323,7 @@ struct xfs_inode_log_format_32 {
 #define	XFS_ILOG_ABROOT	0x100	/* log i_af.i_broot */
 #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
 #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
-
+#define XFS_ILOG_NOIVER	0x800	/* don't bump i_version */
 
 /*
  * The timestamps are dirty, but not necessarily anything else in the inode
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..ffe6d296e7f9 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -126,7 +126,7 @@ xfs_trans_log_inode(
 	 * unconditionally.
 	 */
 	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
-		if (IS_I_VERSION(inode) &&
+		if (!(flags & XFS_ILOG_NOIVER) && IS_I_VERSION(inode) &&
 		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
 			iversion_flags = XFS_ILOG_CORE;
 	}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 45518b8c613c..94f14d96641b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1041,10 +1041,17 @@ xfs_vn_update_time(
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (flags & S_CTIME)
+
+	if (!(flags & S_VERSION))
+		log_flags |= XFS_ILOG_NOIVER;
+	if (flags & S_CTIME) {
 		inode->i_ctime = *now;
-	if (flags & S_MTIME)
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
+	if (flags & S_MTIME) {
 		inode->i_mtime = *now;
+		log_flags &= ~XFS_ILOG_NOIVER;
+	}
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 5/7] vfs: report an inode version in statx for IS_I_VERSION inodes
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
                   ` (3 preceding siblings ...)
  2022-08-26 21:47 ` [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time Jeff Layton
@ 2022-08-26 21:47 ` Jeff Layton
  2022-08-26 21:47 ` [PATCH v3 6/7] nfs: report the inode version in statx if requested Jeff Layton
  2022-08-26 21:47 ` [PATCH v3 7/7] ceph: fill in the change attribute in statx requests Jeff Layton
  6 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:47 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Jeff Layton, David Howells,
	Frank Filz

From: Jeff Layton <jlayton@redhat.com>

The NFS server and IMA both rely heavily on the i_version counter, but
it's largely invisible to userland, which makes it difficult to test its
behavior. This value would also be of use to userland NFS servers, and
other applications that want a reliable way to know whether there might
have been an explicit change to an inode since they last checked.

Claim one of the spare fields in struct statx to hold a 64-bit inode
version attribute. This value must change with any explicit, observeable
metadata or data change. Note that atime updates are excluded from this,
unless it is due to an explicit change via utimes or similar mechanism.

When statx requests this attribute on an IS_I_VERSION inode, do an
inode_query_iversion and fill the result in the field. Also, update the
test-statx.c program to display the inode version and the mountid.

Cc: David Howells <dhowells@redhat.com>
Cc: Frank Filz <ffilzlnx@mindspring.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c                 | 7 +++++++
 include/linux/stat.h      | 1 +
 include/uapi/linux/stat.h | 3 ++-
 samples/vfs/test-statx.c  | 8 ++++++--
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index 9ced8860e0f3..d892909836aa 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -17,6 +17,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/compat.h>
+#include <linux/iversion.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -118,6 +119,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
 				  STATX_ATTR_DAX);
 
+	if ((request_mask & STATX_INO_VERSION) && IS_I_VERSION(inode)) {
+		stat->result_mask |= STATX_INO_VERSION;
+		stat->ino_version = inode_query_iversion(inode);
+	}
+
 	mnt_userns = mnt_user_ns(path->mnt);
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(mnt_userns, path, stat,
@@ -611,6 +617,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_dev_major = MAJOR(stat->dev);
 	tmp.stx_dev_minor = MINOR(stat->dev);
 	tmp.stx_mnt_id = stat->mnt_id;
+	tmp.stx_ino_version = stat->ino_version;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 7df06931f25d..9cd77eb7bc1a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -50,6 +50,7 @@ struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	u64		ino_version;
 };
 
 #endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1500a0f58041..48d9307d7f31 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -124,7 +124,7 @@ struct statx {
 	__u32	stx_dev_minor;
 	/* 0x90 */
 	__u64	stx_mnt_id;
-	__u64	__spare2;
+	__u64	stx_ino_version; /* Inode change attribute */
 	/* 0xa0 */
 	__u64	__spare3[12];	/* Spare space for future expansion */
 	/* 0x100 */
@@ -152,6 +152,7 @@ struct statx {
 #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
+#define STATX_INO_VERSION	0x00002000U	/* Want/got stx_change_attr */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..23e68036fdfb 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
 	printf("Device: %-15s", buffer);
 	if (stx->stx_mask & STATX_INO)
 		printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
+	if (stx->stx_mask & STATX_MNT_ID)
+		printf(" MountId: %llx", stx->stx_mnt_id);
 	if (stx->stx_mask & STATX_NLINK)
 		printf(" Links: %-5u", stx->stx_nlink);
 	if (stx->stx_mask & STATX_TYPE) {
@@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
 	if (stx->stx_mask & STATX_CTIME)
 		print_time("Change: ", &stx->stx_ctime);
 	if (stx->stx_mask & STATX_BTIME)
-		print_time(" Birth: ", &stx->stx_btime);
+		print_time("Birth: ", &stx->stx_btime);
+	if (stx->stx_mask & STATX_INO_VERSION)
+		printf("Inode Version: 0x%llx\n", stx->stx_ino_version);
 
 	if (stx->stx_attributes_mask) {
 		unsigned char bits, mbits;
@@ -218,7 +222,7 @@ int main(int argc, char **argv)
 	struct statx stx;
 	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
 
-	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_INO_VERSION;
 
 	for (argv++; *argv; argv++) {
 		if (strcmp(*argv, "-F") == 0) {
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 6/7] nfs: report the inode version in statx if requested
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
                   ` (4 preceding siblings ...)
  2022-08-26 21:47 ` [PATCH v3 5/7] vfs: report an inode version in statx for IS_I_VERSION inodes Jeff Layton
@ 2022-08-26 21:47 ` Jeff Layton
  2022-08-26 21:47 ` [PATCH v3 7/7] ceph: fill in the change attribute in statx requests Jeff Layton
  6 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:47 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs

Allow NFS to report the i_version in statx. Since the cost to fetch it
is relatively cheap, do it unconditionally and just set the flag if it
looks like it's valid.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/inode.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bea7c005119c..88c732a5c821 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
 		reply_mask |= STATX_UID | STATX_GID;
 	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
 		reply_mask |= STATX_BLOCKS;
+	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
+		reply_mask |= STATX_INO_VERSION;
 	return reply_mask;
 }
 
@@ -848,7 +850,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
 			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
-			STATX_INO | STATX_SIZE | STATX_BLOCKS;
+			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_INO_VERSION;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
 		if (readdirplus_enabled)
@@ -877,7 +879,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	/* Is the user requesting attributes that might need revalidation? */
 	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
 					STATX_MTIME|STATX_UID|STATX_GID|
-					STATX_SIZE|STATX_BLOCKS)))
+					STATX_SIZE|STATX_BLOCKS|STATX_INO_VERSION)))
 		goto out_no_revalidate;
 
 	/* Check whether the cached attributes are stale */
@@ -915,6 +917,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+	stat->ino_version = inode_peek_iversion_raw(inode);
 	if (S_ISDIR(inode->i_mode))
 		stat->blksize = NFS_SERVER(inode)->dtsize;
 out:
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v3 7/7] ceph: fill in the change attribute in statx requests
  2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
                   ` (5 preceding siblings ...)
  2022-08-26 21:47 ` [PATCH v3 6/7] nfs: report the inode version in statx if requested Jeff Layton
@ 2022-08-26 21:47 ` Jeff Layton
  6 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-26 21:47 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner
  Cc: linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs

When statx requests the change attribute, request the full gamut of caps
(similarly to how ctime is handled). When the change attribute seems to
be valid, return it in the ino_version field.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42351d7a0dd6..ccc926a7dcb0 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2415,10 +2415,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 {
 	int mask = 0;
 
-	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME))
+	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME|STATX_INO_VERSION))
 		mask |= CEPH_CAP_AUTH_SHARED;
 
-	if (want & (STATX_NLINK|STATX_CTIME)) {
+	if (want & (STATX_NLINK|STATX_CTIME|STATX_INO_VERSION)) {
 		/*
 		 * The link count for directories depends on inode->i_subdirs,
 		 * and that is only updated when Fs caps are held.
@@ -2429,11 +2429,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 			mask |= CEPH_CAP_LINK_SHARED;
 	}
 
-	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
-		    STATX_BLOCKS))
+	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|STATX_BLOCKS|STATX_INO_VERSION))
 		mask |= CEPH_CAP_FILE_SHARED;
 
-	if (want & (STATX_CTIME))
+	if (want & (STATX_CTIME|STATX_INO_VERSION))
 		mask |= CEPH_CAP_XATTR_SHARED;
 
 	return mask;
@@ -2475,6 +2474,11 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		valid_mask |= STATX_BTIME;
 	}
 
+	if (request_mask & STATX_INO_VERSION) {
+		stat->ino_version = inode_peek_iversion_raw(inode);
+		valid_mask |= STATX_INO_VERSION;
+	}
+
 	if (ceph_snap(inode) == CEPH_NOSNAP)
 		stat->dev = inode->i_sb->s_dev;
 	else
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-26 21:47 ` [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time Jeff Layton
@ 2022-08-27  7:26   ` Amir Goldstein
  2022-08-27  8:01     ` Amir Goldstein
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2022-08-27  7:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Tso, Andreas Dilger, Darrick J. Wong, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, Mimi Zohar, xiubli,
	Chuck Lever, Lukas Czerner, Jan Kara, Christian Brauner,
	Linux API, Linux Btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	Ext4, Linux NFS Mailing List, linux-xfs, David Wysochanski

On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> xfs will update the i_version when updating only the atime value, which
> is not desirable for any of the current consumers of i_version. Doing so
> leads to unnecessary cache invalidations on NFS and extra measurement
> activity in IMA.
>
> Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> transaction should not update the i_version. Set that value in
> xfs_vn_update_time if we're only updating the atime.
>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: David Wysochanski <dwysocha@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
>  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
>  fs/xfs/xfs_iops.c               | 11 +++++++++--
>  3 files changed, 11 insertions(+), 4 deletions(-)
>
> Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> the problem. I still think this approach should at least fix the worst
> problems with atime updates being counted. We can look to carve out
> other "spurious" i_version updates as we identify them.
>

AFAIK, "spurious" is only inode blocks map changes due to writeback
of dirty pages. Anybody know about other cases?

Regarding inode blocks map changes, first of all, I don't think that there is
any practical loss from invalidating NFS client cache on dirty data writeback,
because NFS server should be serving cold data most of the time.
If there are a few unneeded cache invalidations they would only be temporary.

One may even consider if NFSv4 server should not flush dirty data of an inode
before granting a read lease to client.
After all, if read lease was granted, client cached data and then server crashed
before persisting the dirty data, then client will have cached a
"future" version
of the data and if i_version on the server did not roll back in that situation,
we are looking at possible data corruptions.

Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
Storing in xattr a data checksum for data that is not persistent on disk
would be an odd choice.

So in my view, I only see benefits to current i_version users in the xfs
i_version implementations and I don't think that it contradicts the
i_version definition in the man page patch.

> If however there are offline analysis tools that require atime updates
> to be counted, then we won't be able to do this. If that's the case, how
> can we fix this such that serving xfs via NFSv4 doesn't suck?
>

If I read the arguments correctly, implicit atime updates could be relaxed
as long as this behavior is clearly documented and coherent on all
implementations.

Forensics and other applications that care about atime updates can and
should check atime and don't need i_version to know that it was changed.
The reliability of atime as an audit tool has dropped considerably since
the default in relatime.
If we want to be paranoid, maybe we can leave i_version increment on
atime updates in case the user opted-in to strict '-o atime' updates, but
IMO, there is no need for that.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27  7:26   ` Amir Goldstein
@ 2022-08-27  8:01     ` Amir Goldstein
  2022-08-27 13:14       ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2022-08-27  8:01 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Tso, Andreas Dilger, Darrick J. Wong, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, Mimi Zohar, xiubli,
	Chuck Lever, Lukas Czerner, Jan Kara, Christian Brauner,
	Linux API, Linux Btrfs, linux-fsdevel, linux-kernel, Ext4,
	Linux NFS Mailing List, linux-xfs, David Wysochanski, ceph-devel

On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > xfs will update the i_version when updating only the atime value, which
> > is not desirable for any of the current consumers of i_version. Doing so
> > leads to unnecessary cache invalidations on NFS and extra measurement
> > activity in IMA.
> >
> > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > transaction should not update the i_version. Set that value in
> > xfs_vn_update_time if we're only updating the atime.
> >
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: David Wysochanski <dwysocha@redhat.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> >  3 files changed, 11 insertions(+), 4 deletions(-)
> >
> > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > the problem. I still think this approach should at least fix the worst
> > problems with atime updates being counted. We can look to carve out
> > other "spurious" i_version updates as we identify them.
> >
>
> AFAIK, "spurious" is only inode blocks map changes due to writeback
> of dirty pages. Anybody know about other cases?
>
> Regarding inode blocks map changes, first of all, I don't think that there is
> any practical loss from invalidating NFS client cache on dirty data writeback,
> because NFS server should be serving cold data most of the time.
> If there are a few unneeded cache invalidations they would only be temporary.
>

Unless there is an issue with a writer NFS client that invalidates its
own attribute
caches on server data writeback?

> One may even consider if NFSv4 server should not flush dirty data of an inode
> before granting a read lease to client.
> After all, if read lease was granted, client cached data and then server crashed
> before persisting the dirty data, then client will have cached a
> "future" version
> of the data and if i_version on the server did not roll back in that situation,
> we are looking at possible data corruptions.
>
> Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> Storing in xattr a data checksum for data that is not persistent on disk
> would be an odd choice.
>
> So in my view, I only see benefits to current i_version users in the xfs
> i_version implementations and I don't think that it contradicts the
> i_version definition in the man page patch.
>
> > If however there are offline analysis tools that require atime updates
> > to be counted, then we won't be able to do this. If that's the case, how
> > can we fix this such that serving xfs via NFSv4 doesn't suck?
> >
>
> If I read the arguments correctly, implicit atime updates could be relaxed
> as long as this behavior is clearly documented and coherent on all
> implementations.
>
> Forensics and other applications that care about atime updates can and
> should check atime and don't need i_version to know that it was changed.
> The reliability of atime as an audit tool has dropped considerably since
> the default in relatime.
> If we want to be paranoid, maybe we can leave i_version increment on
> atime updates in case the user opted-in to strict '-o atime' updates, but
> IMO, there is no need for that.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27  8:01     ` Amir Goldstein
@ 2022-08-27 13:14       ` Jeff Layton
  2022-08-27 15:46         ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-27 13:14 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Tso, Andreas Dilger, Darrick J. Wong, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, Mimi Zohar, xiubli,
	Chuck Lever, Lukas Czerner, Jan Kara, Christian Brauner,
	Linux API, Linux Btrfs, linux-fsdevel, linux-kernel, Ext4,
	Linux NFS Mailing List, linux-xfs, David Wysochanski, ceph-devel

On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > xfs will update the i_version when updating only the atime value, which
> > > is not desirable for any of the current consumers of i_version. Doing so
> > > leads to unnecessary cache invalidations on NFS and extra measurement
> > > activity in IMA.
> > > 
> > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > > transaction should not update the i_version. Set that value in
> > > xfs_vn_update_time if we're only updating the atime.
> > > 
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > 
> > > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > > the problem. I still think this approach should at least fix the worst
> > > problems with atime updates being counted. We can look to carve out
> > > other "spurious" i_version updates as we identify them.
> > > 
> > 
> > AFAIK, "spurious" is only inode blocks map changes due to writeback
> > of dirty pages. Anybody know about other cases?
> > 
> > Regarding inode blocks map changes, first of all, I don't think that there is
> > any practical loss from invalidating NFS client cache on dirty data writeback,
> > because NFS server should be serving cold data most of the time.
> > If there are a few unneeded cache invalidations they would only be temporary.
> > 
> 
> Unless there is an issue with a writer NFS client that invalidates its
> own attribute
> caches on server data writeback?
> 

The client just looks at the file attributes (of which i_version is but
one), and if certain attributes have changed (mtime, ctime, i_version,
etc...) then it invalidates its cache.

In the case of blocks map changes, could that mean a difference in the
observable sparse regions of the file? If so, then a READ_PLUS before
the change and a READ_PLUS after could give different results. Since
that difference is observable by the client, I'd think we'd want to bump
i_version for that anyway.

> > One may even consider if NFSv4 server should not flush dirty data of an inode
> > before granting a read lease to client.
> > After all, if read lease was granted, client cached data and then server crashed
> > before persisting the dirty data, then client will have cached a
> > "future" version
> > of the data and if i_version on the server did not roll back in that situation,
> > we are looking at possible data corruptions.
> > 

We don't hand out read leases if there are file descriptions open for
write. NFS clients usually issue a COMMIT before closing a stateid in
order to satisfy close-to-open cache coherency.

So in most cases, this is probably not an issue. It might still be
worthwhile to make sure of it by doing a filemap_write_and_wait before
we hand out a delegation, but that's likely to be a no-op in most cases
anyway.

Note too that the client will still revalidate its caches when it
receives attributes even when it holds a read delegation. In fact, this
behavior mostly papered over a rather nasty knfsd bug we found recently
where it was allowing conflicting activity to proceed even when there
was a read delegation outstanding.
 
> > Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> > Storing in xattr a data checksum for data that is not persistent on disk
> > would be an odd choice.
> > 
> > So in my view, I only see benefits to current i_version users in the xfs
> > i_version implementations and I don't think that it contradicts the
> > i_version definition in the man page patch.
> > 
> > > If however there are offline analysis tools that require atime updates
> > > to be counted, then we won't be able to do this. If that's the case, how
> > > can we fix this such that serving xfs via NFSv4 doesn't suck?
> > > 
> > 
> > If I read the arguments correctly, implicit atime updates could be relaxed
> > as long as this behavior is clearly documented and coherent on all
> > implementations.
> > 
> > Forensics and other applications that care about atime updates can and
> > should check atime and don't need i_version to know that it was changed.
> > The reliability of atime as an audit tool has dropped considerably since
> > the default in relatime.
> > If we want to be paranoid, maybe we can leave i_version increment on
> > atime updates in case the user opted-in to strict '-o atime' updates, but
> > IMO, there is no need for that.
> > 

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 13:14       ` Jeff Layton
@ 2022-08-27 15:46         ` Darrick J. Wong
  2022-08-27 16:03           ` Trond Myklebust
  2022-08-28 17:30           ` Amir Goldstein
  0 siblings, 2 replies; 46+ messages in thread
From: Darrick J. Wong @ 2022-08-27 15:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Theodore Tso, Andreas Dilger, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, Mimi Zohar, xiubli,
	Chuck Lever, Lukas Czerner, Jan Kara, Christian Brauner,
	Linux API, Linux Btrfs, linux-fsdevel, linux-kernel, Ext4,
	Linux NFS Mailing List, linux-xfs, David Wysochanski, ceph-devel

On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > 
> > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > xfs will update the i_version when updating only the atime value, which
> > > > is not desirable for any of the current consumers of i_version. Doing so
> > > > leads to unnecessary cache invalidations on NFS and extra measurement
> > > > activity in IMA.
> > > > 
> > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that the
> > > > transaction should not update the i_version. Set that value in
> > > > xfs_vn_update_time if we're only updating the atime.
> > > > 
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > 
> > > > Dave has NACK'ed this patch, but I'm sending it as a way to illustrate
> > > > the problem. I still think this approach should at least fix the worst
> > > > problems with atime updates being counted. We can look to carve out
> > > > other "spurious" i_version updates as we identify them.
> > > > 
> > > 
> > > AFAIK, "spurious" is only inode blocks map changes due to writeback
> > > of dirty pages. Anybody know about other cases?
> > > 
> > > Regarding inode blocks map changes, first of all, I don't think that there is
> > > any practical loss from invalidating NFS client cache on dirty data writeback,
> > > because NFS server should be serving cold data most of the time.
> > > If there are a few unneeded cache invalidations they would only be temporary.
> > > 
> > 
> > Unless there is an issue with a writer NFS client that invalidates its
> > own attribute
> > caches on server data writeback?
> > 
> 
> The client just looks at the file attributes (of which i_version is but
> one), and if certain attributes have changed (mtime, ctime, i_version,
> etc...) then it invalidates its cache.
> 
> In the case of blocks map changes, could that mean a difference in the
> observable sparse regions of the file? If so, then a READ_PLUS before
> the change and a READ_PLUS after could give different results. Since
> that difference is observable by the client, I'd think we'd want to bump
> i_version for that anyway.

How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
that's been the subject of recent debate.  At least as far as XFS is
concerned, a file range can go from hole -> delayed allocation
reservation -> unwritten extent -> (actual writeback) -> written extent.
The dance became rather more complex when we added COW.  If any of that
will make a difference for READ_PLUS, then yes, I think you'd want file
writeback activities to bump iversion to cause client invalidations,
like (I think) Dave said.

The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
written and delalloc extents; and an unwritten extent will report data
for any pagecache it finds.

> > > One may even consider if NFSv4 server should not flush dirty data of an inode
> > > before granting a read lease to client.
> > > After all, if read lease was granted, client cached data and then server crashed
> > > before persisting the dirty data, then client will have cached a
> > > "future" version
> > > of the data and if i_version on the server did not roll back in that situation,
> > > we are looking at possible data corruptions.
> > > 
> 
> We don't hand out read leases if there are file descriptions open for
> write. NFS clients usually issue a COMMIT before closing a stateid in
> order to satisfy close-to-open cache coherency.
> 
> So in most cases, this is probably not an issue. It might still be
> worthwhile to make sure of it by doing a filemap_write_and_wait before
> we hand out a delegation, but that's likely to be a no-op in most cases
> anyway.
> 
> Note too that the client will still revalidate its caches when it
> receives attributes even when it holds a read delegation. In fact, this
> behavior mostly papered over a rather nasty knfsd bug we found recently
> where it was allowing conflicting activity to proceed even when there
> was a read delegation outstanding.
>  
> > > Same goes for IMA. IIUC, IMA data checksum would be stored in xattr?
> > > Storing in xattr a data checksum for data that is not persistent on disk
> > > would be an odd choice.
> > > 
> > > So in my view, I only see benefits to current i_version users in the xfs
> > > i_version implementations and I don't think that it contradicts the
> > > i_version definition in the man page patch.
> > > 
> > > > If however there are offline analysis tools that require atime updates
> > > > to be counted, then we won't be able to do this. If that's the case, how
> > > > can we fix this such that serving xfs via NFSv4 doesn't suck?
> > > > 
> > > 
> > > If I read the arguments correctly, implicit atime updates could be relaxed
> > > as long as this behavior is clearly documented and coherent on all
> > > implementations.
> > > 
> > > Forensics and other applications that care about atime updates can and
> > > should check atime and don't need i_version to know that it was changed.
> > > The reliability of atime as an audit tool has dropped considerably since
> > > the default in relatime.

I've been waiting for Amir to appear in this discussion -- ISTR that a
few years ago you were wanting the ability to scan a filesystem to look
for files that have changed since a given point.  If XFS exported its
di_changecount file attribute (as it currently behaves) via BULKSTAT,
you'd have the ability to do that, so long as your application could
persist bulkstat data and compare.

--D

> > > If we want to be paranoid, maybe we can leave i_version increment on
> > > atime updates in case the user opted-in to strict '-o atime' updates, but
> > > IMO, there is no need for that.
> > > 
> 
> Thanks,
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 15:46         ` Darrick J. Wong
@ 2022-08-27 16:03           ` Trond Myklebust
  2022-08-27 16:10             ` Jeff Layton
  2022-08-28 17:30           ` Amir Goldstein
  1 sibling, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2022-08-27 16:03 UTC (permalink / raw)
  To: djwong, jlayton
  Cc: zohar, brauner, xiubli, neilb, linux-api, linux-xfs, dwysocha,
	david, linux-kernel, chuck.lever, linux-nfs, tytso, viro, jack,
	linux-ext4, amir73il, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, ceph-devel

On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > <amir73il@gmail.com> wrote:
> > > > 
> > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > <jlayton@kernel.org> wrote:
> > > > > 
> > > > > xfs will update the i_version when updating only the atime
> > > > > value, which
> > > > > is not desirable for any of the current consumers of
> > > > > i_version. Doing so
> > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > measurement
> > > > > activity in IMA.
> > > > > 
> > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > the
> > > > > transaction should not update the i_version. Set that value
> > > > > in
> > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > 
> > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > 
> > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > illustrate
> > > > > the problem. I still think this approach should at least fix
> > > > > the worst
> > > > > problems with atime updates being counted. We can look to
> > > > > carve out
> > > > > other "spurious" i_version updates as we identify them.
> > > > > 
> > > > 
> > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > writeback
> > > > of dirty pages. Anybody know about other cases?
> > > > 
> > > > Regarding inode blocks map changes, first of all, I don't think
> > > > that there is
> > > > any practical loss from invalidating NFS client cache on dirty
> > > > data writeback,
> > > > because NFS server should be serving cold data most of the
> > > > time.
> > > > If there are a few unneeded cache invalidations they would only
> > > > be temporary.
> > > > 
> > > 
> > > Unless there is an issue with a writer NFS client that
> > > invalidates its
> > > own attribute
> > > caches on server data writeback?
> > > 
> > 
> > The client just looks at the file attributes (of which i_version is
> > but
> > one), and if certain attributes have changed (mtime, ctime,
> > i_version,
> > etc...) then it invalidates its cache.
> > 
> > In the case of blocks map changes, could that mean a difference in
> > the
> > observable sparse regions of the file? If so, then a READ_PLUS
> > before
> > the change and a READ_PLUS after could give different results.
> > Since
> > that difference is observable by the client, I'd think we'd want to
> > bump
> > i_version for that anyway.
> 
> How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> that's been the subject of recent debate.  At least as far as XFS is
> concerned, a file range can go from hole -> delayed allocation
> reservation -> unwritten extent -> (actual writeback) -> written
> extent.
> The dance became rather more complex when we added COW.  If any of
> that
> will make a difference for READ_PLUS, then yes, I think you'd want
> file
> writeback activities to bump iversion to cause client invalidations,
> like (I think) Dave said.
> 
> The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> written and delalloc extents; and an unwritten extent will report
> data
> for any pagecache it finds.
> 

READ_PLUS should never return anything different than a read() system
call would return for any given area. The way it reports sparse regions
vs. data regions is purely an RPC formatting convenience.

The only point to note about NFS READ and READ_PLUS is that because the
client is forced to send multiple RPC calls if the user is trying to
read a region that is larger than the 'rsize' value, it is possible
that these READ/READ_PLUS calls may be processed out of order, and so
the result may end up looking different than if you had executed a
read() call for the full region directly on the server.
However each individual READ / READ_PLUS reply should look as if the
user had called read() on that rsize-sized section of the file.
> > > 



-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 16:03           ` Trond Myklebust
@ 2022-08-27 16:10             ` Jeff Layton
  2022-08-27 17:06               ` Trond Myklebust
  2022-08-28 13:25               ` Amir Goldstein
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-27 16:10 UTC (permalink / raw)
  To: Trond Myklebust, djwong
  Cc: zohar, brauner, xiubli, neilb, linux-api, linux-xfs, dwysocha,
	david, linux-kernel, chuck.lever, linux-nfs, tytso, viro, jack,
	linux-ext4, amir73il, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, ceph-devel

On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > <amir73il@gmail.com> wrote:
> > > > > 
> > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > <jlayton@kernel.org> wrote:
> > > > > > 
> > > > > > xfs will update the i_version when updating only the atime
> > > > > > value, which
> > > > > > is not desirable for any of the current consumers of
> > > > > > i_version. Doing so
> > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > measurement
> > > > > > activity in IMA.
> > > > > > 
> > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > the
> > > > > > transaction should not update the i_version. Set that value
> > > > > > in
> > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > 
> > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > illustrate
> > > > > > the problem. I still think this approach should at least fix
> > > > > > the worst
> > > > > > problems with atime updates being counted. We can look to
> > > > > > carve out
> > > > > > other "spurious" i_version updates as we identify them.
> > > > > > 
> > > > > 
> > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > writeback
> > > > > of dirty pages. Anybody know about other cases?
> > > > > 
> > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > that there is
> > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > data writeback,
> > > > > because NFS server should be serving cold data most of the
> > > > > time.
> > > > > If there are a few unneeded cache invalidations they would only
> > > > > be temporary.
> > > > > 
> > > > 
> > > > Unless there is an issue with a writer NFS client that
> > > > invalidates its
> > > > own attribute
> > > > caches on server data writeback?
> > > > 
> > > 
> > > The client just looks at the file attributes (of which i_version is
> > > but
> > > one), and if certain attributes have changed (mtime, ctime,
> > > i_version,
> > > etc...) then it invalidates its cache.
> > > 
> > > In the case of blocks map changes, could that mean a difference in
> > > the
> > > observable sparse regions of the file? If so, then a READ_PLUS
> > > before
> > > the change and a READ_PLUS after could give different results.
> > > Since
> > > that difference is observable by the client, I'd think we'd want to
> > > bump
> > > i_version for that anyway.
> > 
> > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > that's been the subject of recent debate.  At least as far as XFS is
> > concerned, a file range can go from hole -> delayed allocation
> > reservation -> unwritten extent -> (actual writeback) -> written
> > extent.
> > The dance became rather more complex when we added COW.  If any of
> > that
> > will make a difference for READ_PLUS, then yes, I think you'd want
> > file
> > writeback activities to bump iversion to cause client invalidations,
> > like (I think) Dave said.
> > 
> > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > written and delalloc extents; and an unwritten extent will report
> > data
> > for any pagecache it finds.
> > 
> 
> READ_PLUS should never return anything different than a read() system
> call would return for any given area. The way it reports sparse regions
> vs. data regions is purely an RPC formatting convenience.
> 
> The only point to note about NFS READ and READ_PLUS is that because the
> client is forced to send multiple RPC calls if the user is trying to
> read a region that is larger than the 'rsize' value, it is possible
> that these READ/READ_PLUS calls may be processed out of order, and so
> the result may end up looking different than if you had executed a
> read() call for the full region directly on the server.
> However each individual READ / READ_PLUS reply should look as if the
> user had called read() on that rsize-sized section of the file.
> > > 

Yeah, thinking about it some more, simply changing the block allocation
is not something that should affect the ctime, so we probably don't want
to bump i_version on it. It's an implicit change, IOW, not an explicit
one.

The fact that xfs might do that is unfortunate, but it's not the end of
the world and it still would conform to the proposed definition for
i_version. In practice, this sort of allocation change should come soon
after the file was written, so one would hope that any damage due to the
false i_version bump would be minimized.

It would be nice to teach it not to do that however. Maybe we can insert
the NOIVER flag at a strategic place to avoid it?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 16:10             ` Jeff Layton
@ 2022-08-27 17:06               ` Trond Myklebust
  2022-08-28 13:25               ` Amir Goldstein
  1 sibling, 0 replies; 46+ messages in thread
From: Trond Myklebust @ 2022-08-27 17:06 UTC (permalink / raw)
  To: djwong, jlayton
  Cc: zohar, xiubli, brauner, dwysocha, linux-api, linux-xfs, david,
	neilb, linux-kernel, chuck.lever, linux-nfs, tytso, linux-ext4,
	jack, amir73il, linux-btrfs, viro, linux-fsdevel, lczerner,
	adilger.kernel, ceph-devel

On Sat, 2022-08-27 at 12:10 -0400, Jeff Layton wrote:
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > <amir73il@gmail.com> wrote:
> > > > > > 
> > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > <jlayton@kernel.org> wrote:
> > > > > > > 
> > > > > > > xfs will update the i_version when updating only the
> > > > > > > atime
> > > > > > > value, which
> > > > > > > is not desirable for any of the current consumers of
> > > > > > > i_version. Doing so
> > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > measurement
> > > > > > > activity in IMA.
> > > > > > > 
> > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate
> > > > > > > that
> > > > > > > the
> > > > > > > transaction should not update the i_version. Set that
> > > > > > > value
> > > > > > > in
> > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > 
> > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way
> > > > > > > to
> > > > > > > illustrate
> > > > > > > the problem. I still think this approach should at least
> > > > > > > fix
> > > > > > > the worst
> > > > > > > problems with atime updates being counted. We can look to
> > > > > > > carve out
> > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > 
> > > > > > 
> > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > writeback
> > > > > > of dirty pages. Anybody know about other cases?
> > > > > > 
> > > > > > Regarding inode blocks map changes, first of all, I don't
> > > > > > think
> > > > > > that there is
> > > > > > any practical loss from invalidating NFS client cache on
> > > > > > dirty
> > > > > > data writeback,
> > > > > > because NFS server should be serving cold data most of the
> > > > > > time.
> > > > > > If there are a few unneeded cache invalidations they would
> > > > > > only
> > > > > > be temporary.
> > > > > > 
> > > > > 
> > > > > Unless there is an issue with a writer NFS client that
> > > > > invalidates its
> > > > > own attribute
> > > > > caches on server data writeback?
> > > > > 
> > > > 
> > > > The client just looks at the file attributes (of which
> > > > i_version is
> > > > but
> > > > one), and if certain attributes have changed (mtime, ctime,
> > > > i_version,
> > > > etc...) then it invalidates its cache.
> > > > 
> > > > In the case of blocks map changes, could that mean a difference
> > > > in
> > > > the
> > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > before
> > > > the change and a READ_PLUS after could give different results.
> > > > Since
> > > > that difference is observable by the client, I'd think we'd
> > > > want to
> > > > bump
> > > > i_version for that anyway.
> > > 
> > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I
> > > know
> > > that's been the subject of recent debate.  At least as far as XFS
> > > is
> > > concerned, a file range can go from hole -> delayed allocation
> > > reservation -> unwritten extent -> (actual writeback) -> written
> > > extent.
> > > The dance became rather more complex when we added COW.  If any
> > > of
> > > that
> > > will make a difference for READ_PLUS, then yes, I think you'd
> > > want
> > > file
> > > writeback activities to bump iversion to cause client
> > > invalidations,
> > > like (I think) Dave said.
> > > 
> > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data
> > > for
> > > written and delalloc extents; and an unwritten extent will report
> > > data
> > > for any pagecache it finds.
> > > 
> > 
> > READ_PLUS should never return anything different than a read()
> > system
> > call would return for any given area. The way it reports sparse
> > regions
> > vs. data regions is purely an RPC formatting convenience.
> > 
> > The only point to note about NFS READ and READ_PLUS is that because
> > the
> > client is forced to send multiple RPC calls if the user is trying
> > to
> > read a region that is larger than the 'rsize' value, it is possible
> > that these READ/READ_PLUS calls may be processed out of order, and
> > so
> > the result may end up looking different than if you had executed a
> > read() call for the full region directly on the server.
> > However each individual READ / READ_PLUS reply should look as if
> > the
> > user had called read() on that rsize-sized section of the file.
> > > > 
> 
> Yeah, thinking about it some more, simply changing the block
> allocation
> is not something that should affect the ctime, so we probably don't
> want
> to bump i_version on it. It's an implicit change, IOW, not an
> explicit
> one.

As you say, it is unfortunate that XFS does this, and it is unfortunate
that it then changes the blocks allocated attribute post-fsync(), since
all that does cause confusion for certain applications.
However I agree 100% that this is an implicit change that is driven by
the filesystem and not the user application. Hence it is not an action
that needs to be recorded with a change attribute bump.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 16:10             ` Jeff Layton
  2022-08-27 17:06               ` Trond Myklebust
@ 2022-08-28 13:25               ` Amir Goldstein
  2022-08-28 14:37                 ` Jeff Layton
  1 sibling, 1 reply; 46+ messages in thread
From: Amir Goldstein @ 2022-08-28 13:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, djwong, zohar, brauner, xiubli, neilb,
	linux-api, linux-xfs, dwysocha, david, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > <amir73il@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > <jlayton@kernel.org> wrote:
> > > > > > >
> > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > value, which
> > > > > > > is not desirable for any of the current consumers of
> > > > > > > i_version. Doing so
> > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > measurement
> > > > > > > activity in IMA.
> > > > > > >
> > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > the
> > > > > > > transaction should not update the i_version. Set that value
> > > > > > > in
> > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > >
> > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > illustrate
> > > > > > > the problem. I still think this approach should at least fix
> > > > > > > the worst
> > > > > > > problems with atime updates being counted. We can look to
> > > > > > > carve out
> > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > >
> > > > > >
> > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > writeback
> > > > > > of dirty pages. Anybody know about other cases?
> > > > > >
> > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > that there is
> > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > data writeback,
> > > > > > because NFS server should be serving cold data most of the
> > > > > > time.
> > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > be temporary.
> > > > > >
> > > > >
> > > > > Unless there is an issue with a writer NFS client that
> > > > > invalidates its
> > > > > own attribute
> > > > > caches on server data writeback?
> > > > >
> > > >
> > > > The client just looks at the file attributes (of which i_version is
> > > > but
> > > > one), and if certain attributes have changed (mtime, ctime,
> > > > i_version,
> > > > etc...) then it invalidates its cache.
> > > >
> > > > In the case of blocks map changes, could that mean a difference in
> > > > the
> > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > before
> > > > the change and a READ_PLUS after could give different results.
> > > > Since
> > > > that difference is observable by the client, I'd think we'd want to
> > > > bump
> > > > i_version for that anyway.
> > >
> > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > that's been the subject of recent debate.  At least as far as XFS is
> > > concerned, a file range can go from hole -> delayed allocation
> > > reservation -> unwritten extent -> (actual writeback) -> written
> > > extent.
> > > The dance became rather more complex when we added COW.  If any of
> > > that
> > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > file
> > > writeback activities to bump iversion to cause client invalidations,
> > > like (I think) Dave said.
> > >
> > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > written and delalloc extents; and an unwritten extent will report
> > > data
> > > for any pagecache it finds.
> > >
> >
> > READ_PLUS should never return anything different than a read() system
> > call would return for any given area. The way it reports sparse regions
> > vs. data regions is purely an RPC formatting convenience.
> >
> > The only point to note about NFS READ and READ_PLUS is that because the
> > client is forced to send multiple RPC calls if the user is trying to
> > read a region that is larger than the 'rsize' value, it is possible
> > that these READ/READ_PLUS calls may be processed out of order, and so
> > the result may end up looking different than if you had executed a
> > read() call for the full region directly on the server.
> > However each individual READ / READ_PLUS reply should look as if the
> > user had called read() on that rsize-sized section of the file.
> > > >
>
> Yeah, thinking about it some more, simply changing the block allocation
> is not something that should affect the ctime, so we probably don't want
> to bump i_version on it. It's an implicit change, IOW, not an explicit
> one.
>
> The fact that xfs might do that is unfortunate, but it's not the end of
> the world and it still would conform to the proposed definition for
> i_version. In practice, this sort of allocation change should come soon
> after the file was written, so one would hope that any damage due to the
> false i_version bump would be minimized.
>

That was exactly my point.

> It would be nice to teach it not to do that however. Maybe we can insert
> the NOIVER flag at a strategic place to avoid it?

Why would that be nice to avoid?
You did not specify any use case where incrementing i_version
on block mapping change matters in practice.
On the contrary, you said that NFS client writer sends COMMIT on close,
which should stabilize i_version for the next readers.

Given that we already have an xfs implementation that does increment
i_version on block mapping changes and it would be a pain to change
that or add a new user options, I don't see the point in discussing it further
unless there is a good incentive for avoiding i_version updates in those cases.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-28 13:25               ` Amir Goldstein
@ 2022-08-28 14:37                 ` Jeff Layton
  2022-08-28 16:53                   ` Amir Goldstein
  2022-08-29  5:48                   ` Dave Chinner
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-28 14:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Trond Myklebust, djwong, zohar, brauner, xiubli, neilb,
	linux-api, linux-xfs, dwysocha, david, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > > <amir73il@gmail.com> wrote:
> > > > > > > 
> > > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > > <jlayton@kernel.org> wrote:
> > > > > > > > 
> > > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > > value, which
> > > > > > > > is not desirable for any of the current consumers of
> > > > > > > > i_version. Doing so
> > > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > > measurement
> > > > > > > > activity in IMA.
> > > > > > > > 
> > > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > > the
> > > > > > > > transaction should not update the i_version. Set that value
> > > > > > > > in
> > > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > > 
> > > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > > ---
> > > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > > illustrate
> > > > > > > > the problem. I still think this approach should at least fix
> > > > > > > > the worst
> > > > > > > > problems with atime updates being counted. We can look to
> > > > > > > > carve out
> > > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > > 
> > > > > > > 
> > > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > > writeback
> > > > > > > of dirty pages. Anybody know about other cases?
> > > > > > > 
> > > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > > that there is
> > > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > > data writeback,
> > > > > > > because NFS server should be serving cold data most of the
> > > > > > > time.
> > > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > > be temporary.
> > > > > > > 
> > > > > > 
> > > > > > Unless there is an issue with a writer NFS client that
> > > > > > invalidates its
> > > > > > own attribute
> > > > > > caches on server data writeback?
> > > > > > 
> > > > > 
> > > > > The client just looks at the file attributes (of which i_version is
> > > > > but
> > > > > one), and if certain attributes have changed (mtime, ctime,
> > > > > i_version,
> > > > > etc...) then it invalidates its cache.
> > > > > 
> > > > > In the case of blocks map changes, could that mean a difference in
> > > > > the
> > > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > > before
> > > > > the change and a READ_PLUS after could give different results.
> > > > > Since
> > > > > that difference is observable by the client, I'd think we'd want to
> > > > > bump
> > > > > i_version for that anyway.
> > > > 
> > > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > > that's been the subject of recent debate.  At least as far as XFS is
> > > > concerned, a file range can go from hole -> delayed allocation
> > > > reservation -> unwritten extent -> (actual writeback) -> written
> > > > extent.
> > > > The dance became rather more complex when we added COW.  If any of
> > > > that
> > > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > > file
> > > > writeback activities to bump iversion to cause client invalidations,
> > > > like (I think) Dave said.
> > > > 
> > > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > > written and delalloc extents; and an unwritten extent will report
> > > > data
> > > > for any pagecache it finds.
> > > > 
> > > 
> > > READ_PLUS should never return anything different than a read() system
> > > call would return for any given area. The way it reports sparse regions
> > > vs. data regions is purely an RPC formatting convenience.
> > > 
> > > The only point to note about NFS READ and READ_PLUS is that because the
> > > client is forced to send multiple RPC calls if the user is trying to
> > > read a region that is larger than the 'rsize' value, it is possible
> > > that these READ/READ_PLUS calls may be processed out of order, and so
> > > the result may end up looking different than if you had executed a
> > > read() call for the full region directly on the server.
> > > However each individual READ / READ_PLUS reply should look as if the
> > > user had called read() on that rsize-sized section of the file.
> > > > > 
> > 
> > Yeah, thinking about it some more, simply changing the block allocation
> > is not something that should affect the ctime, so we probably don't want
> > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > one.
> > 
> > The fact that xfs might do that is unfortunate, but it's not the end of
> > the world and it still would conform to the proposed definition for
> > i_version. In practice, this sort of allocation change should come soon
> > after the file was written, so one would hope that any damage due to the
> > false i_version bump would be minimized.
> > 
> 
> That was exactly my point.
> 
> > It would be nice to teach it not to do that however. Maybe we can insert
> > the NOIVER flag at a strategic place to avoid it?
> 
> Why would that be nice to avoid?
> You did not specify any use case where incrementing i_version
> on block mapping change matters in practice.
> On the contrary, you said that NFS client writer sends COMMIT on close,
> which should stabilize i_version for the next readers.
> 
> Given that we already have an xfs implementation that does increment
> i_version on block mapping changes and it would be a pain to change
> that or add a new user options, I don't see the point in discussing it further
> unless there is a good incentive for avoiding i_version updates in those cases.
> 

Because the change to the block allocation doesn't represent an
"explicit" change to the inode. We will have bumped the ctime on the
original write (in update_time), but the follow-on changes that occur
due to that write needn't be counted as they aren't visible to the
client.

It's possible for a client to issue a read between the write and the
flush and get the interim value for i_version. Then, once the write
happens and the i_version gets bumped again, the client invalidates its
cache even though it needn't do so.

The race window ought to be relatively small, and this wouldn't result
in incorrect behavior that you'd notice (other than loss of
performance), but it's not ideal. We're doing more on-the-wire reads
than are necessary in this case.

It would be nice to have it not do that. If we end up taking this patch
to make it elide the i_version bumps on atime updates, we may be able to
set the the NOIVER flag in other cases as well, and avoid some of these
extra bumps.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-28 14:37                 ` Jeff Layton
@ 2022-08-28 16:53                   ` Amir Goldstein
  2022-08-29  5:48                   ` Dave Chinner
  1 sibling, 0 replies; 46+ messages in thread
From: Amir Goldstein @ 2022-08-28 16:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, djwong, zohar, brauner, xiubli, neilb,
	linux-api, linux-xfs, dwysocha, david, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Sun, Aug 28, 2022 at 5:37 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > On Sat, 2022-08-27 at 16:03 +0000, Trond Myklebust wrote:
> > > > On Sat, 2022-08-27 at 08:46 -0700, Darrick J. Wong wrote:
> > > > > On Sat, Aug 27, 2022 at 09:14:30AM -0400, Jeff Layton wrote:
> > > > > > On Sat, 2022-08-27 at 11:01 +0300, Amir Goldstein wrote:
> > > > > > > On Sat, Aug 27, 2022 at 10:26 AM Amir Goldstein
> > > > > > > <amir73il@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Aug 27, 2022 at 12:49 AM Jeff Layton
> > > > > > > > <jlayton@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > xfs will update the i_version when updating only the atime
> > > > > > > > > value, which
> > > > > > > > > is not desirable for any of the current consumers of
> > > > > > > > > i_version. Doing so
> > > > > > > > > leads to unnecessary cache invalidations on NFS and extra
> > > > > > > > > measurement
> > > > > > > > > activity in IMA.
> > > > > > > > >
> > > > > > > > > Add a new XFS_ILOG_NOIVER flag, and use that to indicate that
> > > > > > > > > the
> > > > > > > > > transaction should not update the i_version. Set that value
> > > > > > > > > in
> > > > > > > > > xfs_vn_update_time if we're only updating the atime.
> > > > > > > > >
> > > > > > > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > > > > > > Cc: NeilBrown <neilb@suse.de>
> > > > > > > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > > > > > > Cc: David Wysochanski <dwysocha@redhat.com>
> > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > > > > ---
> > > > > > > > >  fs/xfs/libxfs/xfs_log_format.h  |  2 +-
> > > > > > > > >  fs/xfs/libxfs/xfs_trans_inode.c |  2 +-
> > > > > > > > >  fs/xfs/xfs_iops.c               | 11 +++++++++--
> > > > > > > > >  3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > >
> > > > > > > > > Dave has NACK'ed this patch, but I'm sending it as a way to
> > > > > > > > > illustrate
> > > > > > > > > the problem. I still think this approach should at least fix
> > > > > > > > > the worst
> > > > > > > > > problems with atime updates being counted. We can look to
> > > > > > > > > carve out
> > > > > > > > > other "spurious" i_version updates as we identify them.
> > > > > > > > >
> > > > > > > >
> > > > > > > > AFAIK, "spurious" is only inode blocks map changes due to
> > > > > > > > writeback
> > > > > > > > of dirty pages. Anybody know about other cases?
> > > > > > > >
> > > > > > > > Regarding inode blocks map changes, first of all, I don't think
> > > > > > > > that there is
> > > > > > > > any practical loss from invalidating NFS client cache on dirty
> > > > > > > > data writeback,
> > > > > > > > because NFS server should be serving cold data most of the
> > > > > > > > time.
> > > > > > > > If there are a few unneeded cache invalidations they would only
> > > > > > > > be temporary.
> > > > > > > >
> > > > > > >
> > > > > > > Unless there is an issue with a writer NFS client that
> > > > > > > invalidates its
> > > > > > > own attribute
> > > > > > > caches on server data writeback?
> > > > > > >
> > > > > >
> > > > > > The client just looks at the file attributes (of which i_version is
> > > > > > but
> > > > > > one), and if certain attributes have changed (mtime, ctime,
> > > > > > i_version,
> > > > > > etc...) then it invalidates its cache.
> > > > > >
> > > > > > In the case of blocks map changes, could that mean a difference in
> > > > > > the
> > > > > > observable sparse regions of the file? If so, then a READ_PLUS
> > > > > > before
> > > > > > the change and a READ_PLUS after could give different results.
> > > > > > Since
> > > > > > that difference is observable by the client, I'd think we'd want to
> > > > > > bump
> > > > > > i_version for that anyway.
> > > > >
> > > > > How /is/ READ_PLUS supposed to detect sparse regions, anyway?  I know
> > > > > that's been the subject of recent debate.  At least as far as XFS is
> > > > > concerned, a file range can go from hole -> delayed allocation
> > > > > reservation -> unwritten extent -> (actual writeback) -> written
> > > > > extent.
> > > > > The dance became rather more complex when we added COW.  If any of
> > > > > that
> > > > > will make a difference for READ_PLUS, then yes, I think you'd want
> > > > > file
> > > > > writeback activities to bump iversion to cause client invalidations,
> > > > > like (I think) Dave said.
> > > > >
> > > > > The fs/iomap/ implementation of SEEK_DATA/SEEK_HOLE reports data for
> > > > > written and delalloc extents; and an unwritten extent will report
> > > > > data
> > > > > for any pagecache it finds.
> > > > >
> > > >
> > > > READ_PLUS should never return anything different than a read() system
> > > > call would return for any given area. The way it reports sparse regions
> > > > vs. data regions is purely an RPC formatting convenience.
> > > >
> > > > The only point to note about NFS READ and READ_PLUS is that because the
> > > > client is forced to send multiple RPC calls if the user is trying to
> > > > read a region that is larger than the 'rsize' value, it is possible
> > > > that these READ/READ_PLUS calls may be processed out of order, and so
> > > > the result may end up looking different than if you had executed a
> > > > read() call for the full region directly on the server.
> > > > However each individual READ / READ_PLUS reply should look as if the
> > > > user had called read() on that rsize-sized section of the file.
> > > > > >
> > >
> > > Yeah, thinking about it some more, simply changing the block allocation
> > > is not something that should affect the ctime, so we probably don't want
> > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > one.
> > >
> > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > the world and it still would conform to the proposed definition for
> > > i_version. In practice, this sort of allocation change should come soon
> > > after the file was written, so one would hope that any damage due to the
> > > false i_version bump would be minimized.
> > >
> >
> > That was exactly my point.
> >
> > > It would be nice to teach it not to do that however. Maybe we can insert
> > > the NOIVER flag at a strategic place to avoid it?
> >
> > Why would that be nice to avoid?
> > You did not specify any use case where incrementing i_version
> > on block mapping change matters in practice.
> > On the contrary, you said that NFS client writer sends COMMIT on close,
> > which should stabilize i_version for the next readers.
> >
> > Given that we already have an xfs implementation that does increment
> > i_version on block mapping changes and it would be a pain to change
> > that or add a new user options, I don't see the point in discussing it further
> > unless there is a good incentive for avoiding i_version updates in those cases.
> >
>
> Because the change to the block allocation doesn't represent an
> "explicit" change to the inode. We will have bumped the ctime on the
> original write (in update_time), but the follow-on changes that occur
> due to that write needn't be counted as they aren't visible to the
> client.
>
> It's possible for a client to issue a read between the write and the
> flush and get the interim value for i_version. Then, once the write
> happens and the i_version gets bumped again, the client invalidates its
> cache even though it needn't do so.
>
> The race window ought to be relatively small, and this wouldn't result
> in incorrect behavior that you'd notice (other than loss of
> performance), but it's not ideal. We're doing more on-the-wire reads
> than are necessary in this case.
>
> It would be nice to have it not do that. If we end up taking this patch
> to make it elide the i_version bumps on atime updates, we may be able to
> set the the NOIVER flag in other cases as well, and avoid some of these
> extra bumps.

Ok. The use case is that a client's writes on an open file can end up
invalidating the same client's read cache on the same file.
Sounds like implementing write delegations would have been better
in this case than relying on i_version, but I do not know how hard that is
and how many clients would request write delegations.

Anyway, if you ever send a proposal to change the semantics of
xfs i_version bumping on writeback, please specify the target use case
and how likely it is for real workloads to suffer from it.

Another question to the NFS experts - what happens when the
server crashes before dirty data hits persistent storage but the client
has already cached the dirty data.

After the crash, the filesystem can come up with the same i_version
that the NFS client has cached, but without the dirty data.
Does the spec address this scenario?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-27 15:46         ` Darrick J. Wong
  2022-08-27 16:03           ` Trond Myklebust
@ 2022-08-28 17:30           ` Amir Goldstein
  1 sibling, 0 replies; 46+ messages in thread
From: Amir Goldstein @ 2022-08-28 17:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Layton, Theodore Tso, Andreas Dilger, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, Mimi Zohar, xiubli,
	Chuck Lever, Lukas Czerner, Jan Kara, Christian Brauner,
	Linux API, Linux Btrfs, linux-fsdevel, linux-kernel, Ext4,
	Linux NFS Mailing List, linux-xfs, David Wysochanski, ceph-devel

> > > >
> > > > Forensics and other applications that care about atime updates can and
> > > > should check atime and don't need i_version to know that it was changed.
> > > > The reliability of atime as an audit tool has dropped considerably since
> > > > the default in relatime.
>
> I've been waiting for Amir to appear in this discussion -- ISTR that a
> few years ago you were wanting the ability to scan a filesystem to look
> for files that have changed since a given point.  If XFS exported its
> di_changecount file attribute (as it currently behaves) via BULKSTAT,
> you'd have the ability to do that, so long as your application could
> persist bulkstat data and compare.
>

It's true that exporting i_version via BULKSTAT could be useful
to some backup/sync applications.

For my case, I was interested in something a bit different -
finding all the changes in a very large fs tree - or IOW finding if
anything has changed inside a large tree since a point in time.
So not sure if i_version would help for that use case.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-28 14:37                 ` Jeff Layton
  2022-08-28 16:53                   ` Amir Goldstein
@ 2022-08-29  5:48                   ` Dave Chinner
  2022-08-29 10:33                     ` Jeff Layton
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2022-08-29  5:48 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Trond Myklebust, djwong, zohar, brauner, xiubli,
	neilb, linux-api, linux-xfs, dwysocha, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote:
> On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > Yeah, thinking about it some more, simply changing the block allocation
> > > is not something that should affect the ctime, so we probably don't want
> > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > one.
> > > 
> > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > the world and it still would conform to the proposed definition for
> > > i_version. In practice, this sort of allocation change should come soon
> > > after the file was written, so one would hope that any damage due to the
> > > false i_version bump would be minimized.
> > > 
> > 
> > That was exactly my point.
> > 
> > > It would be nice to teach it not to do that however. Maybe we can insert
> > > the NOIVER flag at a strategic place to avoid it?

No, absolutely not.

I've already explained this: The XFS *disk format specification*
says that di_changecount is bumped for every change that is made to
the inode.

Applications that are written from this specification expect the on
disk format for a XFS given filesystem feature to remain the same
until it is either deprecated and removed or we add feature flags to
indicate it has different behaviour.  We can't just change the
behaviour at a whim.

And that's ignoring the fact that randomly spewing NOIVER
into transactions that modify inode metadata is a nasty hack - it
is not desirable from a design or documentation POV, nor is it
maintainable.

> > Why would that be nice to avoid?
> > You did not specify any use case where incrementing i_version
> > on block mapping change matters in practice.
> > On the contrary, you said that NFS client writer sends COMMIT on close,
> > which should stabilize i_version for the next readers.
> > 
> > Given that we already have an xfs implementation that does increment
> > i_version on block mapping changes and it would be a pain to change
> > that or add a new user options, I don't see the point in discussing it further
> > unless there is a good incentive for avoiding i_version updates in those cases.
> > 
> 
> Because the change to the block allocation doesn't represent an
> "explicit" change to the inode. We will have bumped the ctime on the
> original write (in update_time), but the follow-on changes that occur
> due to that write needn't be counted as they aren't visible to the
> client.
> 
> It's possible for a client to issue a read between the write and the
> flush and get the interim value for i_version. Then, once the write
> happens and the i_version gets bumped again, the client invalidates its
> cache even though it needn't do so.
> 
> The race window ought to be relatively small, and this wouldn't result
> in incorrect behavior that you'd notice (other than loss of
> performance), but it's not ideal. We're doing more on-the-wire reads
> than are necessary in this case.
> 
> It would be nice to have it not do that. If we end up taking this patch
> to make it elide the i_version bumps on atime updates, we may be able to
> set the the NOIVER flag in other cases as well, and avoid some of these
> extra bumps.


<sigh>

Please don't make me repeat myself for the third time.

Once we have decided on a solid, unchanging definition for the
*statx user API variable*, we'll implement a new on-disk field that
provides this information.  We will document it in the on-disk
specification as "this is how di_iversion behaves" so that it is
clear to everyone parsing the on-disk format or writing their own
XFS driver how to implement it and when to expect it to
change.

Then we can add a filesystem and inode feature flags that say "inode
has new iversion" and we use that to populate the kernel iversion
instead of di_changecount. We keep di_changecount exactly the way it
is now for the applications and use cases we already have for that
specific behaviour. If the kernel and/or filesystem don't support
the new di_iversion field, then we'll use di_changecount as it
currently exists for the kernel iversion code.

Keep in mind that we've been doing dynamic inode format updates in
XFS for a couple of decades - users don't even have to be aware that
they need to perform format upgrades because often they just happen
whenever an inode is accessed. IOWs, just because we have to change
the on-disk format to support this new iversion definition, it
doesn't mean users have to reformat filesystems before the new
feature can be used.

Hence, over time, as distros update kernels, the XFS iversion
behaviour will change automagically as we update inodes in existing
filesystems as they are accessed to add and then use the new
di_iversion field for the VFS change attribute field instead of the
di_changecount field...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-26 21:46 ` [PATCH v3 1/7] iversion: update comments with info about atime updates Jeff Layton
@ 2022-08-29  7:56   ` Dave Chinner
  2022-08-29 10:39     ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2022-08-29  7:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but we're now proposing to expose it to userland via
> statx. This means that we need a clear, consistent definition of
> what it means and when it should change.
> 
> Update the comments in iversion.h to describe how a conformant
> i_version implementation is expected to behave. This definition
> suits the current users of i_version (NFSv4 and IMA), but is
> loose enough to allow for a wide range of possible implementations.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/iversion.h | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..45e93e1b4edc 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,19 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear different to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). The version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX mandates that the ctime must change due
> + * to an operation, then the i_version counter must be incremented as well.
> + *
> + * A conformant implementation is allowed to increment the counter in other
> + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> + * whether caches are up to date. Spurious increments can cause false cache
> + * invalidations.

"not optimal", but never-the-less allowed - that's "unspecified
behaviour" if I've ever seen it. How is userspace supposed to
know/deal with this?

Indeed, this loophole clause doesn't exist in the man pages that
define what statx.stx_ino_version means. The man pages explicitly
define that stx_ino_version only ever changes when stx_ctime
changes.

IOWs, the behaviour userspace developers are going to expect *does
not include* stx_ino_version changing it more often than ctime is
changed. Hence a kernel iversion implementation that bumps the
counter more often than ctime changes *is not conformant with the
statx version counter specification*. IOWs, we can't export such
behaviour to userspace *ever* - it is a non-conformant
implementation.

Hence I think anything that bumps iversion outside the bounds of the
statx definition should be declared as such:

"Non-conformant iversion implementations:
	- MUST NOT be exported by statx() to userspace
	- MUST be -tolerated- by kernel internal applications that
	  use iversion for their own purposes."

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-29  5:48                   ` Dave Chinner
@ 2022-08-29 10:33                     ` Jeff Layton
  2022-08-30  0:08                       ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-29 10:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Trond Myklebust, djwong, zohar, brauner, xiubli,
	neilb, linux-api, linux-xfs, dwysocha, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> On Sun, Aug 28, 2022 at 10:37:37AM -0400, Jeff Layton wrote:
> > On Sun, 2022-08-28 at 16:25 +0300, Amir Goldstein wrote:
> > > On Sat, Aug 27, 2022 at 7:10 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > Yeah, thinking about it some more, simply changing the block allocation
> > > > is not something that should affect the ctime, so we probably don't want
> > > > to bump i_version on it. It's an implicit change, IOW, not an explicit
> > > > one.
> > > > 
> > > > The fact that xfs might do that is unfortunate, but it's not the end of
> > > > the world and it still would conform to the proposed definition for
> > > > i_version. In practice, this sort of allocation change should come soon
> > > > after the file was written, so one would hope that any damage due to the
> > > > false i_version bump would be minimized.
> > > > 
> > > 
> > > That was exactly my point.
> > > 
> > > > It would be nice to teach it not to do that however. Maybe we can insert
> > > > the NOIVER flag at a strategic place to avoid it?
> 
> No, absolutely not.
> 
> I've already explained this: The XFS *disk format specification*
> says that di_changecount is bumped for every change that is made to
> the inode.
> 
> Applications that are written from this specification expect the on
> disk format for a XFS given filesystem feature to remain the same
> until it is either deprecated and removed or we add feature flags to
> indicate it has different behaviour.  We can't just change the
> behaviour at a whim.
> 
> And that's ignoring the fact that randomly spewing NOIVER
> into transactions that modify inode metadata is a nasty hack - it
> is not desirable from a design or documentation POV, nor is it
> maintainable.
> 
> > > Why would that be nice to avoid?
> > > You did not specify any use case where incrementing i_version
> > > on block mapping change matters in practice.
> > > On the contrary, you said that NFS client writer sends COMMIT on close,
> > > which should stabilize i_version for the next readers.
> > > 
> > > Given that we already have an xfs implementation that does increment
> > > i_version on block mapping changes and it would be a pain to change
> > > that or add a new user options, I don't see the point in discussing it further
> > > unless there is a good incentive for avoiding i_version updates in those cases.
> > > 
> > 
> > Because the change to the block allocation doesn't represent an
> > "explicit" change to the inode. We will have bumped the ctime on the
> > original write (in update_time), but the follow-on changes that occur
> > due to that write needn't be counted as they aren't visible to the
> > client.
> > 
> > It's possible for a client to issue a read between the write and the
> > flush and get the interim value for i_version. Then, once the write
> > happens and the i_version gets bumped again, the client invalidates its
> > cache even though it needn't do so.
> > 
> > The race window ought to be relatively small, and this wouldn't result
> > in incorrect behavior that you'd notice (other than loss of
> > performance), but it's not ideal. We're doing more on-the-wire reads
> > than are necessary in this case.
> > 
> > It would be nice to have it not do that. If we end up taking this patch
> > to make it elide the i_version bumps on atime updates, we may be able to
> > set the the NOIVER flag in other cases as well, and avoid some of these
> > extra bumps.
> 
> 
> <sigh>
> 
> Please don't make me repeat myself for the third time.
> 
> Once we have decided on a solid, unchanging definition for the
> *statx user API variable*, we'll implement a new on-disk field that
> provides this information.  We will document it in the on-disk
> specification as "this is how di_iversion behaves" so that it is
> clear to everyone parsing the on-disk format or writing their own
> XFS driver how to implement it and when to expect it to
> change.
> 
> Then we can add a filesystem and inode feature flags that say "inode
> has new iversion" and we use that to populate the kernel iversion
> instead of di_changecount. We keep di_changecount exactly the way it
> is now for the applications and use cases we already have for that
> specific behaviour. If the kernel and/or filesystem don't support
> the new di_iversion field, then we'll use di_changecount as it
> currently exists for the kernel iversion code.
> 

Aside from NFS and IMA, what applications are dependent on the current
definition and how do they rely on i_version today?

> Keep in mind that we've been doing dynamic inode format updates in
> XFS for a couple of decades - users don't even have to be aware that
> they need to perform format upgrades because often they just happen
> whenever an inode is accessed. IOWs, just because we have to change
> the on-disk format to support this new iversion definition, it
> doesn't mean users have to reformat filesystems before the new
> feature can be used.
> 
> Hence, over time, as distros update kernels, the XFS iversion
> behaviour will change automagically as we update inodes in existing
> filesystems as they are accessed to add and then use the new
> di_iversion field for the VFS change attribute field instead of the
> di_changecount field...
> 

If you want to create a whole new on-disk field for this, then that's
your prerogative, but before you do that, I'd like to better understand
why and how the constraints on this field changed.

The original log message from the commit that added a change counter
(below) stated that you were adding it for network filesystems like NFS.
When did this change and why?

    commit dc037ad7d24f3711e431a45c053b5d425995e9e4
    Author: Dave Chinner <dchinner@redhat.com>
    Date:   Thu Jun 27 16:04:59 2013 +1000

        xfs: implement inode change count

        For CRC enabled filesystems, add support for the monotonic inode
        version change counter that is needed by protocols like NFSv4 for
        determining if the inode has changed in any way at all between two
        unrelated operations on the inode.

        This bumps the change count the first time an inode is dirtied in a
        transaction. Since all modifications to the inode are logged, this
        will catch all changes that are made to the inode, including
        timestamp updates that occur during data writes.

        Signed-off-by: Dave Chinner <dchinner@redhat.com>
        Reviewed-by: Mark Tinguely <tinguely@sgi.com>
        Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
        Signed-off-by: Ben Myers <bpm@sgi.com>

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-29  7:56   ` Dave Chinner
@ 2022-08-29 10:39     ` Jeff Layton
  2022-08-29 22:58       ` NeilBrown
  2022-08-30  1:04       ` Dave Chinner
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-29 10:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but we're now proposing to expose it to userland via
> > statx. This means that we need a clear, consistent definition of
> > what it means and when it should change.
> > 
> > Update the comments in iversion.h to describe how a conformant
> > i_version implementation is expected to behave. This definition
> > suits the current users of i_version (NFSv4 and IMA), but is
> > loose enough to allow for a wide range of possible implementations.
> > 
> > Cc: Colin Walters <walters@verbum.org>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  include/linux/iversion.h | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..45e93e1b4edc 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,19 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear different to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). The version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > + * to an operation, then the i_version counter must be incremented as well.
> > + *
> > + * A conformant implementation is allowed to increment the counter in other
> > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > + * whether caches are up to date. Spurious increments can cause false cache
> > + * invalidations.
> 
> "not optimal", but never-the-less allowed - that's "unspecified
> behaviour" if I've ever seen it. How is userspace supposed to
> know/deal with this?
> 
> Indeed, this loophole clause doesn't exist in the man pages that
> define what statx.stx_ino_version means. The man pages explicitly
> define that stx_ino_version only ever changes when stx_ctime
> changes.
> 

We can fix the manpage to make this more clear.

> IOWs, the behaviour userspace developers are going to expect *does
> not include* stx_ino_version changing it more often than ctime is
> changed. Hence a kernel iversion implementation that bumps the
> counter more often than ctime changes *is not conformant with the
> statx version counter specification*. IOWs, we can't export such
> behaviour to userspace *ever* - it is a non-conformant
> implementation.
> 

Nonsense. The statx version counter specification is *whatever we decide
to make it*. If we define it to allow for spurious version bumps, then
these implementations would be conformant.

Given that you can't tell what or how much changed in the inode whenever
the value changes, allowing it to be bumped on non-observable changes is
ok and the counter is still useful. When you see it change you need to
go stat/read/getxattr etc, to see what actually happened anyway.

Most applications won't be interested in every possible explicit change
that can happen to an inode. It's likely these applications would check
the parts of the inode they're interested in, and then go back to
waiting for the next bump if the change wasn't significant to them.


> Hence I think anything that bumps iversion outside the bounds of the
> statx definition should be declared as such:
> 
> "Non-conformant iversion implementations:
> 	- MUST NOT be exported by statx() to userspace
> 	- MUST be -tolerated- by kernel internal applications that
> 	  use iversion for their own purposes."
> 

I think this is more strict than is needed. An implementation that bumps
this value more often than is necessary is still useful. It's not
_ideal_, but it still meets the needs of NFSv4, IMA and other potential
users of it. After all, this is basically the definition of i_version
today and it's still useful, even if atime update i_version bumps are
currently harmful for performance.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 3/7] ext4: unconditionally enable the i_version counter
  2022-08-26 21:46 ` [PATCH v3 3/7] ext4: unconditionally enable the i_version counter Jeff Layton
@ 2022-08-29 14:51   ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2022-08-29 14:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Benjamin Coddington, Christoph Hellwig

On Fri 26-08-22 17:46:59, Jeff Layton wrote:
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).

I don't think "log flush on every change" was really required. But "logging
an inode on every change" was required and even that can get expensive.

> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
> 
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well since it's due for deprecation anyway.
> 
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Benjamin Coddington <bcodding@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Otherwise the change looks good to me so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c       |  2 +-
>  fs/ext4/ioctl.c       | 12 ++++--------
>  fs/ext4/move_extent.c |  6 ++----
>  fs/ext4/super.c       | 13 ++++---------
>  fs/ext4/xattr.c       |  3 +--
>  5 files changed, 12 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index aa37bce4c541..6ef37269e7c0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5342,7 +5342,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>  	int error, rc = 0;
>  	int orphan = 0;
>  	const unsigned int ia_valid = attr->ia_valid;
> -	bool inc_ivers = IS_I_VERSION(inode);
> +	bool inc_ivers = true;
>  
>  	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>  		return -EIO;
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 60e77ae9342d..ad3a294a88eb 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -452,8 +452,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
>  	swap_inode_data(inode, inode_bl);
>  
>  	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
> -	if (IS_I_VERSION(inode))
> -		inode_inc_iversion(inode);
> +	inode_inc_iversion(inode);
>  
>  	inode->i_generation = prandom_u32();
>  	inode_bl->i_generation = prandom_u32();
> @@ -667,8 +666,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
>  	ext4_set_inode_flags(inode, false);
>  
>  	inode->i_ctime = current_time(inode);
> -	if (IS_I_VERSION(inode))
> -		inode_inc_iversion(inode);
> +	inode_inc_iversion(inode);
>  
>  	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
>  flags_err:
> @@ -779,8 +777,7 @@ static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
>  
>  	EXT4_I(inode)->i_projid = kprojid;
>  	inode->i_ctime = current_time(inode);
> -	if (IS_I_VERSION(inode))
> -		inode_inc_iversion(inode);
> +	inode_inc_iversion(inode);
>  out_dirty:
>  	rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
>  	if (!err)
> @@ -1263,8 +1260,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		err = ext4_reserve_inode_write(handle, inode, &iloc);
>  		if (err == 0) {
>  			inode->i_ctime = current_time(inode);
> -			if (IS_I_VERSION(inode))
> -				inode_inc_iversion(inode);
> +			inode_inc_iversion(inode);
>  			inode->i_generation = generation;
>  			err = ext4_mark_iloc_dirty(handle, inode, &iloc);
>  		}
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index d73ab3153218..285700b00d38 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -687,10 +687,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>  
>  		orig_inode->i_ctime = current_time(orig_inode);
>  		donor_inode->i_ctime = current_time(donor_inode);
> -		if (IS_I_VERSION(orig_inode))
> -			inode_inc_iversion(orig_inode);
> -		if (IS_I_VERSION(donor_inode))
> -			inode_inc_iversion(donor_inode);
> +		inode_inc_iversion(orig_inode);
> +		inode_inc_iversion(donor_inode);
>  	}
>  	*moved_len = o_start - orig_blk;
>  	if (*moved_len > len)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..e7cf5361245a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1585,7 +1585,7 @@ enum {
>  	Opt_inlinecrypt,
>  	Opt_usrjquota, Opt_grpjquota, Opt_quota,
>  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> -	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> +	Opt_usrquota, Opt_grpquota, Opt_prjquota,
>  	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
>  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
>  	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	fsparam_flag	("barrier",		Opt_barrier),
>  	fsparam_u32	("barrier",		Opt_barrier),
>  	fsparam_flag	("nobarrier",		Opt_nobarrier),
> -	fsparam_flag	("i_version",		Opt_i_version),
>  	fsparam_flag	("dax",			Opt_dax),
>  	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
>  	fsparam_u32	("stripe",		Opt_stripe),
> @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  	case Opt_abort:
>  		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
>  		return 0;
> -	case Opt_i_version:
> -		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> -		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> -		ctx_set_flags(ctx, SB_I_VERSION);
> -		return 0;
>  	case Opt_inlinecrypt:
>  #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
>  		ctx_set_flags(ctx, SB_INLINECRYPT);
> @@ -2970,8 +2964,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
>  	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
>  		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> -	if (sb->s_flags & SB_I_VERSION)
> -		SEQ_OPTS_PUTS("i_version");
>  	if (nodefs || sbi->s_stripe)
>  		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
>  	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> @@ -4640,6 +4632,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
>  		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
>  
> +	/* i_version is always enabled now */
> +	sb->s_flags |= SB_I_VERSION;
> +
>  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
>  	    (ext4_has_compat_features(sb) ||
>  	     ext4_has_ro_compat_features(sb) ||
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index e975442e4ab2..36d6ba7190b6 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -2412,8 +2412,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>  	if (!error) {
>  		ext4_xattr_update_super_block(handle, inode->i_sb);
>  		inode->i_ctime = current_time(inode);
> -		if (IS_I_VERSION(inode))
> -			inode_inc_iversion(inode);
> +		inode_inc_iversion(inode);
>  		if (!value)
>  			no_expand = 0;
>  		error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
> -- 
> 2.37.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-29 10:39     ` Jeff Layton
@ 2022-08-29 22:58       ` NeilBrown
  2022-08-30 11:40         ` Jeff Layton
  2022-08-30  1:04       ` Dave Chinner
  1 sibling, 1 reply; 46+ messages in thread
From: NeilBrown @ 2022-08-29 22:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Mon, 29 Aug 2022, Jeff Layton wrote:
> On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but we're now proposing to expose it to userland via
> > > statx. This means that we need a clear, consistent definition of
> > > what it means and when it should change.
> > > 
> > > Update the comments in iversion.h to describe how a conformant
> > > i_version implementation is expected to behave. This definition
> > > suits the current users of i_version (NFSv4 and IMA), but is
> > > loose enough to allow for a wide range of possible implementations.
> > > 
> > > Cc: Colin Walters <walters@verbum.org>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  include/linux/iversion.h | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,19 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). The version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > + * to an operation, then the i_version counter must be incremented as well.
> > > + *
> > > + * A conformant implementation is allowed to increment the counter in other
> > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > + * whether caches are up to date. Spurious increments can cause false cache
> > > + * invalidations.
> > 
> > "not optimal", but never-the-less allowed - that's "unspecified
> > behaviour" if I've ever seen it. How is userspace supposed to
> > know/deal with this?
> > 
> > Indeed, this loophole clause doesn't exist in the man pages that
> > define what statx.stx_ino_version means. The man pages explicitly
> > define that stx_ino_version only ever changes when stx_ctime
> > changes.
> > 
> 
> We can fix the manpage to make this more clear.
> 
> > IOWs, the behaviour userspace developers are going to expect *does
> > not include* stx_ino_version changing it more often than ctime is
> > changed. Hence a kernel iversion implementation that bumps the
> > counter more often than ctime changes *is not conformant with the
> > statx version counter specification*. IOWs, we can't export such
> > behaviour to userspace *ever* - it is a non-conformant
> > implementation.
> > 
> 
> Nonsense. The statx version counter specification is *whatever we decide
> to make it*. If we define it to allow for spurious version bumps, then
> these implementations would be conformant.
> 
> Given that you can't tell what or how much changed in the inode whenever
> the value changes, allowing it to be bumped on non-observable changes is
> ok and the counter is still useful. When you see it change you need to
> go stat/read/getxattr etc, to see what actually happened anyway.
> 
> Most applications won't be interested in every possible explicit change
> that can happen to an inode. It's likely these applications would check
> the parts of the inode they're interested in, and then go back to
> waiting for the next bump if the change wasn't significant to them.
> 
> 
> > Hence I think anything that bumps iversion outside the bounds of the
> > statx definition should be declared as such:
> > 
> > "Non-conformant iversion implementations:
> > 	- MUST NOT be exported by statx() to userspace
> > 	- MUST be -tolerated- by kernel internal applications that
> > 	  use iversion for their own purposes."
> > 
> 
> I think this is more strict than is needed. An implementation that bumps
> this value more often than is necessary is still useful. It's not
> _ideal_, but it still meets the needs of NFSv4, IMA and other potential
> users of it. After all, this is basically the definition of i_version
> today and it's still useful, even if atime update i_version bumps are
> currently harmful for performance.

Why do you want to let it be OK?  Who is hurt by it being "more strict
than needed"?  There is an obvious cost in not being strict as an
implementation can be compliant but completely useless (increment every
nanosecond).  So there needs to be a clear benefit to balance this.  Who
benefits by not being strict?

Also: Your spec doesn't say it must increase, only it must be different.
So would as hash of all data and metadata be allowed (sysfs might be
able to provide that, but probably wouldn't bother).

Also: if stray updates are still conformant, can occasional repeated
values be still conformant?  I would like for a high-precision ctime
timestamp to be acceptable, but as time can go backwards it is currently
not conformant (even though the xfs iversion which is less useful is
actually conformant).

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-29 10:33                     ` Jeff Layton
@ 2022-08-30  0:08                       ` Dave Chinner
  2022-08-30 11:20                         ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2022-08-30  0:08 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Trond Myklebust, djwong, zohar, brauner, xiubli,
	neilb, linux-api, linux-xfs, dwysocha, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote:
> On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> > > 
> > > The race window ought to be relatively small, and this wouldn't result
> > > in incorrect behavior that you'd notice (other than loss of
> > > performance), but it's not ideal. We're doing more on-the-wire reads
> > > than are necessary in this case.
> > > 
> > > It would be nice to have it not do that. If we end up taking this patch
> > > to make it elide the i_version bumps on atime updates, we may be able to
> > > set the the NOIVER flag in other cases as well, and avoid some of these
> > > extra bumps.
> > 
> > 
> > <sigh>
> > 
> > Please don't make me repeat myself for the third time.
> > 
> > Once we have decided on a solid, unchanging definition for the
> > *statx user API variable*, we'll implement a new on-disk field that
> > provides this information.  We will document it in the on-disk
> > specification as "this is how di_iversion behaves" so that it is
> > clear to everyone parsing the on-disk format or writing their own
> > XFS driver how to implement it and when to expect it to
> > change.
> > 
> > Then we can add a filesystem and inode feature flags that say "inode
> > has new iversion" and we use that to populate the kernel iversion
> > instead of di_changecount. We keep di_changecount exactly the way it
> > is now for the applications and use cases we already have for that
> > specific behaviour. If the kernel and/or filesystem don't support
> > the new di_iversion field, then we'll use di_changecount as it
> > currently exists for the kernel iversion code.
> > 
> 
> Aside from NFS and IMA, what applications are dependent on the current
> definition and how do they rely on i_version today?

I've answered this multiple times already: the di_changecount
behaviour is defined in the on-disk specification and hence we
*cannot change the behaviour* without changing the on-disk format
specification.

Apart from the forensics aspect of the change counter (which nobody
but us XFS developers seem to understand just how damn important
this is), there are *many* third party applications that parse the
XFS on-disk format directly. This:

https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1

Shows grub2, libparted, syslinux, partclone and fsarchiver as
knowing about XFS on-disk superblock flags that tell them what
format the directory structure is in. That alone is enough to
indicate they parse on-disk inodes directly, and hence may expect
di_changecount to have specific meaning and use it to detect
unexpected changes to files/directories they care about.

If I go looking for XFS_SB_MAGIC, I find things like libblkid,
klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the
on-disk superblocks directly from the block device, too. They also
rely directly on XFS developers ensuring there are no silent
incomaptible changes to the on disk format.

I also know of many other utilities that people and companies have
written that parse the on disk format directly from userspace. The
functions these perform include low level storage management tools,
copying and managing disk images (e.g. offline configuration for
cluster deployments), data recovery tools that scrape all the data
out of broken filesystems, etc.

These applications are reliant on the guarantee we provide that the
on-disk format will not silently change and that behaviour/structure
can always easily be discovered by feature flags in the superblock
and/or inodes.

IOWs, just because there aren't obvious "traditional" application on
top of the kernel filesystem that consumes the in-memory kernel
iversion field, it does not mean that the defined behaviour of the
on-disk di_changecount field is not used or relied on by other tools
that work directly on the on-disk format.

You might be right that NFS doesn't care about this, but the point
remains that NFS does not control the XFS on-disk format, nor does
the fact that what NFS wants from the change attribute has changed
over time override the fact that maintaining XFS on-disk format
compatibility is the responsibility of XFS developers. We're willing
to change the on-disk format to support whatever the new definition
of the statx change attribute ends up being, and that should be the
end of the discussion.

> > Keep in mind that we've been doing dynamic inode format updates in
> > XFS for a couple of decades - users don't even have to be aware that
> > they need to perform format upgrades because often they just happen
> > whenever an inode is accessed. IOWs, just because we have to change
> > the on-disk format to support this new iversion definition, it
> > doesn't mean users have to reformat filesystems before the new
> > feature can be used.
> > 
> > Hence, over time, as distros update kernels, the XFS iversion
> > behaviour will change automagically as we update inodes in existing
> > filesystems as they are accessed to add and then use the new
> > di_iversion field for the VFS change attribute field instead of the
> > di_changecount field...
> > 
> 
> If you want to create a whole new on-disk field for this, then that's
> your prerogative, but before you do that, I'd like to better understand
> why and how the constraints on this field changed.
> 
> The original log message from the commit that added a change counter
> (below) stated that you were adding it for network filesystems like NFS.
> When did this change and why?

It never changed. I'll repeat what I've already explained twice
before:

https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/
https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/

tl; dr: NFS requirements were just one of *many* we had at the time
for an atomic persistent change counter.

The fact is that NFS users are just going to have to put up with
random cache invalidations on XFS for a while longer. Nobody noticed
this and/or cared about this enough to raise it as an issue for the
past decade, so waiting another few months for upstream XFS to
change to a different on-disk format for the NFS/statx change
attribute isn't a big deal.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-29 10:39     ` Jeff Layton
  2022-08-29 22:58       ` NeilBrown
@ 2022-08-30  1:04       ` Dave Chinner
  2022-08-30 12:38         ` Jeff Layton
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2022-08-30  1:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote:
> On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but we're now proposing to expose it to userland via
> > > statx. This means that we need a clear, consistent definition of
> > > what it means and when it should change.
> > > 
> > > Update the comments in iversion.h to describe how a conformant
> > > i_version implementation is expected to behave. This definition
> > > suits the current users of i_version (NFSv4 and IMA), but is
> > > loose enough to allow for a wide range of possible implementations.
> > > 
> > > Cc: Colin Walters <walters@verbum.org>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  include/linux/iversion.h | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,19 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). The version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > + * to an operation, then the i_version counter must be incremented as well.
> > > + *
> > > + * A conformant implementation is allowed to increment the counter in other
> > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > + * whether caches are up to date. Spurious increments can cause false cache
> > > + * invalidations.
> > 
> > "not optimal", but never-the-less allowed - that's "unspecified
> > behaviour" if I've ever seen it. How is userspace supposed to
> > know/deal with this?
> > 
> > Indeed, this loophole clause doesn't exist in the man pages that
> > define what statx.stx_ino_version means. The man pages explicitly
> > define that stx_ino_version only ever changes when stx_ctime
> > changes.
> > 
> 
> We can fix the manpage to make this more clear.
> 
> > IOWs, the behaviour userspace developers are going to expect *does
> > not include* stx_ino_version changing it more often than ctime is
> > changed. Hence a kernel iversion implementation that bumps the
> > counter more often than ctime changes *is not conformant with the
> > statx version counter specification*. IOWs, we can't export such
> > behaviour to userspace *ever* - it is a non-conformant
> > implementation.
> > 
> 
> Nonsense. The statx version counter specification is *whatever we decide
> to make it*.

Yes, but...

> If we define it to allow for spurious version bumps, then
> these implementations would be conformant.

... that's _not how you defined stx_ino_version to behave_!

> Given that you can't tell what or how much changed in the inode whenever
> the value changes, allowing it to be bumped on non-observable changes is
> ok and the counter is still useful. When you see it change you need to
> go stat/read/getxattr etc, to see what actually happened anyway.

IDGI. If this is acceptible, then you're forcing userspace into
"store and filter" implementations as the only viable method of
using the change notification usefully.

That means atime is just another attribute in the "store and
filter" algorithm, so if this is how we define stx_ino_version
behaviour, why carve out an explicit exception for atime?

> Most applications won't be interested in every possible explicit change
> that can happen to an inode. It's likely these applications would check
> the parts of the inode they're interested in, and then go back to
> waiting for the next bump if the change wasn't significant to them.

Yes, that is exactly my point.

You make the argument that we must not bump iversion in certain
situations (atime) because it will cause spurious cache
invalidations, but then say it is OK to bump it in others regardless
of the fact that it will cause spurious cache invalidations. And you
justify this latter behaviour by saying it is up to the application
to avoid spurious invalidations by using "store and filter"
algorithms.

If the application has to store state and filter changes indicated
by stx_ino_version changing, then by definition *it must be capable
of filtering iversion bumps as a result of atime changes*.

The iversion exception carved out for atime requires the application
to implement "store and filter" algorithms only if it needs to care
about atime changes. The "invisible bump" exception carved out here
*requires* applications to implement "store and filter" algorithms
to filter out invisible bumps.

Hence if we combine both these behaviours, atime bumping iversion
appears to userspace exactly the same as "invisible bump occurred,
followed by access that changes atime".  IOWs, userspace cannot tell the
difference between a filesystem implementation that doesn't bump
iversion on atime but has invisible bump, and a filesystem that
bumps iversion on atime updates and so it always needs to filter
atime changes if it doesn't care about them.

Hence if stx_ino_version can have invisible bumps, it makes no
difference to userspace if atime updates bump iversion or not. They
will have to filter atime if they don't care about it, and they have
to store the new stx_ino_version every time they filter out an
invisible bump that doesn't change anything their filters care
about (e.g. atime!).

At which point I have to ask: if we are expecting userspace to
filter out invisible iversion bumps because that's allowed,
conformant behaviour, then why aren't we requiring both the NFS
server and IMA applications to filter spurious iversion bumps as
well?

> > Hence I think anything that bumps iversion outside the bounds of the
> > statx definition should be declared as such:
> > 
> > "Non-conformant iversion implementations:
> > 	- MUST NOT be exported by statx() to userspace
> > 	- MUST be -tolerated- by kernel internal applications that
> > 	  use iversion for their own purposes."
> > 
> 
> I think this is more strict than is needed. An implementation that bumps
> this value more often than is necessary is still useful.

I never said that non-conformant implementations aren't useful. What
I said is they aren't conformant with the provided definition of
stx_ino_version, and as a result we should not allow them to be
exposed to userspace.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time
  2022-08-30  0:08                       ` Dave Chinner
@ 2022-08-30 11:20                         ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 11:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Trond Myklebust, djwong, zohar, brauner, xiubli,
	neilb, linux-api, linux-xfs, dwysocha, linux-kernel, chuck.lever,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, ceph-devel

On Tue, 2022-08-30 at 10:08 +1000, Dave Chinner wrote:
> On Mon, Aug 29, 2022 at 06:33:48AM -0400, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 15:48 +1000, Dave Chinner wrote:
> > > > 
> > > > The race window ought to be relatively small, and this wouldn't result
> > > > in incorrect behavior that you'd notice (other than loss of
> > > > performance), but it's not ideal. We're doing more on-the-wire reads
> > > > than are necessary in this case.
> > > > 
> > > > It would be nice to have it not do that. If we end up taking this patch
> > > > to make it elide the i_version bumps on atime updates, we may be able to
> > > > set the the NOIVER flag in other cases as well, and avoid some of these
> > > > extra bumps.
> > > 
> > > 
> > > <sigh>
> > > 
> > > Please don't make me repeat myself for the third time.
> > > 
> > > Once we have decided on a solid, unchanging definition for the
> > > *statx user API variable*, we'll implement a new on-disk field that
> > > provides this information.  We will document it in the on-disk
> > > specification as "this is how di_iversion behaves" so that it is
> > > clear to everyone parsing the on-disk format or writing their own
> > > XFS driver how to implement it and when to expect it to
> > > change.
> > > 
> > > Then we can add a filesystem and inode feature flags that say "inode
> > > has new iversion" and we use that to populate the kernel iversion
> > > instead of di_changecount. We keep di_changecount exactly the way it
> > > is now for the applications and use cases we already have for that
> > > specific behaviour. If the kernel and/or filesystem don't support
> > > the new di_iversion field, then we'll use di_changecount as it
> > > currently exists for the kernel iversion code.
> > > 
> > 
> > Aside from NFS and IMA, what applications are dependent on the current
> > definition and how do they rely on i_version today?
> 
> I've answered this multiple times already: the di_changecount
> behaviour is defined in the on-disk specification and hence we
> *cannot change the behaviour* without changing the on-disk format
> specification.
> 
> Apart from the forensics aspect of the change counter (which nobody
> but us XFS developers seem to understand just how damn important
> this is), there are *many* third party applications that parse the
> XFS on-disk format directly. This:
> 
> https://codesearch.debian.net/search?q=XFS_SB_VERSION_DIRV2&literal=1
> 
> Shows grub2, libparted, syslinux, partclone and fsarchiver as
> knowing about XFS on-disk superblock flags that tell them what
> format the directory structure is in. That alone is enough to
> indicate they parse on-disk inodes directly, and hence may expect
> di_changecount to have specific meaning and use it to detect
> unexpected changes to files/directories they care about.
> 
> If I go looking for XFS_SB_MAGIC, I find things like libblkid,
> klibc, qemu, Xen, testdisk, gpart, and virtualbox all parse the
> on-disk superblocks directly from the block device, too. They also
> rely directly on XFS developers ensuring there are no silent
> incomaptible changes to the on disk format.
> 
> I also know of many other utilities that people and companies have
> written that parse the on disk format directly from userspace. The
> functions these perform include low level storage management tools,
> copying and managing disk images (e.g. offline configuration for
> cluster deployments), data recovery tools that scrape all the data
> out of broken filesystems, etc.
> 
> These applications are reliant on the guarantee we provide that the
> on-disk format will not silently change and that behaviour/structure
> can always easily be discovered by feature flags in the superblock
> and/or inodes.
> 
> IOWs, just because there aren't obvious "traditional" application on
> top of the kernel filesystem that consumes the in-memory kernel
> iversion field, it does not mean that the defined behaviour of the
> on-disk di_changecount field is not used or relied on by other tools
> that work directly on the on-disk format.
> 
> You might be right that NFS doesn't care about this, but the point
> remains that NFS does not control the XFS on-disk format, nor does
> the fact that what NFS wants from the change attribute has changed
> over time override the fact that maintaining XFS on-disk format
> compatibility is the responsibility of XFS developers. We're willing
> to change the on-disk format to support whatever the new definition
> of the statx change attribute ends up being, and that should be the
> end of the discussion.
> 

Thanks for spelling this out in more detail.

> > > Keep in mind that we've been doing dynamic inode format updates in
> > > XFS for a couple of decades - users don't even have to be aware that
> > > they need to perform format upgrades because often they just happen
> > > whenever an inode is accessed. IOWs, just because we have to change
> > > the on-disk format to support this new iversion definition, it
> > > doesn't mean users have to reformat filesystems before the new
> > > feature can be used.
> > > 
> > > Hence, over time, as distros update kernels, the XFS iversion
> > > behaviour will change automagically as we update inodes in existing
> > > filesystems as they are accessed to add and then use the new
> > > di_iversion field for the VFS change attribute field instead of the
> > > di_changecount field...
> > > 
> > 
> > If you want to create a whole new on-disk field for this, then that's
> > your prerogative, but before you do that, I'd like to better understand
> > why and how the constraints on this field changed.
> > 
> > The original log message from the commit that added a change counter
> > (below) stated that you were adding it for network filesystems like NFS.
> > When did this change and why?
> 
> It never changed. I'll repeat what I've already explained twice
> before:
> 
> https://lore.kernel.org/linux-xfs/20220818030048.GE3600936@dread.disaster.area/
> https://lore.kernel.org/linux-xfs/20220818033731.GF3600936@dread.disaster.area/
> 
> tl; dr: NFS requirements were just one of *many* we had at the time
> for an atomic persistent change counter.
> 
> The fact is that NFS users are just going to have to put up with
> random cache invalidations on XFS for a while longer. Nobody noticed
> this and/or cared about this enough to raise it as an issue for the
> past decade, so waiting another few months for upstream XFS to
> change to a different on-disk format for the NFS/statx change
> attribute isn't a big deal.
> 

Fair enough. I'll plan to drop this patch from the series for now, with
the expectation that you guys will add a new i_version counter that
better conforms to what NFS and IMA need.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-29 22:58       ` NeilBrown
@ 2022-08-30 11:40         ` Jeff Layton
  2022-08-30 13:24           ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 11:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Tue, 2022-08-30 at 08:58 +1000, NeilBrown wrote:
> On Mon, 29 Aug 2022, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but we're now proposing to expose it to userland via
> > > > statx. This means that we need a clear, consistent definition of
> > > > what it means and when it should change.
> > > > 
> > > > Update the comments in iversion.h to describe how a conformant
> > > > i_version implementation is expected to behave. This definition
> > > > suits the current users of i_version (NFSv4 and IMA), but is
> > > > loose enough to allow for a wide range of possible implementations.
> > > > 
> > > > Cc: Colin Walters <walters@verbum.org>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,19 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). The version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > > + * to an operation, then the i_version counter must be incremented as well.
> > > > + *
> > > > + * A conformant implementation is allowed to increment the counter in other
> > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > > + * whether caches are up to date. Spurious increments can cause false cache
> > > > + * invalidations.
> > > 
> > > "not optimal", but never-the-less allowed - that's "unspecified
> > > behaviour" if I've ever seen it. How is userspace supposed to
> > > know/deal with this?
> > > 
> > > Indeed, this loophole clause doesn't exist in the man pages that
> > > define what statx.stx_ino_version means. The man pages explicitly
> > > define that stx_ino_version only ever changes when stx_ctime
> > > changes.
> > > 
> > 
> > We can fix the manpage to make this more clear.
> > 
> > > IOWs, the behaviour userspace developers are going to expect *does
> > > not include* stx_ino_version changing it more often than ctime is
> > > changed. Hence a kernel iversion implementation that bumps the
> > > counter more often than ctime changes *is not conformant with the
> > > statx version counter specification*. IOWs, we can't export such
> > > behaviour to userspace *ever* - it is a non-conformant
> > > implementation.
> > > 
> > 
> > Nonsense. The statx version counter specification is *whatever we decide
> > to make it*. If we define it to allow for spurious version bumps, then
> > these implementations would be conformant.
> > 
> > Given that you can't tell what or how much changed in the inode whenever
> > the value changes, allowing it to be bumped on non-observable changes is
> > ok and the counter is still useful. When you see it change you need to
> > go stat/read/getxattr etc, to see what actually happened anyway.
> > 
> > Most applications won't be interested in every possible explicit change
> > that can happen to an inode. It's likely these applications would check
> > the parts of the inode they're interested in, and then go back to
> > waiting for the next bump if the change wasn't significant to them.
> > 
> > 
> > > Hence I think anything that bumps iversion outside the bounds of the
> > > statx definition should be declared as such:
> > > 
> > > "Non-conformant iversion implementations:
> > > 	- MUST NOT be exported by statx() to userspace
> > > 	- MUST be -tolerated- by kernel internal applications that
> > > 	  use iversion for their own purposes."
> > > 
> > 
> > I think this is more strict than is needed. An implementation that bumps
> > this value more often than is necessary is still useful. It's not
> > _ideal_, but it still meets the needs of NFSv4, IMA and other potential
> > users of it. After all, this is basically the definition of i_version
> > today and it's still useful, even if atime update i_version bumps are
> > currently harmful for performance.
> 
> Why do you want to let it be OK?  Who is hurt by it being "more strict
> than needed"?  There is an obvious cost in not being strict as an
> implementation can be compliant but completely useless (increment every
> nanosecond).  So there needs to be a clear benefit to balance this.  Who
> benefits by not being strict?
> 

Other filesystems that may not be able to provide the strict semantics
required. I don't have any names to name here -- I'm just trying to
ensure that we don't paint ourselves into a corner with rules that are
more strict than we really need.

If the consensus is that we should keep the definition strict, then Ican
live with that. That might narrow the number of filesystems that can
provide this attribute though.

> Also: Your spec doesn't say it must increase, only it must be different.
> So would as hash of all data and metadata be allowed (sysfs might be
> able to provide that, but probably wouldn't bother).
> 
> Also: if stray updates are still conformant, can occasional repeated
> values be still conformant?  I would like for a high-precision ctime
> timestamp to be acceptable, but as time can go backwards it is currently
> not conformant (even though the xfs iversion which is less useful is
> actually conformant).
> 

Yes, saying only that it must be different is intentional. What we
really want is for consumers to treat this as an opaque value for the
most part [1]. Therefore an implementation based on hashing would
conform to the spec, I'd think, as long as all of the relevant info is
part of the hash.

OTOH, hash collisions could be an issue here and I think we need to
avoid allowing duplicate values. When it comes to watching for changes
to an inode, false positives are generally ok since that should just
affect performance but not functionality.

False negatives are a different matter. They can lead to cache coherency
issues with NFS, or missing remeasurements in IMA. Userland applications
that use this could be subject to similar issues.
-- 
Jeff Layton <jlayton@kernel.org>


[1]: which may not be reasonable in the case of write delegations on
NFSv4, since the client is expected to increment it when it has cached
local changes.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30  1:04       ` Dave Chinner
@ 2022-08-30 12:38         ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 12:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, brauner, linux-api,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Tue, 2022-08-30 at 11:04 +1000, Dave Chinner wrote:
> On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but we're now proposing to expose it to userland via
> > > > statx. This means that we need a clear, consistent definition of
> > > > what it means and when it should change.
> > > > 
> > > > Update the comments in iversion.h to describe how a conformant
> > > > i_version implementation is expected to behave. This definition
> > > > suits the current users of i_version (NFSv4 and IMA), but is
> > > > loose enough to allow for a wide range of possible implementations.
> > > > 
> > > > Cc: Colin Walters <walters@verbum.org>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,19 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). The version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > > + * to an operation, then the i_version counter must be incremented as well.
> > > > + *
> > > > + * A conformant implementation is allowed to increment the counter in other
> > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > > + * whether caches are up to date. Spurious increments can cause false cache
> > > > + * invalidations.
> > > 
> > > "not optimal", but never-the-less allowed - that's "unspecified
> > > behaviour" if I've ever seen it. How is userspace supposed to
> > > know/deal with this?
> > > 
> > > Indeed, this loophole clause doesn't exist in the man pages that
> > > define what statx.stx_ino_version means. The man pages explicitly
> > > define that stx_ino_version only ever changes when stx_ctime
> > > changes.
> > > 
> > 
> > We can fix the manpage to make this more clear.
> > 
> > > IOWs, the behaviour userspace developers are going to expect *does
> > > not include* stx_ino_version changing it more often than ctime is
> > > changed. Hence a kernel iversion implementation that bumps the
> > > counter more often than ctime changes *is not conformant with the
> > > statx version counter specification*. IOWs, we can't export such
> > > behaviour to userspace *ever* - it is a non-conformant
> > > implementation.
> > > 
> > 
> > Nonsense. The statx version counter specification is *whatever we decide
> > to make it*.
> 
> Yes, but...
> 
> > If we define it to allow for spurious version bumps, then
> > these implementations would be conformant.
> 
> ... that's _not how you defined stx_ino_version to behave_!
> 

I certainly didn't say that it must _only_ be incremented when the ctime
would change, only that if the ctime would change that it must be
incremented.

The weasel-words make all the difference. But, point taken, the spec
should be explicit about this. I'll plan to revise the manpage patch and
resend it.

> > Given that you can't tell what or how much changed in the inode whenever
> > the value changes, allowing it to be bumped on non-observable changes is
> > ok and the counter is still useful. When you see it change you need to
> > go stat/read/getxattr etc, to see what actually happened anyway.
> 
> IDGI. If this is acceptible, then you're forcing userspace into
> "store and filter" implementations as the only viable method of
> using the change notification usefully.
> 

Well, that's all it's really useful for anyway. The counter itself has
tells you nothing other than that something changed.

> That means atime is just another attribute in the "store and
> filter" algorithm, so if this is how we define stx_ino_version
> behaviour, why carve out an explicit exception for atime?
> 

Because atime updates are particularly problematic. Ideally, we'd filter
out all implicit updates, but that may not be feasible in all cases.

> > Most applications won't be interested in every possible explicit change
> > that can happen to an inode. It's likely these applications would check
> > the parts of the inode they're interested in, and then go back to
> > waiting for the next bump if the change wasn't significant to them.
> 
> Yes, that is exactly my point.
> 
> You make the argument that we must not bump iversion in certain
> situations (atime) because it will cause spurious cache
> invalidations, but then say it is OK to bump it in others regardless
> of the fact that it will cause spurious cache invalidations. And you
> justify this latter behaviour by saying it is up to the application
> to avoid spurious invalidations by using "store and filter"
> algorithms.
> 
> If the application has to store state and filter changes indicated
> by stx_ino_version changing, then by definition *it must be capable
> of filtering iversion bumps as a result of atime changes*.
> 
> The iversion exception carved out for atime requires the application
> to implement "store and filter" algorithms only if it needs to care
> about atime changes. The "invisible bump" exception carved out here
> *requires* applications to implement "store and filter" algorithms
> to filter out invisible bumps.
> 
> Hence if we combine both these behaviours, atime bumping iversion
> appears to userspace exactly the same as "invisible bump occurred,
> followed by access that changes atime".  IOWs, userspace cannot tell the
> difference between a filesystem implementation that doesn't bump
> iversion on atime but has invisible bump, and a filesystem that
> bumps iversion on atime updates and so it always needs to filter
> atime changes if it doesn't care about them.
> 
> Hence if stx_ino_version can have invisible bumps, it makes no
> difference to userspace if atime updates bump iversion or not. They
> will have to filter atime if they don't care about it, and they have
> to store the new stx_ino_version every time they filter out an
> invisible bump that doesn't change anything their filters care
> about (e.g. atime!).
> 
> At which point I have to ask: if we are expecting userspace to
> filter out invisible iversion bumps because that's allowed,
> conformant behaviour, then why aren't we requiring both the NFS
> server and IMA applications to filter spurious iversion bumps as
> well?
> 

I think you're reading too much into my attempt to carve out atime from
i_version updates. That is purely a pragmatic attempt to staunch the
worst of the bleeding from this problem.

atime updates are _very_ frequent, and the default relatime behavior
ensures that each NFS client reading file data or dir info after a
change to it will end up downloading it at least twice: once to read the
file itself, and then again after it drops its cache due to the atime
update from the prior read.

In an ideal world, an implementation would only bump the i_version on
explicit changes. If an implicit change only happens infrequently, or
always happens in very close succession after an explicit change (such
that readers can't race in as easily) then that's less harmful for
performance.

The i_version value itself can't tell you anything about the inode. It's
only useful for comparing to an earlier sample to see if it is has
changed. This attribute is mostly useful in the context of a store-and-
invalidate kind of system. We want to keep the invalidations to a
minimum, but eliminating spurious updates is more of an optimization
problem than one of correctness.

It would be best if we could eliminate all spurious updates, but I think
the stx_ino_version is just as useful without being that strict, and
that leaves the door open for other implementations that aren't able to
filter out all spurious updates.

> > > Hence I think anything that bumps iversion outside the bounds of the
> > > statx definition should be declared as such:
> > > 
> > > "Non-conformant iversion implementations:
> > > 	- MUST NOT be exported by statx() to userspace
> > > 	- MUST be -tolerated- by kernel internal applications that
> > > 	  use iversion for their own purposes."
> > > 
> > 
> > I think this is more strict than is needed. An implementation that bumps
> > this value more often than is necessary is still useful.
> 
> I never said that non-conformant implementations aren't useful. What
> I said is they aren't conformant with the provided definition of
> stx_ino_version, and as a result we should not allow them to be
> exposed to userspace.

As I said above, I'll respin the manpage patch to better specify what a
conformant implementation can and can't do, and we can discuss from
there.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 11:40         ` Jeff Layton
@ 2022-08-30 13:24           ` J. Bruce Fields
  2022-08-30 13:50             ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 13:24 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, brauner,
	linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> Yes, saying only that it must be different is intentional. What we
> really want is for consumers to treat this as an opaque value for the
> most part [1]. Therefore an implementation based on hashing would
> conform to the spec, I'd think, as long as all of the relevant info is
> part of the hash.

It'd conform, but it might not be as useful as an increasing value.

E.g. a client can use that to work out which of a series of reordered
write replies is the most recent, and I seem to recall that can prevent
unnecessary invalidations in some cases.

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 13:24           ` J. Bruce Fields
@ 2022-08-30 13:50             ` Jeff Layton
  2022-08-30 14:44               ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 13:50 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: NeilBrown, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, brauner,
	linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > Yes, saying only that it must be different is intentional. What we
> > really want is for consumers to treat this as an opaque value for the
> > most part [1]. Therefore an implementation based on hashing would
> > conform to the spec, I'd think, as long as all of the relevant info is
> > part of the hash.
> 
> It'd conform, but it might not be as useful as an increasing value.
> 
> E.g. a client can use that to work out which of a series of reordered
> write replies is the most recent, and I seem to recall that can prevent
> unnecessary invalidations in some cases.
> 

That's a good point; the linux client does this. That said, NFSv4 has a
way for the server to advertise its change attribute behavior [1]
(though nfsd hasn't implemented this yet). We don't have a good way to
do that in userland for now.

This is another place where fsinfo() would have been nice to have. I
think until we have something like that, we'd want to keep our promises
to userland to a minimum.

[1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I
guess I should look at plumbing this in for IS_I_VERSION inodes...

-- 
Jeff Layton <jlayton@kernel.org>




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 13:50             ` Jeff Layton
@ 2022-08-30 14:44               ` J. Bruce Fields
  2022-08-30 14:58                 ` Trond Myklebust
  0 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 14:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, brauner,
	linux-api, linux-btrfs, linux-fsdevel, linux-kernel, linux-ceph,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > Yes, saying only that it must be different is intentional. What we
> > > really want is for consumers to treat this as an opaque value for the
> > > most part [1]. Therefore an implementation based on hashing would
> > > conform to the spec, I'd think, as long as all of the relevant info is
> > > part of the hash.
> > 
> > It'd conform, but it might not be as useful as an increasing value.
> > 
> > E.g. a client can use that to work out which of a series of reordered
> > write replies is the most recent, and I seem to recall that can prevent
> > unnecessary invalidations in some cases.
> > 
> 
> That's a good point; the linux client does this. That said, NFSv4 has a
> way for the server to advertise its change attribute behavior [1]
> (though nfsd hasn't implemented this yet).

It was implemented and reverted.  The issue was that I thought nfsd
should mix in the ctime to prevent the change attribute going backwards
on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond was
concerned about the possibility of time going backwards.  See
1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
There's some mailing list discussion to that I'm not turning up right
now.

Did NFSv4 add change_attr_type because some implementations needed the
unordered case, or because they realized ordering was useful but wanted
to keep backwards compatibility?  I don't know which it was.

--b.

> We don't have a good way to
> do that in userland for now.
> 
> This is another place where fsinfo() would have been nice to have. I
> think until we have something like that, we'd want to keep our promises
> to userland to a minimum.
> 
> [1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I
> guess I should look at plumbing this in for IS_I_VERSION inodes...
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 14:44               ` J. Bruce Fields
@ 2022-08-30 14:58                 ` Trond Myklebust
  2022-08-30 15:17                   ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2022-08-30 14:58 UTC (permalink / raw)
  To: bfields, jlayton
  Cc: zohar, djwong, xiubli, brauner, neilb, linux-api, linux-xfs,
	david, linux-kernel, chuck.lever, linux-ceph, linux-nfs, tytso,
	viro, jack, linux-ext4, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, walters

On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > Yes, saying only that it must be different is intentional. What
> > > > we
> > > > really want is for consumers to treat this as an opaque value
> > > > for the
> > > > most part [1]. Therefore an implementation based on hashing
> > > > would
> > > > conform to the spec, I'd think, as long as all of the relevant
> > > > info is
> > > > part of the hash.
> > > 
> > > It'd conform, but it might not be as useful as an increasing
> > > value.
> > > 
> > > E.g. a client can use that to work out which of a series of
> > > reordered
> > > write replies is the most recent, and I seem to recall that can
> > > prevent
> > > unnecessary invalidations in some cases.
> > > 
> > 
> > That's a good point; the linux client does this. That said, NFSv4
> > has a
> > way for the server to advertise its change attribute behavior [1]
> > (though nfsd hasn't implemented this yet).
> 
> It was implemented and reverted.  The issue was that I thought nfsd
> should mix in the ctime to prevent the change attribute going
> backwards
> on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond
> was
> concerned about the possibility of time going backwards.  See
> 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
> There's some mailing list discussion to that I'm not turning up right
> now.

My main concern was that some filesystems (e.g. ext3) were failing to
provide sufficient timestamp resolution to actually label the resulting
'change attribute' as being updated monotonically. If the time stamp
doesn't change when the file data or metadata are changed, then the
client has to perform extra checks to try to figure out whether or not
its caches are up to date.

> 
> Did NFSv4 add change_attr_type because some implementations needed
> the
> unordered case, or because they realized ordering was useful but
> wanted
> to keep backwards compatibility?  I don't know which it was.

We implemented it because, as implied above, knowledge of whether or
not the change attribute behaves monotonically, or strictly
monotonically, enables a number of optimisations.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 14:58                 ` Trond Myklebust
@ 2022-08-30 15:17                   ` J. Bruce Fields
  2022-08-30 15:43                     ` Trond Myklebust
  0 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 15:17 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: jlayton, zohar, djwong, xiubli, brauner, neilb, linux-api,
	linux-xfs, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > Yes, saying only that it must be different is intentional. What
> > > > > we
> > > > > really want is for consumers to treat this as an opaque value
> > > > > for the
> > > > > most part [1]. Therefore an implementation based on hashing
> > > > > would
> > > > > conform to the spec, I'd think, as long as all of the relevant
> > > > > info is
> > > > > part of the hash.
> > > > 
> > > > It'd conform, but it might not be as useful as an increasing
> > > > value.
> > > > 
> > > > E.g. a client can use that to work out which of a series of
> > > > reordered
> > > > write replies is the most recent, and I seem to recall that can
> > > > prevent
> > > > unnecessary invalidations in some cases.
> > > > 
> > > 
> > > That's a good point; the linux client does this. That said, NFSv4
> > > has a
> > > way for the server to advertise its change attribute behavior [1]
> > > (though nfsd hasn't implemented this yet).
> > 
> > It was implemented and reverted.  The issue was that I thought nfsd
> > should mix in the ctime to prevent the change attribute going
> > backwards
> > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond
> > was
> > concerned about the possibility of time going backwards.  See
> > 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
> > There's some mailing list discussion to that I'm not turning up right
> > now.

https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
is what I was thinking of but it isn't actually that interesting.

> My main concern was that some filesystems (e.g. ext3) were failing to
> provide sufficient timestamp resolution to actually label the resulting
> 'change attribute' as being updated monotonically. If the time stamp
> doesn't change when the file data or metadata are changed, then the
> client has to perform extra checks to try to figure out whether or not
> its caches are up to date.

That's a different issue from the one you were raising in that
discussion.

> > Did NFSv4 add change_attr_type because some implementations needed
> > the
> > unordered case, or because they realized ordering was useful but
> > wanted
> > to keep backwards compatibility?  I don't know which it was.
> 
> We implemented it because, as implied above, knowledge of whether or
> not the change attribute behaves monotonically, or strictly
> monotonically, enables a number of optimisations.

Of course, but my question was about the value of the old behavior, not
about the value of the monotonic behavior.

Put differently, if we could redesign the protocol from scratch would we
actually have included the option of non-monotonic behavior?

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 15:17                   ` J. Bruce Fields
@ 2022-08-30 15:43                     ` Trond Myklebust
  2022-08-30 17:02                       ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2022-08-30 15:43 UTC (permalink / raw)
  To: bfields
  Cc: zohar, djwong, xiubli, brauner, linux-xfs, linux-api, neilb,
	david, linux-kernel, jlayton, chuck.lever, linux-ceph, linux-nfs,
	tytso, viro, jack, linux-ext4, linux-btrfs, linux-fsdevel,
	lczerner, adilger.kernel, walters

On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > > Yes, saying only that it must be different is intentional.
> > > > > > What
> > > > > > we
> > > > > > really want is for consumers to treat this as an opaque
> > > > > > value
> > > > > > for the
> > > > > > most part [1]. Therefore an implementation based on hashing
> > > > > > would
> > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > relevant
> > > > > > info is
> > > > > > part of the hash.
> > > > > 
> > > > > It'd conform, but it might not be as useful as an increasing
> > > > > value.
> > > > > 
> > > > > E.g. a client can use that to work out which of a series of
> > > > > reordered
> > > > > write replies is the most recent, and I seem to recall that
> > > > > can
> > > > > prevent
> > > > > unnecessary invalidations in some cases.
> > > > > 
> > > > 
> > > > That's a good point; the linux client does this. That said,
> > > > NFSv4
> > > > has a
> > > > way for the server to advertise its change attribute behavior
> > > > [1]
> > > > (though nfsd hasn't implemented this yet).
> > > 
> > > It was implemented and reverted.  The issue was that I thought
> > > nfsd
> > > should mix in the ctime to prevent the change attribute going
> > > backwards
> > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > Trond
> > > was
> > > concerned about the possibility of time going backwards.  See
> > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > attribute"".
> > > There's some mailing list discussion to that I'm not turning up
> > > right
> > > now.
> 
> https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> is what I was thinking of but it isn't actually that interesting.
> 
> > My main concern was that some filesystems (e.g. ext3) were failing
> > to
> > provide sufficient timestamp resolution to actually label the
> > resulting
> > 'change attribute' as being updated monotonically. If the time
> > stamp
> > doesn't change when the file data or metadata are changed, then the
> > client has to perform extra checks to try to figure out whether or
> > not
> > its caches are up to date.
> 
> That's a different issue from the one you were raising in that
> discussion.
> 
> > > Did NFSv4 add change_attr_type because some implementations
> > > needed
> > > the
> > > unordered case, or because they realized ordering was useful but
> > > wanted
> > > to keep backwards compatibility?  I don't know which it was.
> > 
> > We implemented it because, as implied above, knowledge of whether
> > or
> > not the change attribute behaves monotonically, or strictly
> > monotonically, enables a number of optimisations.
> 
> Of course, but my question was about the value of the old behavior,
> not
> about the value of the monotonic behavior.
> 
> Put differently, if we could redesign the protocol from scratch would
> we
> actually have included the option of non-monotonic behavior?
> 

If we could design the filesystems from scratch, we probably would not.
The protocol ended up being as it is because people were trying to make
it as easy to implement as possible.

So if we could design the filesystem from scratch, we would have
probably designed it along the lines of what AFS does.
i.e. each explicit change is accompanied by a single bump of the change
attribute, so that the clients can not only decide the order of the
resulting changes, but also if they have missed a change (that might
have been made by a different client).

However that would be a requirement that is likely to be very specific
to distributed caches (and hence distributed filesystems). I doubt
there are many user space applications that would need that high
precision. Maybe MPI, but that's the only candidate I can think of for
now?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 15:43                     ` Trond Myklebust
@ 2022-08-30 17:02                       ` Jeff Layton
  2022-08-30 17:47                         ` Trond Myklebust
  2022-08-30 18:32                         ` J. Bruce Fields
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 17:02 UTC (permalink / raw)
  To: Trond Myklebust, bfields
  Cc: zohar, djwong, xiubli, brauner, linux-xfs, linux-api, neilb,
	david, linux-kernel, chuck.lever, linux-ceph, linux-nfs, tytso,
	viro, jack, linux-ext4, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, walters

On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > > > Yes, saying only that it must be different is intentional.
> > > > > > > What
> > > > > > > we
> > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > value
> > > > > > > for the
> > > > > > > most part [1]. Therefore an implementation based on hashing
> > > > > > > would
> > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > relevant
> > > > > > > info is
> > > > > > > part of the hash.
> > > > > > 
> > > > > > It'd conform, but it might not be as useful as an increasing
> > > > > > value.
> > > > > > 
> > > > > > E.g. a client can use that to work out which of a series of
> > > > > > reordered
> > > > > > write replies is the most recent, and I seem to recall that
> > > > > > can
> > > > > > prevent
> > > > > > unnecessary invalidations in some cases.
> > > > > > 
> > > > > 
> > > > > That's a good point; the linux client does this. That said,
> > > > > NFSv4
> > > > > has a
> > > > > way for the server to advertise its change attribute behavior
> > > > > [1]
> > > > > (though nfsd hasn't implemented this yet).
> > > > 
> > > > It was implemented and reverted.  The issue was that I thought
> > > > nfsd
> > > > should mix in the ctime to prevent the change attribute going
> > > > backwards
> > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > Trond
> > > > was
> > > > concerned about the possibility of time going backwards.  See
> > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > attribute"".
> > > > There's some mailing list discussion to that I'm not turning up
> > > > right
> > > > now.
> > 
> > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> > is what I was thinking of but it isn't actually that interesting.
> > 
> > > My main concern was that some filesystems (e.g. ext3) were failing
> > > to
> > > provide sufficient timestamp resolution to actually label the
> > > resulting
> > > 'change attribute' as being updated monotonically. If the time
> > > stamp
> > > doesn't change when the file data or metadata are changed, then the
> > > client has to perform extra checks to try to figure out whether or
> > > not
> > > its caches are up to date.
> > 
> > That's a different issue from the one you were raising in that
> > discussion.
> > 
> > > > Did NFSv4 add change_attr_type because some implementations
> > > > needed
> > > > the
> > > > unordered case, or because they realized ordering was useful but
> > > > wanted
> > > > to keep backwards compatibility?  I don't know which it was.
> > > 
> > > We implemented it because, as implied above, knowledge of whether
> > > or
> > > not the change attribute behaves monotonically, or strictly
> > > monotonically, enables a number of optimisations.
> > 
> > Of course, but my question was about the value of the old behavior,
> > not
> > about the value of the monotonic behavior.
> > 
> > Put differently, if we could redesign the protocol from scratch would
> > we
> > actually have included the option of non-monotonic behavior?
> > 
> 
> If we could design the filesystems from scratch, we probably would not.
> The protocol ended up being as it is because people were trying to make
> it as easy to implement as possible.
> 
> So if we could design the filesystem from scratch, we would have
> probably designed it along the lines of what AFS does.
> i.e. each explicit change is accompanied by a single bump of the change
> attribute, so that the clients can not only decide the order of the
> resulting changes, but also if they have missed a change (that might
> have been made by a different client).
> 
> However that would be a requirement that is likely to be very specific
> to distributed caches (and hence distributed filesystems). I doubt
> there are many user space applications that would need that high
> precision. Maybe MPI, but that's the only candidate I can think of for
> now?
> 

The fact that NFS kept this more loosely-defined is what allowed us to
elide some of the i_version bumps and regain a fair bit of performance
for local filesystems [1]. If the change attribute had been more
strictly defined like you mention, then that particular optimization
would not have been possible.

This sort of thing is why I'm a fan of not defining this any more
strictly than we require. Later on, maybe we'll come up with a way for
filesystems to advertise that they can offer stronger guarantees.
-- 
Jeff Layton <jlayton@kernel.org>

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 17:02                       ` Jeff Layton
@ 2022-08-30 17:47                         ` Trond Myklebust
  2022-08-30 17:53                           ` Jeff Layton
  2022-08-30 18:32                         ` J. Bruce Fields
  1 sibling, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2022-08-30 17:47 UTC (permalink / raw)
  To: bfields, jlayton
  Cc: zohar, djwong, brauner, linux-xfs, neilb, linux-api, david,
	linux-kernel, xiubli, chuck.lever, linux-ceph, linux-nfs, tytso,
	linux-ext4, jack, viro, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, walters

On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > wrote:
> > > > > > > > Yes, saying only that it must be different is
> > > > > > > > intentional.
> > > > > > > > What
> > > > > > > > we
> > > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > > value
> > > > > > > > for the
> > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > hashing
> > > > > > > > would
> > > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > > relevant
> > > > > > > > info is
> > > > > > > > part of the hash.
> > > > > > > 
> > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > increasing
> > > > > > > value.
> > > > > > > 
> > > > > > > E.g. a client can use that to work out which of a series
> > > > > > > of
> > > > > > > reordered
> > > > > > > write replies is the most recent, and I seem to recall
> > > > > > > that
> > > > > > > can
> > > > > > > prevent
> > > > > > > unnecessary invalidations in some cases.
> > > > > > > 
> > > > > > 
> > > > > > That's a good point; the linux client does this. That said,
> > > > > > NFSv4
> > > > > > has a
> > > > > > way for the server to advertise its change attribute
> > > > > > behavior
> > > > > > [1]
> > > > > > (though nfsd hasn't implemented this yet).
> > > > > 
> > > > > It was implemented and reverted.  The issue was that I
> > > > > thought
> > > > > nfsd
> > > > > should mix in the ctime to prevent the change attribute going
> > > > > backwards
> > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > > Trond
> > > > > was
> > > > > concerned about the possibility of time going backwards.  See
> > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > attribute"".
> > > > > There's some mailing list discussion to that I'm not turning
> > > > > up
> > > > > right
> > > > > now.
> > > 
> > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> > > is what I was thinking of but it isn't actually that interesting.
> > > 
> > > > My main concern was that some filesystems (e.g. ext3) were
> > > > failing
> > > > to
> > > > provide sufficient timestamp resolution to actually label the
> > > > resulting
> > > > 'change attribute' as being updated monotonically. If the time
> > > > stamp
> > > > doesn't change when the file data or metadata are changed, then
> > > > the
> > > > client has to perform extra checks to try to figure out whether
> > > > or
> > > > not
> > > > its caches are up to date.
> > > 
> > > That's a different issue from the one you were raising in that
> > > discussion.
> > > 
> > > > > Did NFSv4 add change_attr_type because some implementations
> > > > > needed
> > > > > the
> > > > > unordered case, or because they realized ordering was useful
> > > > > but
> > > > > wanted
> > > > > to keep backwards compatibility?  I don't know which it was.
> > > > 
> > > > We implemented it because, as implied above, knowledge of
> > > > whether
> > > > or
> > > > not the change attribute behaves monotonically, or strictly
> > > > monotonically, enables a number of optimisations.
> > > 
> > > Of course, but my question was about the value of the old
> > > behavior,
> > > not
> > > about the value of the monotonic behavior.
> > > 
> > > Put differently, if we could redesign the protocol from scratch
> > > would
> > > we
> > > actually have included the option of non-monotonic behavior?
> > > 
> > 
> > If we could design the filesystems from scratch, we probably would
> > not.
> > The protocol ended up being as it is because people were trying to
> > make
> > it as easy to implement as possible.
> > 
> > So if we could design the filesystem from scratch, we would have
> > probably designed it along the lines of what AFS does.
> > i.e. each explicit change is accompanied by a single bump of the
> > change
> > attribute, so that the clients can not only decide the order of the
> > resulting changes, but also if they have missed a change (that
> > might
> > have been made by a different client).
> > 
> > However that would be a requirement that is likely to be very
> > specific
> > to distributed caches (and hence distributed filesystems). I doubt
> > there are many user space applications that would need that high
> > precision. Maybe MPI, but that's the only candidate I can think of
> > for
> > now?
> > 
> 
> The fact that NFS kept this more loosely-defined is what allowed us
> to
> elide some of the i_version bumps and regain a fair bit of
> performance
> for local filesystems [1]. If the change attribute had been more
> strictly defined like you mention, then that particular optimization
> would not have been possible.
> 
> This sort of thing is why I'm a fan of not defining this any more
> strictly than we require. Later on, maybe we'll come up with a way
> for
> filesystems to advertise that they can offer stronger guarantees.

What 'eliding of the bumps' are we talking about here? If it results in
unreliable behaviour, then I propose we just drop the whole concept and
go back to using the ctime. The change attribute is only useful if it
results in a reliable mechanism for detecting changes. Once you "elide
away" the word "reliable", then it has no value beyond what ctime
already does.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 17:47                         ` Trond Myklebust
@ 2022-08-30 17:53                           ` Jeff Layton
  2022-08-30 18:25                             ` Trond Myklebust
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 17:53 UTC (permalink / raw)
  To: Trond Myklebust, bfields
  Cc: zohar, djwong, brauner, linux-xfs, neilb, linux-api, david,
	linux-kernel, xiubli, chuck.lever, linux-ceph, linux-nfs, tytso,
	linux-ext4, jack, viro, linux-btrfs, linux-fsdevel, lczerner,
	adilger.kernel, walters

On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > wrote:
> > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > intentional.
> > > > > > > > > What
> > > > > > > > > we
> > > > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > > > value
> > > > > > > > > for the
> > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > hashing
> > > > > > > > > would
> > > > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > > > relevant
> > > > > > > > > info is
> > > > > > > > > part of the hash.
> > > > > > > > 
> > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > increasing
> > > > > > > > value.
> > > > > > > > 
> > > > > > > > E.g. a client can use that to work out which of a series
> > > > > > > > of
> > > > > > > > reordered
> > > > > > > > write replies is the most recent, and I seem to recall
> > > > > > > > that
> > > > > > > > can
> > > > > > > > prevent
> > > > > > > > unnecessary invalidations in some cases.
> > > > > > > > 
> > > > > > > 
> > > > > > > That's a good point; the linux client does this. That said,
> > > > > > > NFSv4
> > > > > > > has a
> > > > > > > way for the server to advertise its change attribute
> > > > > > > behavior
> > > > > > > [1]
> > > > > > > (though nfsd hasn't implemented this yet).
> > > > > > 
> > > > > > It was implemented and reverted.  The issue was that I
> > > > > > thought
> > > > > > nfsd
> > > > > > should mix in the ctime to prevent the change attribute going
> > > > > > backwards
> > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > > > Trond
> > > > > > was
> > > > > > concerned about the possibility of time going backwards.  See
> > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > attribute"".
> > > > > > There's some mailing list discussion to that I'm not turning
> > > > > > up
> > > > > > right
> > > > > > now.
> > > > 
> > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> > > > is what I was thinking of but it isn't actually that interesting.
> > > > 
> > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > failing
> > > > > to
> > > > > provide sufficient timestamp resolution to actually label the
> > > > > resulting
> > > > > 'change attribute' as being updated monotonically. If the time
> > > > > stamp
> > > > > doesn't change when the file data or metadata are changed, then
> > > > > the
> > > > > client has to perform extra checks to try to figure out whether
> > > > > or
> > > > > not
> > > > > its caches are up to date.
> > > > 
> > > > That's a different issue from the one you were raising in that
> > > > discussion.
> > > > 
> > > > > > Did NFSv4 add change_attr_type because some implementations
> > > > > > needed
> > > > > > the
> > > > > > unordered case, or because they realized ordering was useful
> > > > > > but
> > > > > > wanted
> > > > > > to keep backwards compatibility?  I don't know which it was.
> > > > > 
> > > > > We implemented it because, as implied above, knowledge of
> > > > > whether
> > > > > or
> > > > > not the change attribute behaves monotonically, or strictly
> > > > > monotonically, enables a number of optimisations.
> > > > 
> > > > Of course, but my question was about the value of the old
> > > > behavior,
> > > > not
> > > > about the value of the monotonic behavior.
> > > > 
> > > > Put differently, if we could redesign the protocol from scratch
> > > > would
> > > > we
> > > > actually have included the option of non-monotonic behavior?
> > > > 
> > > 
> > > If we could design the filesystems from scratch, we probably would
> > > not.
> > > The protocol ended up being as it is because people were trying to
> > > make
> > > it as easy to implement as possible.
> > > 
> > > So if we could design the filesystem from scratch, we would have
> > > probably designed it along the lines of what AFS does.
> > > i.e. each explicit change is accompanied by a single bump of the
> > > change
> > > attribute, so that the clients can not only decide the order of the
> > > resulting changes, but also if they have missed a change (that
> > > might
> > > have been made by a different client).
> > > 
> > > However that would be a requirement that is likely to be very
> > > specific
> > > to distributed caches (and hence distributed filesystems). I doubt
> > > there are many user space applications that would need that high
> > > precision. Maybe MPI, but that's the only candidate I can think of
> > > for
> > > now?
> > > 
> > 
> > The fact that NFS kept this more loosely-defined is what allowed us
> > to
> > elide some of the i_version bumps and regain a fair bit of
> > performance
> > for local filesystems [1]. If the change attribute had been more
> > strictly defined like you mention, then that particular optimization
> > would not have been possible.
> > 
> > This sort of thing is why I'm a fan of not defining this any more
> > strictly than we require. Later on, maybe we'll come up with a way
> > for
> > filesystems to advertise that they can offer stronger guarantees.
> 
> What 'eliding of the bumps' are we talking about here? If it results in
> unreliable behaviour, then I propose we just drop the whole concept and
> go back to using the ctime. The change attribute is only useful if it
> results in a reliable mechanism for detecting changes. Once you "elide
> away" the word "reliable", then it has no value beyond what ctime
> already does.
> 

I'm talking about the scheme to optimize away i_version updates when the
current one has never been queried:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d

There's nothing unreliable about it.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 17:53                           ` Jeff Layton
@ 2022-08-30 18:25                             ` Trond Myklebust
  2022-08-30 19:11                               ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2022-08-30 18:25 UTC (permalink / raw)
  To: bfields, jlayton
  Cc: zohar, djwong, xiubli, linux-xfs, neilb, linux-api, david,
	linux-kernel, chuck.lever, linux-ceph, linux-nfs, linux-ext4,
	tytso, viro, jack, brauner, linux-fsdevel, lczerner,
	adilger.kernel, walters, linux-btrfs

On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton
> > > > > > > wrote:
> > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields
> > > > > > > > wrote:
> > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > > wrote:
> > > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > > intentional.
> > > > > > > > > > What
> > > > > > > > > > we
> > > > > > > > > > really want is for consumers to treat this as an
> > > > > > > > > > opaque
> > > > > > > > > > value
> > > > > > > > > > for the
> > > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > > hashing
> > > > > > > > > > would
> > > > > > > > > > conform to the spec, I'd think, as long as all of
> > > > > > > > > > the
> > > > > > > > > > relevant
> > > > > > > > > > info is
> > > > > > > > > > part of the hash.
> > > > > > > > > 
> > > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > > increasing
> > > > > > > > > value.
> > > > > > > > > 
> > > > > > > > > E.g. a client can use that to work out which of a
> > > > > > > > > series
> > > > > > > > > of
> > > > > > > > > reordered
> > > > > > > > > write replies is the most recent, and I seem to
> > > > > > > > > recall
> > > > > > > > > that
> > > > > > > > > can
> > > > > > > > > prevent
> > > > > > > > > unnecessary invalidations in some cases.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That's a good point; the linux client does this. That
> > > > > > > > said,
> > > > > > > > NFSv4
> > > > > > > > has a
> > > > > > > > way for the server to advertise its change attribute
> > > > > > > > behavior
> > > > > > > > [1]
> > > > > > > > (though nfsd hasn't implemented this yet).
> > > > > > > 
> > > > > > > It was implemented and reverted.  The issue was that I
> > > > > > > thought
> > > > > > > nfsd
> > > > > > > should mix in the ctime to prevent the change attribute
> > > > > > > going
> > > > > > > backwards
> > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()),
> > > > > > > but
> > > > > > > Trond
> > > > > > > was
> > > > > > > concerned about the possibility of time going backwards. 
> > > > > > > See
> > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > > attribute"".
> > > > > > > There's some mailing list discussion to that I'm not
> > > > > > > turning
> > > > > > > up
> > > > > > > right
> > > > > > > now.
> > > > > 
> > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> > > > > is what I was thinking of but it isn't actually that
> > > > > interesting.
> > > > > 
> > > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > > failing
> > > > > > to
> > > > > > provide sufficient timestamp resolution to actually label
> > > > > > the
> > > > > > resulting
> > > > > > 'change attribute' as being updated monotonically. If the
> > > > > > time
> > > > > > stamp
> > > > > > doesn't change when the file data or metadata are changed,
> > > > > > then
> > > > > > the
> > > > > > client has to perform extra checks to try to figure out
> > > > > > whether
> > > > > > or
> > > > > > not
> > > > > > its caches are up to date.
> > > > > 
> > > > > That's a different issue from the one you were raising in
> > > > > that
> > > > > discussion.
> > > > > 
> > > > > > > Did NFSv4 add change_attr_type because some
> > > > > > > implementations
> > > > > > > needed
> > > > > > > the
> > > > > > > unordered case, or because they realized ordering was
> > > > > > > useful
> > > > > > > but
> > > > > > > wanted
> > > > > > > to keep backwards compatibility?  I don't know which it
> > > > > > > was.
> > > > > > 
> > > > > > We implemented it because, as implied above, knowledge of
> > > > > > whether
> > > > > > or
> > > > > > not the change attribute behaves monotonically, or strictly
> > > > > > monotonically, enables a number of optimisations.
> > > > > 
> > > > > Of course, but my question was about the value of the old
> > > > > behavior,
> > > > > not
> > > > > about the value of the monotonic behavior.
> > > > > 
> > > > > Put differently, if we could redesign the protocol from
> > > > > scratch
> > > > > would
> > > > > we
> > > > > actually have included the option of non-monotonic behavior?
> > > > > 
> > > > 
> > > > If we could design the filesystems from scratch, we probably
> > > > would
> > > > not.
> > > > The protocol ended up being as it is because people were trying
> > > > to
> > > > make
> > > > it as easy to implement as possible.
> > > > 
> > > > So if we could design the filesystem from scratch, we would
> > > > have
> > > > probably designed it along the lines of what AFS does.
> > > > i.e. each explicit change is accompanied by a single bump of
> > > > the
> > > > change
> > > > attribute, so that the clients can not only decide the order of
> > > > the
> > > > resulting changes, but also if they have missed a change (that
> > > > might
> > > > have been made by a different client).
> > > > 
> > > > However that would be a requirement that is likely to be very
> > > > specific
> > > > to distributed caches (and hence distributed filesystems). I
> > > > doubt
> > > > there are many user space applications that would need that
> > > > high
> > > > precision. Maybe MPI, but that's the only candidate I can think
> > > > of
> > > > for
> > > > now?
> > > > 
> > > 
> > > The fact that NFS kept this more loosely-defined is what allowed
> > > us
> > > to
> > > elide some of the i_version bumps and regain a fair bit of
> > > performance
> > > for local filesystems [1]. If the change attribute had been more
> > > strictly defined like you mention, then that particular
> > > optimization
> > > would not have been possible.
> > > 
> > > This sort of thing is why I'm a fan of not defining this any more
> > > strictly than we require. Later on, maybe we'll come up with a
> > > way
> > > for
> > > filesystems to advertise that they can offer stronger guarantees.
> > 
> > What 'eliding of the bumps' are we talking about here? If it
> > results in
> > unreliable behaviour, then I propose we just drop the whole concept
> > and
> > go back to using the ctime. The change attribute is only useful if
> > it
> > results in a reliable mechanism for detecting changes. Once you
> > "elide
> > away" the word "reliable", then it has no value beyond what ctime
> > already does.
> > 
> 
> I'm talking about the scheme to optimize away i_version updates when
> the
> current one has never been queried:
> 
>    
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
> 
> There's nothing unreliable about it.

Not really seeing why that would be incompatible with the idea of
bumping on every change. The I_VERSION_QUERIED is just a hint to tell
you that at the very least you need to sync the next metadata update
after someone peeked at the value. You could still continue to cache
updates after that, and only sync them once a O_SYNC or an fsync() call
explicitly requires you to do so.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 17:02                       ` Jeff Layton
  2022-08-30 17:47                         ` Trond Myklebust
@ 2022-08-30 18:32                         ` J. Bruce Fields
  2022-08-30 19:30                           ` Jeff Layton
  1 sibling, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 18:32 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, zohar, djwong, xiubli, brauner, linux-xfs,
	linux-api, neilb, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote:
> The fact that NFS kept this more loosely-defined is what allowed us to
> elide some of the i_version bumps and regain a fair bit of performance
> for local filesystems [1]. If the change attribute had been more
> strictly defined like you mention, then that particular optimization
> would not have been possible.
> 
> This sort of thing is why I'm a fan of not defining this any more
> strictly than we require. Later on, maybe we'll come up with a way for
> filesystems to advertise that they can offer stronger guarantees.

Yeah, the afs change-attribute-as-counter thing seems ambitious--I
wouldn't even know how to define what exactly you're counting.

My one question is whether it'd be worth just defining the thing as
*increasing*.  That's a lower bar.

(Though admittedly we don't quite manage it now--see again 1631087ba872
"Revert "nfsd4: support change_attr_type attribute"".)

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 18:25                             ` Trond Myklebust
@ 2022-08-30 19:11                               ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 19:11 UTC (permalink / raw)
  To: Trond Myklebust, bfields
  Cc: zohar, djwong, xiubli, linux-xfs, neilb, linux-api, david,
	linux-kernel, chuck.lever, linux-ceph, linux-nfs, linux-ext4,
	tytso, viro, jack, brauner, linux-fsdevel, lczerner,
	adilger.kernel, walters, linux-btrfs

On Tue, 2022-08-30 at 18:25 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton
> > > > > > > > wrote:
> > > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > > > wrote:
> > > > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > > > intentional.
> > > > > > > > > > > What
> > > > > > > > > > > we
> > > > > > > > > > > really want is for consumers to treat this as an
> > > > > > > > > > > opaque
> > > > > > > > > > > value
> > > > > > > > > > > for the
> > > > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > > > hashing
> > > > > > > > > > > would
> > > > > > > > > > > conform to the spec, I'd think, as long as all of
> > > > > > > > > > > the
> > > > > > > > > > > relevant
> > > > > > > > > > > info is
> > > > > > > > > > > part of the hash.
> > > > > > > > > > 
> > > > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > > > increasing
> > > > > > > > > > value.
> > > > > > > > > > 
> > > > > > > > > > E.g. a client can use that to work out which of a
> > > > > > > > > > series
> > > > > > > > > > of
> > > > > > > > > > reordered
> > > > > > > > > > write replies is the most recent, and I seem to
> > > > > > > > > > recall
> > > > > > > > > > that
> > > > > > > > > > can
> > > > > > > > > > prevent
> > > > > > > > > > unnecessary invalidations in some cases.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > That's a good point; the linux client does this. That
> > > > > > > > > said,
> > > > > > > > > NFSv4
> > > > > > > > > has a
> > > > > > > > > way for the server to advertise its change attribute
> > > > > > > > > behavior
> > > > > > > > > [1]
> > > > > > > > > (though nfsd hasn't implemented this yet).
> > > > > > > > 
> > > > > > > > It was implemented and reverted.  The issue was that I
> > > > > > > > thought
> > > > > > > > nfsd
> > > > > > > > should mix in the ctime to prevent the change attribute
> > > > > > > > going
> > > > > > > > backwards
> > > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()),
> > > > > > > > but
> > > > > > > > Trond
> > > > > > > > was
> > > > > > > > concerned about the possibility of time going backwards. 
> > > > > > > > See
> > > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > > > attribute"".
> > > > > > > > There's some mailing list discussion to that I'm not
> > > > > > > > turning
> > > > > > > > up
> > > > > > > > right
> > > > > > > > now.
> > > > > > 
> > > > > > https://lore.kernel.org/linux-nfs/a6294c25cb5eb98193f609a52aa8f4b5d4e81279.camel@hammerspace.com/
> > > > > > is what I was thinking of but it isn't actually that
> > > > > > interesting.
> > > > > > 
> > > > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > > > failing
> > > > > > > to
> > > > > > > provide sufficient timestamp resolution to actually label
> > > > > > > the
> > > > > > > resulting
> > > > > > > 'change attribute' as being updated monotonically. If the
> > > > > > > time
> > > > > > > stamp
> > > > > > > doesn't change when the file data or metadata are changed,
> > > > > > > then
> > > > > > > the
> > > > > > > client has to perform extra checks to try to figure out
> > > > > > > whether
> > > > > > > or
> > > > > > > not
> > > > > > > its caches are up to date.
> > > > > > 
> > > > > > That's a different issue from the one you were raising in
> > > > > > that
> > > > > > discussion.
> > > > > > 
> > > > > > > > Did NFSv4 add change_attr_type because some
> > > > > > > > implementations
> > > > > > > > needed
> > > > > > > > the
> > > > > > > > unordered case, or because they realized ordering was
> > > > > > > > useful
> > > > > > > > but
> > > > > > > > wanted
> > > > > > > > to keep backwards compatibility?  I don't know which it
> > > > > > > > was.
> > > > > > > 
> > > > > > > We implemented it because, as implied above, knowledge of
> > > > > > > whether
> > > > > > > or
> > > > > > > not the change attribute behaves monotonically, or strictly
> > > > > > > monotonically, enables a number of optimisations.
> > > > > > 
> > > > > > Of course, but my question was about the value of the old
> > > > > > behavior,
> > > > > > not
> > > > > > about the value of the monotonic behavior.
> > > > > > 
> > > > > > Put differently, if we could redesign the protocol from
> > > > > > scratch
> > > > > > would
> > > > > > we
> > > > > > actually have included the option of non-monotonic behavior?
> > > > > > 
> > > > > 
> > > > > If we could design the filesystems from scratch, we probably
> > > > > would
> > > > > not.
> > > > > The protocol ended up being as it is because people were trying
> > > > > to
> > > > > make
> > > > > it as easy to implement as possible.
> > > > > 
> > > > > So if we could design the filesystem from scratch, we would
> > > > > have
> > > > > probably designed it along the lines of what AFS does.
> > > > > i.e. each explicit change is accompanied by a single bump of
> > > > > the
> > > > > change
> > > > > attribute, so that the clients can not only decide the order of
> > > > > the
> > > > > resulting changes, but also if they have missed a change (that
> > > > > might
> > > > > have been made by a different client).
> > > > > 
> > > > > However that would be a requirement that is likely to be very
> > > > > specific
> > > > > to distributed caches (and hence distributed filesystems). I
> > > > > doubt
> > > > > there are many user space applications that would need that
> > > > > high
> > > > > precision. Maybe MPI, but that's the only candidate I can think
> > > > > of
> > > > > for
> > > > > now?
> > > > > 
> > > > 
> > > > The fact that NFS kept this more loosely-defined is what allowed
> > > > us
> > > > to
> > > > elide some of the i_version bumps and regain a fair bit of
> > > > performance
> > > > for local filesystems [1]. If the change attribute had been more
> > > > strictly defined like you mention, then that particular
> > > > optimization
> > > > would not have been possible.
> > > > 
> > > > This sort of thing is why I'm a fan of not defining this any more
> > > > strictly than we require. Later on, maybe we'll come up with a
> > > > way
> > > > for
> > > > filesystems to advertise that they can offer stronger guarantees.
> > > 
> > > What 'eliding of the bumps' are we talking about here? If it
> > > results in
> > > unreliable behaviour, then I propose we just drop the whole concept
> > > and
> > > go back to using the ctime. The change attribute is only useful if
> > > it
> > > results in a reliable mechanism for detecting changes. Once you
> > > "elide
> > > away" the word "reliable", then it has no value beyond what ctime
> > > already does.
> > > 
> > 
> > I'm talking about the scheme to optimize away i_version updates when
> > the
> > current one has never been queried:
> > 
> >    
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
> > 
> > There's nothing unreliable about it.
> 
> Not really seeing why that would be incompatible with the idea of
> bumping on every change. The I_VERSION_QUERIED is just a hint to tell
> you that at the very least you need to sync the next metadata update
> after someone peeked at the value. You could still continue to cache
> updates after that, and only sync them once a O_SYNC or an fsync() call
> explicitly requires you to do so.
> 

Good point! It's not implemented that way today, but we could change it
to do that if it were useful. I think it'd be slightly more costly
CPU-wise when the update isn't going to disk, since you'd now have to
update the value on every change instead of just skipping it, but I
doubt anyone would notice the extra overhead.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 18:32                         ` J. Bruce Fields
@ 2022-08-30 19:30                           ` Jeff Layton
  2022-08-30 19:46                             ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 19:30 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, zohar, djwong, xiubli, brauner, linux-xfs,
	linux-api, neilb, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote:
> > The fact that NFS kept this more loosely-defined is what allowed us to
> > elide some of the i_version bumps and regain a fair bit of performance
> > for local filesystems [1]. If the change attribute had been more
> > strictly defined like you mention, then that particular optimization
> > would not have been possible.
> > 
> > This sort of thing is why I'm a fan of not defining this any more
> > strictly than we require. Later on, maybe we'll come up with a way for
> > filesystems to advertise that they can offer stronger guarantees.
> 
> Yeah, the afs change-attribute-as-counter thing seems ambitious--I
> wouldn't even know how to define what exactly you're counting.
> 
> My one question is whether it'd be worth just defining the thing as
> *increasing*.  That's a lower bar.
> 

That's a very good question.

One could argue that NFSv4 sort of requires that for write delegations
anyway. All of the existing implementations that I know of do this, so
that wouldn't rule any of them out.

I'm not opposed to adding that constraint. Let me think on it a bit
more.

> (Though admittedly we don't quite manage it now--see again 1631087ba872
> "Revert "nfsd4: support change_attr_type attribute"".)
> 

Factoring the ctime into the change attr seems wrong, since a clock jump
could make it go backward. Do you remember what drove that change (see
630458e730b8) ?

It seems like if the i_version were to go backward, then the ctime
probably would too, and you'd still see a duplicate change attr.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 19:30                           ` Jeff Layton
@ 2022-08-30 19:46                             ` J. Bruce Fields
  2022-08-30 19:57                               ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 19:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, zohar, djwong, xiubli, brauner, linux-xfs,
	linux-api, neilb, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote:
> > > The fact that NFS kept this more loosely-defined is what allowed us to
> > > elide some of the i_version bumps and regain a fair bit of performance
> > > for local filesystems [1]. If the change attribute had been more
> > > strictly defined like you mention, then that particular optimization
> > > would not have been possible.
> > > 
> > > This sort of thing is why I'm a fan of not defining this any more
> > > strictly than we require. Later on, maybe we'll come up with a way for
> > > filesystems to advertise that they can offer stronger guarantees.
> > 
> > Yeah, the afs change-attribute-as-counter thing seems ambitious--I
> > wouldn't even know how to define what exactly you're counting.
> > 
> > My one question is whether it'd be worth just defining the thing as
> > *increasing*.  That's a lower bar.
> > 
> 
> That's a very good question.
> 
> One could argue that NFSv4 sort of requires that for write delegations
> anyway. All of the existing implementations that I know of do this, so
> that wouldn't rule any of them out.
> 
> I'm not opposed to adding that constraint. Let me think on it a bit
> more.
> 
> > (Though admittedly we don't quite manage it now--see again 1631087ba872
> > "Revert "nfsd4: support change_attr_type attribute"".)
> > 
> 
> Factoring the ctime into the change attr seems wrong, since a clock jump
> could make it go backward. Do you remember what drove that change (see
> 630458e730b8) ?
> 
> It seems like if the i_version were to go backward, then the ctime
> probably would too, and you'd still see a duplicate change attr.

See the comment--I was worried about crashes: the change attribute isn't
on disk at the time the client requests it, so after a crash the client
may see it go backward.  (And then could see it repeat a value, possibly
with different file contents.)

Combining it with the ctime means we get something that behaves
correctly even in that case--unless the clock goes backwards.

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 19:46                             ` J. Bruce Fields
@ 2022-08-30 19:57                               ` Jeff Layton
  2022-08-30 20:08                                 ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2022-08-30 19:57 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, zohar, djwong, xiubli, brauner, linux-xfs,
	linux-api, neilb, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, 2022-08-30 at 15:46 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote:
> > > > The fact that NFS kept this more loosely-defined is what allowed us to
> > > > elide some of the i_version bumps and regain a fair bit of performance
> > > > for local filesystems [1]. If the change attribute had been more
> > > > strictly defined like you mention, then that particular optimization
> > > > would not have been possible.
> > > > 
> > > > This sort of thing is why I'm a fan of not defining this any more
> > > > strictly than we require. Later on, maybe we'll come up with a way for
> > > > filesystems to advertise that they can offer stronger guarantees.
> > > 
> > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I
> > > wouldn't even know how to define what exactly you're counting.
> > > 
> > > My one question is whether it'd be worth just defining the thing as
> > > *increasing*.  That's a lower bar.
> > > 
> > 
> > That's a very good question.
> > 
> > One could argue that NFSv4 sort of requires that for write delegations
> > anyway. All of the existing implementations that I know of do this, so
> > that wouldn't rule any of them out.
> > 
> > I'm not opposed to adding that constraint. Let me think on it a bit
> > more.
> > 
> > > (Though admittedly we don't quite manage it now--see again 1631087ba872
> > > "Revert "nfsd4: support change_attr_type attribute"".)
> > > 
> > 
> > Factoring the ctime into the change attr seems wrong, since a clock jump
> > could make it go backward. Do you remember what drove that change (see
> > 630458e730b8) ?
> > 
> > It seems like if the i_version were to go backward, then the ctime
> > probably would too, and you'd still see a duplicate change attr.
> 
> See the comment--I was worried about crashes: the change attribute isn't
> on disk at the time the client requests it, so after a crash the client
> may see it go backward.  (And then could see it repeat a value, possibly
> with different file contents.)
> 
> Combining it with the ctime means we get something that behaves
> correctly even in that case--unless the clock goes backwards.
> 

Yeah ok, I vaguely remember discussing this.

That seems like it has its own problem though. If you mix in the ctime
and the clock jumps backward, then you could end up with the same issue
(a stale changeid, different contents). No crash required.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v3 1/7] iversion: update comments with info about atime updates
  2022-08-30 19:57                               ` Jeff Layton
@ 2022-08-30 20:08                                 ` J. Bruce Fields
  0 siblings, 0 replies; 46+ messages in thread
From: J. Bruce Fields @ 2022-08-30 20:08 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, zohar, djwong, xiubli, brauner, linux-xfs,
	linux-api, neilb, david, linux-kernel, chuck.lever, linux-ceph,
	linux-nfs, tytso, viro, jack, linux-ext4, linux-btrfs,
	linux-fsdevel, lczerner, adilger.kernel, walters

On Tue, Aug 30, 2022 at 03:57:09PM -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 15:46 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 03:30:13PM -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-30 at 14:32 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 01:02:50PM -0400, Jeff Layton wrote:
> > > > > The fact that NFS kept this more loosely-defined is what allowed us to
> > > > > elide some of the i_version bumps and regain a fair bit of performance
> > > > > for local filesystems [1]. If the change attribute had been more
> > > > > strictly defined like you mention, then that particular optimization
> > > > > would not have been possible.
> > > > > 
> > > > > This sort of thing is why I'm a fan of not defining this any more
> > > > > strictly than we require. Later on, maybe we'll come up with a way for
> > > > > filesystems to advertise that they can offer stronger guarantees.
> > > > 
> > > > Yeah, the afs change-attribute-as-counter thing seems ambitious--I
> > > > wouldn't even know how to define what exactly you're counting.
> > > > 
> > > > My one question is whether it'd be worth just defining the thing as
> > > > *increasing*.  That's a lower bar.
> > > > 
> > > 
> > > That's a very good question.
> > > 
> > > One could argue that NFSv4 sort of requires that for write delegations
> > > anyway. All of the existing implementations that I know of do this, so
> > > that wouldn't rule any of them out.
> > > 
> > > I'm not opposed to adding that constraint. Let me think on it a bit
> > > more.
> > > 
> > > > (Though admittedly we don't quite manage it now--see again 1631087ba872
> > > > "Revert "nfsd4: support change_attr_type attribute"".)
> > > > 
> > > 
> > > Factoring the ctime into the change attr seems wrong, since a clock jump
> > > could make it go backward. Do you remember what drove that change (see
> > > 630458e730b8) ?
> > > 
> > > It seems like if the i_version were to go backward, then the ctime
> > > probably would too, and you'd still see a duplicate change attr.
> > 
> > See the comment--I was worried about crashes: the change attribute isn't
> > on disk at the time the client requests it, so after a crash the client
> > may see it go backward.  (And then could see it repeat a value, possibly
> > with different file contents.)
> > 
> > Combining it with the ctime means we get something that behaves
> > correctly even in that case--unless the clock goes backwards.
> > 
> 
> Yeah ok, I vaguely remember discussing this.
> 
> That seems like it has its own problem though. If you mix in the ctime
> and the clock jumps backward, then you could end up with the same issue
> (a stale changeid, different contents). No crash required.

Yes, exactly.

My feeling was that I've got no control over power failures and such,
but the clock setting is something I can get right.  So I'd rather have
something that depended on the latter than the former.

I could be wrong.

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-08-30 20:08 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-26 21:46 [PATCH v3 0/7] vfs: clean up i_version behavior and expose it via statx Jeff Layton
2022-08-26 21:46 ` [PATCH v3 1/7] iversion: update comments with info about atime updates Jeff Layton
2022-08-29  7:56   ` Dave Chinner
2022-08-29 10:39     ` Jeff Layton
2022-08-29 22:58       ` NeilBrown
2022-08-30 11:40         ` Jeff Layton
2022-08-30 13:24           ` J. Bruce Fields
2022-08-30 13:50             ` Jeff Layton
2022-08-30 14:44               ` J. Bruce Fields
2022-08-30 14:58                 ` Trond Myklebust
2022-08-30 15:17                   ` J. Bruce Fields
2022-08-30 15:43                     ` Trond Myklebust
2022-08-30 17:02                       ` Jeff Layton
2022-08-30 17:47                         ` Trond Myklebust
2022-08-30 17:53                           ` Jeff Layton
2022-08-30 18:25                             ` Trond Myklebust
2022-08-30 19:11                               ` Jeff Layton
2022-08-30 18:32                         ` J. Bruce Fields
2022-08-30 19:30                           ` Jeff Layton
2022-08-30 19:46                             ` J. Bruce Fields
2022-08-30 19:57                               ` Jeff Layton
2022-08-30 20:08                                 ` J. Bruce Fields
2022-08-30  1:04       ` Dave Chinner
2022-08-30 12:38         ` Jeff Layton
2022-08-26 21:46 ` [PATCH v3 2/7] ext4: fix i_version handling in ext4 Jeff Layton
2022-08-26 21:46 ` [PATCH v3 3/7] ext4: unconditionally enable the i_version counter Jeff Layton
2022-08-29 14:51   ` Jan Kara
2022-08-26 21:47 ` [PATCH v3 4/7] xfs: don't bump the i_version on an atime update in xfs_vn_update_time Jeff Layton
2022-08-27  7:26   ` Amir Goldstein
2022-08-27  8:01     ` Amir Goldstein
2022-08-27 13:14       ` Jeff Layton
2022-08-27 15:46         ` Darrick J. Wong
2022-08-27 16:03           ` Trond Myklebust
2022-08-27 16:10             ` Jeff Layton
2022-08-27 17:06               ` Trond Myklebust
2022-08-28 13:25               ` Amir Goldstein
2022-08-28 14:37                 ` Jeff Layton
2022-08-28 16:53                   ` Amir Goldstein
2022-08-29  5:48                   ` Dave Chinner
2022-08-29 10:33                     ` Jeff Layton
2022-08-30  0:08                       ` Dave Chinner
2022-08-30 11:20                         ` Jeff Layton
2022-08-28 17:30           ` Amir Goldstein
2022-08-26 21:47 ` [PATCH v3 5/7] vfs: report an inode version in statx for IS_I_VERSION inodes Jeff Layton
2022-08-26 21:47 ` [PATCH v3 6/7] nfs: report the inode version in statx if requested Jeff Layton
2022-08-26 21:47 ` [PATCH v3 7/7] ceph: fill in the change attribute in statx requests Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).