All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/9] fs: clean up handling of i_version counter
@ 2022-10-17 10:57 Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

This patchset is intended to clean up the handling of the i_version
counter by nfsd. Most of the changes are to internal interfaces.

This set is not intended to address crash resilience, or the fact that
the counter is bumped before a change and not after. I intend to tackle
those in follow-on patchsets.

My intention is to get this series included into linux-next soon, with
an eye toward merging most of it during the v6.2 merge window. The last
patch in the series is probably not suitable for merge as-is, at least
until we sort out the semantics we want to present to userland for it.

Jeff Layton (9):
  fs: uninline inode_query_iversion
  fs: clarify when the i_version counter must be updated
  vfs: plumb i_version handling into struct kstat
  nfs: report the inode version in getattr if requested
  ceph: report the inode version in getattr if requested
  nfsd: move nfsd4_change_attribute to nfsfh.c
  nfsd: use the getattr operation to fetch i_version
  nfsd: remove fetch_iversion export operation
  vfs: expose STATX_VERSION to userland

 fs/ceph/inode.c           | 16 +++++++----
 fs/libfs.c                | 36 ++++++++++++++++++++++++
 fs/nfs/export.c           |  7 -----
 fs/nfs/inode.c            | 15 +++++++---
 fs/nfsd/nfs4xdr.c         |  4 ++-
 fs/nfsd/nfsfh.c           | 42 ++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h           | 29 +-------------------
 fs/nfsd/vfs.h             |  7 ++++-
 fs/stat.c                 |  7 +++++
 include/linux/exportfs.h  |  1 -
 include/linux/iversion.h  | 58 ++++++++++++++-------------------------
 include/linux/stat.h      |  2 +-
 include/uapi/linux/stat.h |  6 ++--
 samples/vfs/test-statx.c  |  8 ++++--
 14 files changed, 148 insertions(+), 90 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v7 1/9] fs: uninline inode_query_iversion
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 2/9] fs: clarify when the i_version counter must be updated Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/iversion.h | 38 ++------------------------------------
 2 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 682d56345a1c..5ae81466a422 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1566,3 +1566,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
 	return true;
 }
 EXPORT_SYMBOL(inode_maybe_inc_iversion);
+
+/**
+ * inode_query_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison. This will guarantee
+ * that a later query of the i_version will result in a different value if
+ * anything has changed.
+ *
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
+ */
+u64 inode_query_iversion(struct inode *inode)
+{
+	u64 cur, new;
+
+	cur = inode_peek_iversion_raw(inode);
+	do {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED) {
+			/*
+			 * This barrier (and the implicit barrier in the
+			 * cmpxchg below) pairs with the barrier in
+			 * inode_maybe_inc_iversion().
+			 */
+			smp_mb();
+			break;
+		}
+
+		new = cur | I_VERSION_QUERIED;
+	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
+	return cur >> I_VERSION_QUERIED_SHIFT;
+}
+EXPORT_SYMBOL(inode_query_iversion);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e27bd4f55d84..6755d8b4f20b 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
 	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
-/**
- * inode_query_iversion - read i_version for later use
- * @inode: inode from which i_version should be read
- *
- * Read the inode i_version counter. This should be used by callers that wish
- * to store the returned i_version for later comparison. This will guarantee
- * that a later query of the i_version will result in a different value if
- * anything has changed.
- *
- * In this implementation, we fetch the current value, set the QUERIED flag and
- * then try to swap it into place with a cmpxchg, if it wasn't already set. If
- * that fails, we try again with the newly fetched value from the cmpxchg.
- */
-static inline u64
-inode_query_iversion(struct inode *inode)
-{
-	u64 cur, new;
-
-	cur = inode_peek_iversion_raw(inode);
-	do {
-		/* If flag is already set, then no need to swap */
-		if (cur & I_VERSION_QUERIED) {
-			/*
-			 * This barrier (and the implicit barrier in the
-			 * cmpxchg below) pairs with the barrier in
-			 * inode_maybe_inc_iversion().
-			 */
-			smp_mb();
-			break;
-		}
-
-		new = cur | I_VERSION_QUERIED;
-	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
-	return cur >> I_VERSION_QUERIED_SHIFT;
-}
-
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime:
@@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
 	return chattr;
 }
 
+u64 inode_query_iversion(struct inode *inode);
+
 /**
  * inode_eq_iversion_raw - check whether the raw i_version counter has changed
  * @inode: inode to check
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 2/9] fs: clarify when the i_version counter must be updated
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

The i_version field in the kernel has had different semantics over
the decades, but NFSv4 has certain expectations. Update the comments
in iversion.h to describe when the i_version must change.

Cc: Colin Walters <walters@verbum.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 6755d8b4f20b..94f4dc620d01 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,24 @@
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear larger to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
+ *
+ * An explicit change is one that would ordinarily result in a change to the
+ * inode status change time (aka ctime). i_version must appear to change, even
+ * if the ctime does not (since the whole point is to avoid missing updates due
+ * to timestamp granularity). If POSIX mandates that the ctime must change due
+ * to an operation, then the i_version counter must be incremented as well.
+ *
+ * Making the i_version update completely atomic with the operation itself would
+ * be prohibitively expensive. Traditionally the kernel has updated the times on
+ * directories after an operation that changes its contents. For regular files,
+ * the ctime is usually updated before the data is copied into the cache for a
+ * write. This means that there is a window of time when an observer can
+ * associate a new timestamp with old file contents. Since the purpose of the
+ * i_version is to allow for better cache coherency, the i_version must always
+ * be updated after the results of the operation are visible. Updating it before
+ * and after a change is also permitted.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 2/9] fs: clarify when the i_version counter must be updated Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 4/9] nfs: report the inode version in getattr if requested Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

From: Jeff Layton <jlayton@redhat.com>

The NFS server has a lot of special handling for different types of
change attribute access, depending on the underlying filesystem. In
most cases, it's doing a getattr anyway and then fetching that value
after the fact.

Rather that do that, add a new STATX_VERSION flag that is a kernel-only
symbol (for now). If requested and getattr can implement it, it can fill
out this field. For IS_I_VERSION inodes, add a generic implementation in
vfs_getattr_nosec. Take care to mask STATX_VERSION off in requests from
userland and in the result mask.

Since not all filesystems can give the same guarantees of monotonicity,
claim a STATX_ATTR_VERSION_MONOTONIC flag that filesystems can set to
indicate that they offer an i_version value that can never go backward.

Eventually if we decide to make the i_version available to userland, we
can just designate a field for it in struct statx, and move the
STATX_VERSION definition to the uapi header.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c            | 17 +++++++++++++++--
 include/linux/stat.h |  9 +++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index a7930d744483..e7f8cd4b24e1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -17,6 +17,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/compat.h>
+#include <linux/iversion.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -118,6 +119,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
 				  STATX_ATTR_DAX);
 
+	if ((request_mask & STATX_VERSION) && IS_I_VERSION(inode)) {
+		stat->result_mask |= STATX_VERSION;
+		stat->version = inode_query_iversion(inode);
+	}
+
 	mnt_userns = mnt_user_ns(path->mnt);
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(mnt_userns, path, stat,
@@ -587,9 +593,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 
 	memset(&tmp, 0, sizeof(tmp));
 
-	tmp.stx_mask = stat->result_mask;
+	/* STATX_VERSION is kernel-only for now */
+	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
 	tmp.stx_blksize = stat->blksize;
-	tmp.stx_attributes = stat->attributes;
+	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
+	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
 	tmp.stx_nlink = stat->nlink;
 	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
 	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -628,6 +636,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
 	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
 		return -EINVAL;
 
+	/* STATX_VERSION is kernel-only for now. Ignore requests
+	 * from userland.
+	 */
+	mask &= ~STATX_VERSION;
+
 	error = vfs_statx(dfd, filename, flags, &stat, mask);
 	if (error)
 		return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index ff277ced50e9..4e9428d86a3a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -52,6 +52,15 @@ struct kstat {
 	u64		mnt_id;
 	u32		dio_mem_align;
 	u32		dio_offset_align;
+	u64		version;
 };
 
+/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
+
+/* mask values */
+#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
+
+/* file attribute values */
+#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
+
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 4/9] nfs: report the inode version in getattr if requested
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (2 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 5/9] ceph: " Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Allow NFS to report the i_version in getattr requests. Since the cost to
fetch it is relatively cheap, do it unconditionally and just set the
flag if it looks like it's valid. Also, conditionally enable the
MONOTONIC flag when the server reports its change attr type as such.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/inode.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bea7c005119c..7ed1b4c9260a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
 		reply_mask |= STATX_UID | STATX_GID;
 	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
 		reply_mask |= STATX_BLOCKS;
+	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
+		reply_mask |= STATX_VERSION;
 	return reply_mask;
 }
 
@@ -848,7 +850,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
 			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
-			STATX_INO | STATX_SIZE | STATX_BLOCKS;
+			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_BTIME |
+			STATX_VERSION;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
 		if (readdirplus_enabled)
@@ -856,8 +859,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		goto out_no_revalidate;
 	}
 
-	/* Flush out writes to the server in order to update c/mtime.  */
-	if ((request_mask & (STATX_CTIME | STATX_MTIME)) &&
+	/* Flush out writes to the server in order to update c/mtime/version.  */
+	if ((request_mask & (STATX_CTIME | STATX_MTIME | STATX_VERSION)) &&
 	    S_ISREG(inode->i_mode))
 		filemap_write_and_wait(inode->i_mapping);
 
@@ -877,7 +880,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	/* Is the user requesting attributes that might need revalidation? */
 	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
 					STATX_MTIME|STATX_UID|STATX_GID|
-					STATX_SIZE|STATX_BLOCKS)))
+					STATX_SIZE|STATX_BLOCKS|STATX_VERSION)))
 		goto out_no_revalidate;
 
 	/* Check whether the cached attributes are stale */
@@ -915,6 +918,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+	stat->version = inode_peek_iversion_raw(inode);
+	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
+	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
+		stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
 	if (S_ISDIR(inode->i_mode))
 		stat->blksize = NFS_SERVER(inode)->dtsize;
 out:
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 5/9] ceph: report the inode version in getattr if requested
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (3 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 4/9] nfs: report the inode version in getattr if requested Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

When getattr requests the STX_VERSION, request the full gamut of caps
(similarly to how ctime is handled). When the change attribute seems to
be valid, return it in the ino_version field and set the flag in the
reply mask. Also, unconditionally enable STATX_ATTR_VERSION_MONOTONIC.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42351d7a0dd6..bcab855bf1ae 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2415,10 +2415,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 {
 	int mask = 0;
 
-	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME))
+	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME|STATX_VERSION))
 		mask |= CEPH_CAP_AUTH_SHARED;
 
-	if (want & (STATX_NLINK|STATX_CTIME)) {
+	if (want & (STATX_NLINK|STATX_CTIME|STATX_VERSION)) {
 		/*
 		 * The link count for directories depends on inode->i_subdirs,
 		 * and that is only updated when Fs caps are held.
@@ -2429,11 +2429,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 			mask |= CEPH_CAP_LINK_SHARED;
 	}
 
-	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
-		    STATX_BLOCKS))
+	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|STATX_BLOCKS|STATX_VERSION))
 		mask |= CEPH_CAP_FILE_SHARED;
 
-	if (want & (STATX_CTIME))
+	if (want & (STATX_CTIME|STATX_VERSION))
 		mask |= CEPH_CAP_XATTR_SHARED;
 
 	return mask;
@@ -2475,6 +2474,11 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		valid_mask |= STATX_BTIME;
 	}
 
+	if (request_mask & STATX_VERSION) {
+		stat->version = inode_peek_iversion_raw(inode);
+		valid_mask |= STATX_VERSION;
+	}
+
 	if (ceph_snap(inode) == CEPH_NOSNAP)
 		stat->dev = inode->i_sb->s_dev;
 	else
@@ -2498,6 +2502,8 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 			stat->nlink = 1 + 1 + ci->i_subdirs;
 	}
 
+	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
+	stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
 	stat->result_mask = request_mask & valid_mask;
 	return err;
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (4 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 5/9] ceph: " Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-11-03 15:42   ` Chuck Lever III
  2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

This is a pretty big function for inlining. Move it to being
non-inlined.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfsfh.c | 27 +++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h | 29 +----------------------------
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index d73434200df9..7030d9209903 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -748,3 +748,30 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
 		return FSIDSOURCE_UUID;
 	return FSIDSOURCE_DEV;
 }
+
+/*
+ * We could use i_version alone as the change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ */
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
+{
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+	else if (IS_I_VERSION(inode)) {
+		u64 chattr;
+
+		chattr =  stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+		return chattr;
+	} else
+		return time_to_chattr(&stat->ctime);
+}
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index c3ae6414fc5c..4c223a7a91d4 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
-static inline u64 nfsd4_change_attribute(struct kstat *stat,
-					 struct inode *inode)
-{
-	if (inode->i_sb->s_export_op->fetch_iversion)
-		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
-		return time_to_chattr(&stat->ctime);
-}
-
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
 extern void fh_fill_pre_attrs(struct svc_fh *fhp);
 extern void fh_fill_post_attrs(struct svc_fh *fhp);
 extern void fh_fill_both_attrs(struct svc_fh *fhp);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (5 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-11-03 15:42   ` Chuck Lever III
  2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Now that we can call into vfs_getattr to get the i_version field, use
that facility to fetch it instead of doing it in nfsd4_change_attribute.

Neil also pointed out recently that IS_I_VERSION directory operations
are always logged, and so we only need to mitigate the rollback problem
on regular files. Also, we don't need to factor in the ctime when
reexporting NFS or Ceph.

Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
with a v4 request. Then, instead of looking at IS_I_VERSION when
generating the change attr, look at the result mask and only use it if
STATX_VERSION is set.

Change nfsd4_change_attribute to only factor in the ctime if it's a regular
file and the fs doesn't advertise STATX_ATTR_VERSION_MONOTONIC.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4xdr.c |  4 +++-
 fs/nfsd/nfsfh.c   | 53 +++++++++++++++++++++++++++++++----------------
 fs/nfsd/vfs.h     |  7 ++++++-
 3 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index bcfeb1a922c0..c19b6b00b620 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2906,7 +2906,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			goto out;
 	}
 
-	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
+	err = vfs_getattr(&path, &stat,
+			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
+			  AT_STATX_SYNC_AS_STAT);
 	if (err)
 		goto out_nfserr;
 	if (!(stat.result_mask & STATX_BTIME))
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 7030d9209903..21b64ac97a06 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -628,6 +628,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
 		stat.mtime = inode->i_mtime;
 		stat.ctime = inode->i_ctime;
 		stat.size  = inode->i_size;
+		if (v4 && IS_I_VERSION(inode)) {
+			stat.version = inode_query_iversion(inode);
+			stat.result_mask |= STATX_VERSION;
+		}
 	}
 	if (v4)
 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
@@ -659,6 +663,10 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
 	if (err) {
 		fhp->fh_post_saved = false;
 		fhp->fh_post_attr.ctime = inode->i_ctime;
+		if (v4 && IS_I_VERSION(inode)) {
+			fhp->fh_post_attr.version = inode_query_iversion(inode);
+			fhp->fh_post_attr.result_mask |= STATX_VERSION;
+		}
 	} else
 		fhp->fh_post_saved = true;
 	if (v4)
@@ -750,28 +758,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
 }
 
 /*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
+ * We could use i_version alone as the change attribute.  However, i_version
+ * can go backwards on a regular file after an unclean shutdown.  On its own
+ * that doesn't necessarily cause a problem, but if i_version goes backwards
+ * and then is incremented again it could reuse a value that was previously
+ * used before boot, and a client who queried the two values might incorrectly
+ * assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as long as
+ * time doesn't go backwards we never reuse an old value. If the filesystem
+ * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
  *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
+ * We only need to do this for regular files as well. For directories, we
+ * assume that the new change attr is always logged to stable storage in some
+ * fashion before the results can be seen.
  */
 u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
 {
+	u64 chattr;
+
 	if (inode->i_sb->s_export_op->fetch_iversion)
 		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
-		return time_to_chattr(&stat->ctime);
+	if (stat->result_mask & STATX_VERSION) {
+		chattr = stat->version;
+
+		if (S_ISREG(inode->i_mode) &&
+		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
+			chattr += (u64)stat->ctime.tv_sec << 30;
+			chattr += stat->ctime.tv_nsec;
+		}
+	} else {
+		chattr = time_to_chattr(&stat->ctime);
+	}
+	return chattr;
 }
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 120521bc7b24..c98e13ec37b2 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
 
 static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 {
+	u32 request_mask = STATX_BASIC_STATS;
 	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
 			 .dentry = fh->fh_dentry};
-	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
+
+	if (fh->fh_maxsize == NFS4_FHSIZE)
+		request_mask |= (STATX_BTIME | STATX_VERSION);
+
+	return nfserrno(vfs_getattr(&p, stat, request_mask,
 				    AT_STATX_SYNC_AS_STAT));
 }
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v7 8/9] nfsd: remove fetch_iversion export operation
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (6 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-11-03 15:43   ` Chuck Lever III
  2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
  2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Now that the i_version counter is reported in struct kstat, there is no
need for this export operation.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/export.c          | 7 -------
 fs/nfsd/nfsfh.c          | 2 --
 include/linux/exportfs.h | 1 -
 3 files changed, 10 deletions(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 01596f2d0a1e..1a9d5aa51dfb 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
 	return parent;
 }
 
-static u64 nfs_fetch_iversion(struct inode *inode)
-{
-	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
-	return inode_peek_iversion_raw(inode);
-}
-
 const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.fetch_iversion = nfs_fetch_iversion,
 	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
 		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
 		EXPORT_OP_NOATOMIC_ATTR,
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 21b64ac97a06..9c1f697ffc72 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -777,8 +777,6 @@ u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
 {
 	u64 chattr;
 
-	if (inode->i_sb->s_export_op->fetch_iversion)
-		return inode->i_sb->s_export_op->fetch_iversion(inode);
 	if (stat->result_mask & STATX_VERSION) {
 		chattr = stat->version;
 
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index fe848901fcc3..9f4d4bcbf251 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,7 +213,6 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
-	u64 (*fetch_iversion)(struct inode *);
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
 #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (7 preceding siblings ...)
  2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
@ 2022-10-17 10:57 ` Jeff Layton
  2022-10-17 22:14   ` Dave Chinner
  2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-17 10:57 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

From: Jeff Layton <jlayton@redhat.com>

Claim one of the spare fields in struct statx to hold a 64-bit inode
version attribute. When userland requests STATX_VERSION, copy the
value from the kstat struct there, and stop masking off
STATX_ATTR_VERSION_MONOTONIC.

Update the test-statx sample program to output the change attr and
MountId.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c                 | 12 +++---------
 include/linux/stat.h      |  9 ---------
 include/uapi/linux/stat.h |  6 ++++--
 samples/vfs/test-statx.c  |  8 ++++++--
 4 files changed, 13 insertions(+), 22 deletions(-)

Posting this as an RFC as we're still trying to sort out what semantics
we want to present to userland. In particular, this patch leaves the
problem of crash resilience in to userland applications on filesystems
that don't report as MONOTONIC.

Trond is of the opinion that monotonicity is a hard requirement, and
that we should not allow filesystems that can't provide that quality to
report STATX_VERSION at all. His rationale is that one of the main uses
for this is for backup applications, and for those a counter that could
go backward is worse than useless.

I don't have strong feelings either way, but if we want that then we
will not be able to offload the crash counter handling to userland.

Thoughts?

diff --git a/fs/stat.c b/fs/stat.c
index e7f8cd4b24e1..8396c372022f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 
 	memset(&tmp, 0, sizeof(tmp));
 
-	/* STATX_VERSION is kernel-only for now */
-	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
+	tmp.stx_mask = stat->result_mask;
 	tmp.stx_blksize = stat->blksize;
-	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
-	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
+	tmp.stx_attributes = stat->attributes;
 	tmp.stx_nlink = stat->nlink;
 	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
 	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_mnt_id = stat->mnt_id;
 	tmp.stx_dio_mem_align = stat->dio_mem_align;
 	tmp.stx_dio_offset_align = stat->dio_offset_align;
+	tmp.stx_version = stat->version;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
@@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
 	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
 		return -EINVAL;
 
-	/* STATX_VERSION is kernel-only for now. Ignore requests
-	 * from userland.
-	 */
-	mask &= ~STATX_VERSION;
-
 	error = vfs_statx(dfd, filename, flags, &stat, mask);
 	if (error)
 		return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 4e9428d86a3a..69c79e4fd1b1 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -54,13 +54,4 @@ struct kstat {
 	u32		dio_offset_align;
 	u64		version;
 };
-
-/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
-
-/* mask values */
-#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
-
-/* file attribute values */
-#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
-
 #endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..4a0a1f27c059 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,8 @@ struct statx {
 	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u64	stx_version; /* Inode change attribute */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -154,6 +155,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_VERSION		0x00004000U	/* Want/got stx_version */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -189,6 +191,6 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
-
+#define STATX_ATTR_VERSION_MONOTONIC	0x00400000 /* stx_version increases w/ every change */
 
 #endif /* _UAPI_LINUX_STAT_H */
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..868c9394e038 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
 	printf("Device: %-15s", buffer);
 	if (stx->stx_mask & STATX_INO)
 		printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
+	if (stx->stx_mask & STATX_MNT_ID)
+		printf(" MountId: %llx", stx->stx_mnt_id);
 	if (stx->stx_mask & STATX_NLINK)
 		printf(" Links: %-5u", stx->stx_nlink);
 	if (stx->stx_mask & STATX_TYPE) {
@@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
 	if (stx->stx_mask & STATX_CTIME)
 		print_time("Change: ", &stx->stx_ctime);
 	if (stx->stx_mask & STATX_BTIME)
-		print_time(" Birth: ", &stx->stx_btime);
+		print_time("Birth: ", &stx->stx_btime);
+	if (stx->stx_mask & STATX_VERSION)
+		printf("Inode Version: %llu\n", stx->stx_version);
 
 	if (stx->stx_attributes_mask) {
 		unsigned char bits, mbits;
@@ -218,7 +222,7 @@ int main(int argc, char **argv)
 	struct statx stx;
 	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
 
-	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
 
 	for (argv++; *argv; argv++) {
 		if (strcmp(*argv, "-F") == 0) {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
@ 2022-10-17 22:14   ` Dave Chinner
  2022-10-18 10:35     ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2022-10-17 22:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Claim one of the spare fields in struct statx to hold a 64-bit inode
> version attribute. When userland requests STATX_VERSION, copy the
> value from the kstat struct there, and stop masking off
> STATX_ATTR_VERSION_MONOTONIC.

Can we please make the name more sepcific than "version"? It's way
too generic and - we already have userspace facing "version" fields
for inodes that refer to the on-disk format version exposed in
various UAPIs. It's common for UAPI structures used for file
operations to have a "version" field that refers to the *UAPI
structure version* rather than file metadata or data being retrieved
from the file in question.

The need for an explanatory comment like this:

> +	__u64	stx_version; /* Inode change attribute */

demonstrates it is badly named. If you want it known as an inode
change attribute, then don't name the variable "version". In
reality, it really needs to be an opaque cookie, not something
applications need to decode directly to make sense of.

> Update the test-statx sample program to output the change attr and
> MountId.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/stat.c                 | 12 +++---------
>  include/linux/stat.h      |  9 ---------
>  include/uapi/linux/stat.h |  6 ++++--
>  samples/vfs/test-statx.c  |  8 ++++++--
>  4 files changed, 13 insertions(+), 22 deletions(-)
> 
> Posting this as an RFC as we're still trying to sort out what semantics
> we want to present to userland. In particular, this patch leaves the
> problem of crash resilience in to userland applications on filesystems
> that don't report as MONOTONIC.

Firstly, if userspace wants to use the change attribute, they are
going to have to detect crashes themselves anyway because no fs in
the kernel can set the MONOTONIC flag right now and it may be years
before kernels/filesystems actually support it in production
systems.

But more fundamentally, I think this monotonic increase guarantee is
completely broken by the presence of snapshots and snapshot
rollbacks. If you change something, then a while later decide it
broke (e.g. a production system upgrade went awry) and you roll back
the filesystem to the pre-upgrade snapshot, then all the change
counters and m/ctimes are guaranteed to go backwards because they
will revert to the snapshot values. Maybe the filesystem can bump
some internal counter for the snapshot when the revert happens, but
until that is implemented, filesystems that support snapshots and
rollback can't assert MONOTONIC.

And that's worse for other filesystems, because if you put them on
dm-thinp and roll them back, they are completely unaware of the fact
that a rollback happened and there's *nothing* the filesystem can do
about this. Indeed, snapshots are suppose to be done on clean
filesystems so snapshot images don't require journal recovery, so
any crash detection put in the filesystem recovery code to guarantee
MONOTONIC behaviour will be soundly defeated by such block device
snapshot rollbacks.

Hence I think MONOTONIC is completely unworkable for most existing
filesystems because snapshots and rollbacks completely break the
underlying assumption MONOTONIC relies on: that filesystem
modifications always move forwards in both the time and modification
order dimensions....

This means that monotonicity is probably not acheivable by any
existing filesystem and so should not ever be mentioned in the UAPI.
I think userspace semantics can be simplified down to "if the change
cookie does not match exactly, caches are invalid" combined with
"applications are responsible for detecting temporal discontiguities
in filesystem presentation at start up (e.g. after a crash, unclean
shutdown, restoration from backup, snapshot rollback, etc) for
persistent cache invalidation purposes"....

> Trond is of the opinion that monotonicity is a hard requirement, and
> that we should not allow filesystems that can't provide that quality to
> report STATX_VERSION at all.  His rationale is that one of the main uses
> for this is for backup applications, and for those a counter that could
> go backward is worse than useless.

From the perspective of a backup program doing incremental backups,
an inode with a change counter that has a different value to the
current backup inventory means the file contains different
information than what the current backup inventory holds. Again,
snapshots, rollbacks, etc.

Therefore, regardless of whether the change counter has gone
forwards or backwards, the backup program needs to back up this
current version of the file in this backup because it is different
to the inventory copy.  Hence if the backup program fails to back it
up, it will not be creating an exact backup of the user's data at
the point in time the backup is run...

Hence I don't see that MONOTONIC is a requirement for backup
programs - they really do have to be able to handle filesystems that
have modifications that move backwards in time as well as forwards...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-17 22:14   ` Dave Chinner
@ 2022-10-18 10:35     ` Jeff Layton
  2022-10-18 13:49       ` Jan Kara
  2022-10-18 14:56       ` Jeff Layton
  0 siblings, 2 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-18 10:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@redhat.com>
> > 
> > Claim one of the spare fields in struct statx to hold a 64-bit inode
> > version attribute. When userland requests STATX_VERSION, copy the
> > value from the kstat struct there, and stop masking off
> > STATX_ATTR_VERSION_MONOTONIC.
> 
> Can we please make the name more sepcific than "version"? It's way
> too generic and - we already have userspace facing "version" fields
> for inodes that refer to the on-disk format version exposed in
> various UAPIs. It's common for UAPI structures used for file
> operations to have a "version" field that refers to the *UAPI
> structure version* rather than file metadata or data being retrieved
> from the file in question.
> 
> The need for an explanatory comment like this:
> 
> > +	__u64	stx_version; /* Inode change attribute */
> 
> demonstrates it is badly named. If you want it known as an inode
> change attribute, then don't name the variable "version". In
> reality, it really needs to be an opaque cookie, not something
> applications need to decode directly to make sense of.
> 

Fair enough. I started with this being named stx_change_attr and other
people objected. I then changed to stx_ino_version, but the "_ino"
seemed redundant.

I'm open to suggestions here. Naming things like this is hard.

> > Update the test-statx sample program to output the change attr and
> > MountId.
> > 
> > Reviewed-by: NeilBrown <neilb@suse.de>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/stat.c                 | 12 +++---------
> >  include/linux/stat.h      |  9 ---------
> >  include/uapi/linux/stat.h |  6 ++++--
> >  samples/vfs/test-statx.c  |  8 ++++++--
> >  4 files changed, 13 insertions(+), 22 deletions(-)
> > 
> > Posting this as an RFC as we're still trying to sort out what semantics
> > we want to present to userland. In particular, this patch leaves the
> > problem of crash resilience in to userland applications on filesystems
> > that don't report as MONOTONIC.
> 
> Firstly, if userspace wants to use the change attribute, they are
> going to have to detect crashes themselves anyway because no fs in
> the kernel can set the MONOTONIC flag right now and it may be years
> before kernels/filesystems actually support it in production
> systems.
> 

We can turn it on today in CephFS, NFS and tmpfs. Maybe also btrfs
(modulo the issue you point out with snapshots, of course).

> But more fundamentally, I think this monotonic increase guarantee is
> completely broken by the presence of snapshots and snapshot
> rollbacks. If you change something, then a while later decide it
> broke (e.g. a production system upgrade went awry) and you roll back
> the filesystem to the pre-upgrade snapshot, then all the change
> counters and m/ctimes are guaranteed to go backwards because they
> will revert to the snapshot values. Maybe the filesystem can bump
> some internal counter for the snapshot when the revert happens, but
> until that is implemented, filesystems that support snapshots and
> rollback can't assert MONOTONIC.
> 
> And that's worse for other filesystems, because if you put them on
> dm-thinp and roll them back, they are completely unaware of the fact
> that a rollback happened and there's *nothing* the filesystem can do
> about this. Indeed, snapshots are suppose to be done on clean
> filesystems so snapshot images don't require journal recovery, so
> any crash detection put in the filesystem recovery code to guarantee
> MONOTONIC behaviour will be soundly defeated by such block device
> snapshot rollbacks.
> 
> Hence I think MONOTONIC is completely unworkable for most existing
> filesystems because snapshots and rollbacks completely break the
> underlying assumption MONOTONIC relies on: that filesystem
> modifications always move forwards in both the time and modification
> order dimensions....
> 
> This means that monotonicity is probably not acheivable by any
> existing filesystem and so should not ever be mentioned in the UAPI.
> I think userspace semantics can be simplified down to "if the change
> cookie does not match exactly, caches are invalid" combined with
> "applications are responsible for detecting temporal discontiguities
> in filesystem presentation at start up (e.g. after a crash, unclean
> shutdown, restoration from backup, snapshot rollback, etc) for
> persistent cache invalidation purposes"....
> 

I don't think we can make any sort of blanket statement about
monotonicity in the face of snapshots. Restoring a snapshot (or a backup
for that matter) means restoring the filesystem to a particular point in
time in the past. I think it's reasonable to expect that the change
attrs may roll backward in the face of these sorts of events.

> > Trond is of the opinion that monotonicity is a hard requirement, and
> > that we should not allow filesystems that can't provide that quality to
> > report STATX_VERSION at all.  His rationale is that one of the main uses
> > for this is for backup applications, and for those a counter that could
> > go backward is worse than useless.
> 
> From the perspective of a backup program doing incremental backups,
> an inode with a change counter that has a different value to the
> current backup inventory means the file contains different
> information than what the current backup inventory holds. Again,
> snapshots, rollbacks, etc.
> 
> Therefore, regardless of whether the change counter has gone
> forwards or backwards, the backup program needs to back up this
> current version of the file in this backup because it is different
> to the inventory copy.  Hence if the backup program fails to back it
> up, it will not be creating an exact backup of the user's data at
> the point in time the backup is run...
> 
> Hence I don't see that MONOTONIC is a requirement for backup
> programs - they really do have to be able to handle filesystems that
> have modifications that move backwards in time as well as forwards...

Rolling backward is not a problem in and of itself. The big issue is
that after a crash, we can end up with a change attr seen before the
crash that is now associated with a completely different inode state.

The scenario is something like:

- Change attr for an empty file starts at 1

- Write "A" to file, change attr goes to 2

- Read and statx happens (client sees "A" with change attr 2)

- Crash (before last change is logged to disk)

- Machine reboots, inode is empty, change attr back to 1

- Write "B" to file, change attr goes to 2

- Client stat's file, sees change attr 2 and assumes its cache is
correct when it isn't (should be "B" not "A" now).

The real danger comes not from the thing going backward, but the fact
that it can march forward again after going backward, and then the
client can see two different inode states associated with the same
change attr value. Jumping all the change attributes forward by a
significant amount after a crash should avoid this issue.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 10:35     ` Jeff Layton
@ 2022-10-18 13:49       ` Jan Kara
  2022-10-18 14:21         ` Jeff Layton
  2022-10-18 14:56       ` Jeff Layton
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Kara @ 2022-10-18 13:49 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, bfields,
	brauner, fweimer, linux-btrfs, linux-fsdevel, linux-kernel,
	ceph-devel, linux-ext4, linux-nfs, linux-xfs

On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > that we should not allow filesystems that can't provide that quality to
> > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > for this is for backup applications, and for those a counter that could
> > > go backward is worse than useless.
> > 
> > From the perspective of a backup program doing incremental backups,
> > an inode with a change counter that has a different value to the
> > current backup inventory means the file contains different
> > information than what the current backup inventory holds. Again,
> > snapshots, rollbacks, etc.
> > 
> > Therefore, regardless of whether the change counter has gone
> > forwards or backwards, the backup program needs to back up this
> > current version of the file in this backup because it is different
> > to the inventory copy.  Hence if the backup program fails to back it
> > up, it will not be creating an exact backup of the user's data at
> > the point in time the backup is run...
> > 
> > Hence I don't see that MONOTONIC is a requirement for backup
> > programs - they really do have to be able to handle filesystems that
> > have modifications that move backwards in time as well as forwards...
> 
> Rolling backward is not a problem in and of itself. The big issue is
> that after a crash, we can end up with a change attr seen before the
> crash that is now associated with a completely different inode state.
> 
> The scenario is something like:
> 
> - Change attr for an empty file starts at 1
> 
> - Write "A" to file, change attr goes to 2
> 
> - Read and statx happens (client sees "A" with change attr 2)
> 
> - Crash (before last change is logged to disk)
> 
> - Machine reboots, inode is empty, change attr back to 1
> 
> - Write "B" to file, change attr goes to 2
> 
> - Client stat's file, sees change attr 2 and assumes its cache is
> correct when it isn't (should be "B" not "A" now).
> 
> The real danger comes not from the thing going backward, but the fact
> that it can march forward again after going backward, and then the
> client can see two different inode states associated with the same
> change attr value. Jumping all the change attributes forward by a
> significant amount after a crash should avoid this issue.

As Dave pointed out, the problem with change attr having the same value for
a different inode state (after going backwards) holds not only for the
crashes but also for restore from backups, fs snapshots, device snapshots
etc. So relying on change attr only looks a bit fragile. It works for the
common case but the edge cases are awkward and there's no easy way to
detect you are in the edge case.

So I think any implementation caring about data integrity would have to
include something like ctime into the picture anyway. Or we could just
completely give up any idea of monotonicity and on each mount select random
prime P < 2^64 and instead of doing inc when advancing the change
attribute, we'd advance it by P. That makes collisions after restore /
crash fairly unlikely.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 13:49       ` Jan Kara
@ 2022-10-18 14:21         ` Jeff Layton
  2022-10-18 15:17           ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-18 14:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > that we should not allow filesystems that can't provide that quality to
> > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > for this is for backup applications, and for those a counter that could
> > > > go backward is worse than useless.
> > > 
> > > From the perspective of a backup program doing incremental backups,
> > > an inode with a change counter that has a different value to the
> > > current backup inventory means the file contains different
> > > information than what the current backup inventory holds. Again,
> > > snapshots, rollbacks, etc.
> > > 
> > > Therefore, regardless of whether the change counter has gone
> > > forwards or backwards, the backup program needs to back up this
> > > current version of the file in this backup because it is different
> > > to the inventory copy.  Hence if the backup program fails to back it
> > > up, it will not be creating an exact backup of the user's data at
> > > the point in time the backup is run...
> > > 
> > > Hence I don't see that MONOTONIC is a requirement for backup
> > > programs - they really do have to be able to handle filesystems that
> > > have modifications that move backwards in time as well as forwards...
> > 
> > Rolling backward is not a problem in and of itself. The big issue is
> > that after a crash, we can end up with a change attr seen before the
> > crash that is now associated with a completely different inode state.
> > 
> > The scenario is something like:
> > 
> > - Change attr for an empty file starts at 1
> > 
> > - Write "A" to file, change attr goes to 2
> > 
> > - Read and statx happens (client sees "A" with change attr 2)
> > 
> > - Crash (before last change is logged to disk)
> > 
> > - Machine reboots, inode is empty, change attr back to 1
> > 
> > - Write "B" to file, change attr goes to 2
> > 
> > - Client stat's file, sees change attr 2 and assumes its cache is
> > correct when it isn't (should be "B" not "A" now).
> > 
> > The real danger comes not from the thing going backward, but the fact
> > that it can march forward again after going backward, and then the
> > client can see two different inode states associated with the same
> > change attr value. Jumping all the change attributes forward by a
> > significant amount after a crash should avoid this issue.
> 
> As Dave pointed out, the problem with change attr having the same value for
> a different inode state (after going backwards) holds not only for the
> crashes but also for restore from backups, fs snapshots, device snapshots
> etc. So relying on change attr only looks a bit fragile. It works for the
> common case but the edge cases are awkward and there's no easy way to
> detect you are in the edge case.
> 

This is true. In fact in the snapshot case you can't even rely on doing
anything at reboot since you won't necessarily need to reboot to make it
roll backward.

Whether that obviates the use of this value altogether, I'm not sure.

> So I think any implementation caring about data integrity would have to
> include something like ctime into the picture anyway. Or we could just
> completely give up any idea of monotonicity and on each mount select random
> prime P < 2^64 and instead of doing inc when advancing the change
> attribute, we'd advance it by P. That makes collisions after restore /
> crash fairly unlikely.

Part of the goal (at least for NFS) is to avoid unnecessary cache
invalidations.

If we just increment it by a particular offset on every reboot, then
every time the server reboots, the clients will invalidate all of their
cached inodes, and proceed to hammer the server with READ calls just as
it's having to populate its own caches from disk.

IOW, that will not be good for performance. Doing that after a crash is
also less than ideal, but crashes should (hopefully) be rare enough that
it's not a major issue.

In any case, we need to decide whether and what to present to userland.
There is a lot of disagreement here, and we need to come to a consensus.
I think we have to answer 2 questions:

1/ Is this counter useful enough on its own, without any baked-in
rollback resilience to justify exposing it via statx()?

2/ if the answer above is "yes", then is there any value to the
MONOTONIC flag, given that we can't really do anything about snapshot
rollbacks and the like?

I tend to be in the camp of "let's make the raw counter available and
leave it up to userland to deal with the potential issues". After all,
the c/mtime are still widely used today to detect changes and they have
many of the same problems.

Trying to do anything more elaborate than that will lead to a lot of
extra complexity in the kernel, and make it a lot more difficult for any
filesystem to report this at all.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 10:35     ` Jeff Layton
  2022-10-18 13:49       ` Jan Kara
@ 2022-10-18 14:56       ` Jeff Layton
  1 sibling, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-18 14:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: tytso, adilger.kernel, djwong, trondmy, neilb, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-18 at 06:35 -0400, Jeff Layton wrote:
> On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > From: Jeff Layton <jlayton@redhat.com>
> > > 
> > > Claim one of the spare fields in struct statx to hold a 64-bit inode
> > > version attribute. When userland requests STATX_VERSION, copy the
> > > value from the kstat struct there, and stop masking off
> > > STATX_ATTR_VERSION_MONOTONIC.
> > 
> > Can we please make the name more sepcific than "version"? It's way
> > too generic and - we already have userspace facing "version" fields
> > for inodes that refer to the on-disk format version exposed in
> > various UAPIs. It's common for UAPI structures used for file
> > operations to have a "version" field that refers to the *UAPI
> > structure version* rather than file metadata or data being retrieved
> > from the file in question.
> > 
> > The need for an explanatory comment like this:
> > 
> > > +	__u64	stx_version; /* Inode change attribute */
> > 
> > demonstrates it is badly named. If you want it known as an inode
> > change attribute, then don't name the variable "version". In
> > reality, it really needs to be an opaque cookie, not something
> > applications need to decode directly to make sense of.
> > 
> 
> Fair enough. I started with this being named stx_change_attr and other
> people objected. I then changed to stx_ino_version, but the "_ino"
> seemed redundant.
> 
> I'm open to suggestions here. Naming things like this is hard.
> 

How about:

    STATX_CHANGE / statx->stx_change / STATX_ATTR_CHANGE_MONOTONIC

?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 14:21         ` Jeff Layton
@ 2022-10-18 15:17           ` Jan Kara
  2022-10-18 17:04             ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2022-10-18 15:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	neilb, viro, zohar, xiubli, chuck.lever, lczerner, bfields,
	brauner, fweimer, linux-btrfs, linux-fsdevel, linux-kernel,
	ceph-devel, linux-ext4, linux-nfs, linux-xfs

On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > that we should not allow filesystems that can't provide that quality to
> > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > for this is for backup applications, and for those a counter that could
> > > > > go backward is worse than useless.
> > > > 
> > > > From the perspective of a backup program doing incremental backups,
> > > > an inode with a change counter that has a different value to the
> > > > current backup inventory means the file contains different
> > > > information than what the current backup inventory holds. Again,
> > > > snapshots, rollbacks, etc.
> > > > 
> > > > Therefore, regardless of whether the change counter has gone
> > > > forwards or backwards, the backup program needs to back up this
> > > > current version of the file in this backup because it is different
> > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > up, it will not be creating an exact backup of the user's data at
> > > > the point in time the backup is run...
> > > > 
> > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > programs - they really do have to be able to handle filesystems that
> > > > have modifications that move backwards in time as well as forwards...
> > > 
> > > Rolling backward is not a problem in and of itself. The big issue is
> > > that after a crash, we can end up with a change attr seen before the
> > > crash that is now associated with a completely different inode state.
> > > 
> > > The scenario is something like:
> > > 
> > > - Change attr for an empty file starts at 1
> > > 
> > > - Write "A" to file, change attr goes to 2
> > > 
> > > - Read and statx happens (client sees "A" with change attr 2)
> > > 
> > > - Crash (before last change is logged to disk)
> > > 
> > > - Machine reboots, inode is empty, change attr back to 1
> > > 
> > > - Write "B" to file, change attr goes to 2
> > > 
> > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > correct when it isn't (should be "B" not "A" now).
> > > 
> > > The real danger comes not from the thing going backward, but the fact
> > > that it can march forward again after going backward, and then the
> > > client can see two different inode states associated with the same
> > > change attr value. Jumping all the change attributes forward by a
> > > significant amount after a crash should avoid this issue.
> > 
> > As Dave pointed out, the problem with change attr having the same value for
> > a different inode state (after going backwards) holds not only for the
> > crashes but also for restore from backups, fs snapshots, device snapshots
> > etc. So relying on change attr only looks a bit fragile. It works for the
> > common case but the edge cases are awkward and there's no easy way to
> > detect you are in the edge case.
> > 
> 
> This is true. In fact in the snapshot case you can't even rely on doing
> anything at reboot since you won't necessarily need to reboot to make it
> roll backward.
> 
> Whether that obviates the use of this value altogether, I'm not sure.
> 
> > So I think any implementation caring about data integrity would have to
> > include something like ctime into the picture anyway. Or we could just
> > completely give up any idea of monotonicity and on each mount select random
> > prime P < 2^64 and instead of doing inc when advancing the change
> > attribute, we'd advance it by P. That makes collisions after restore /
> > crash fairly unlikely.
> 
> Part of the goal (at least for NFS) is to avoid unnecessary cache
> invalidations.
> 
> If we just increment it by a particular offset on every reboot, then
> every time the server reboots, the clients will invalidate all of their
> cached inodes, and proceed to hammer the server with READ calls just as
> it's having to populate its own caches from disk.

Note that I didn't propose to increment by offset on every reboot or mount.
I have proposed that inode_maybe_inc_iversion() would not increment
i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
<< I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
on filesystem mount.

This will not cause cache invalidation after a clean unmount + remount. It
will cause cache invalidation after a crash, snapshot rollback etc., only for
inodes where i_version changed. If P is suitably selected (e.g. as being a
prime), then the chances of collisions (even after a snapshot rollback) are
very low (on the order of 2^(-50) if my piece of envelope calculations are
right).

So this should nicely deal with all the problems we've spotted so far. But
I may be missing something...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 15:17           ` Jan Kara
@ 2022-10-18 17:04             ` Jeff Layton
  2022-10-19 17:23               ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-18 17:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > for this is for backup applications, and for those a counter that could
> > > > > > go backward is worse than useless.
> > > > > 
> > > > > From the perspective of a backup program doing incremental backups,
> > > > > an inode with a change counter that has a different value to the
> > > > > current backup inventory means the file contains different
> > > > > information than what the current backup inventory holds. Again,
> > > > > snapshots, rollbacks, etc.
> > > > > 
> > > > > Therefore, regardless of whether the change counter has gone
> > > > > forwards or backwards, the backup program needs to back up this
> > > > > current version of the file in this backup because it is different
> > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > up, it will not be creating an exact backup of the user's data at
> > > > > the point in time the backup is run...
> > > > > 
> > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > programs - they really do have to be able to handle filesystems that
> > > > > have modifications that move backwards in time as well as forwards...
> > > > 
> > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > that after a crash, we can end up with a change attr seen before the
> > > > crash that is now associated with a completely different inode state.
> > > > 
> > > > The scenario is something like:
> > > > 
> > > > - Change attr for an empty file starts at 1
> > > > 
> > > > - Write "A" to file, change attr goes to 2
> > > > 
> > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > 
> > > > - Crash (before last change is logged to disk)
> > > > 
> > > > - Machine reboots, inode is empty, change attr back to 1
> > > > 
> > > > - Write "B" to file, change attr goes to 2
> > > > 
> > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > correct when it isn't (should be "B" not "A" now).
> > > > 
> > > > The real danger comes not from the thing going backward, but the fact
> > > > that it can march forward again after going backward, and then the
> > > > client can see two different inode states associated with the same
> > > > change attr value. Jumping all the change attributes forward by a
> > > > significant amount after a crash should avoid this issue.
> > > 
> > > As Dave pointed out, the problem with change attr having the same value for
> > > a different inode state (after going backwards) holds not only for the
> > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > common case but the edge cases are awkward and there's no easy way to
> > > detect you are in the edge case.
> > > 
> > 
> > This is true. In fact in the snapshot case you can't even rely on doing
> > anything at reboot since you won't necessarily need to reboot to make it
> > roll backward.
> > 
> > Whether that obviates the use of this value altogether, I'm not sure.
> > 
> > > So I think any implementation caring about data integrity would have to
> > > include something like ctime into the picture anyway. Or we could just
> > > completely give up any idea of monotonicity and on each mount select random
> > > prime P < 2^64 and instead of doing inc when advancing the change
> > > attribute, we'd advance it by P. That makes collisions after restore /
> > > crash fairly unlikely.
> > 
> > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > invalidations.
> > 
> > If we just increment it by a particular offset on every reboot, then
> > every time the server reboots, the clients will invalidate all of their
> > cached inodes, and proceed to hammer the server with READ calls just as
> > it's having to populate its own caches from disk.
> 
> Note that I didn't propose to increment by offset on every reboot or mount.
> I have proposed that inode_maybe_inc_iversion() would not increment
> i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> on filesystem mount.
> 
> This will not cause cache invalidation after a clean unmount + remount. It
> will cause cache invalidation after a crash, snapshot rollback etc., only for
> inodes where i_version changed. If P is suitably selected (e.g. as being a
> prime), then the chances of collisions (even after a snapshot rollback) are
> very low (on the order of 2^(-50) if my piece of envelope calculations are
> right).
>
> So this should nicely deal with all the problems we've spotted so far. But
> I may be missing something...


Got it! That makes a lot more sense. Thinking about this some more...

What sort of range for P would be suitable?

Every increment would need to be by (shifted) P, so we can't choose too
large a number. Queries are pretty rare vs. writes though, so that
mitigates the issue somewhat.

There are 31 primes between 1 and 127. Worst case, we'd still have 2^48
increments before the counter wraps.

Let me think about this some more, but maybe that's good enough to
ensure uniqueness.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 0/9] fs: clean up handling of i_version counter
  2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
                   ` (8 preceding siblings ...)
  2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
@ 2022-10-19 11:13 ` Christian Brauner
  2022-10-19 12:18   ` Jeff Layton
  9 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-10-19 11:13 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Mon, Oct 17, 2022 at 06:57:00AM -0400, Jeff Layton wrote:
> This patchset is intended to clean up the handling of the i_version
> counter by nfsd. Most of the changes are to internal interfaces.
> 
> This set is not intended to address crash resilience, or the fact that
> the counter is bumped before a change and not after. I intend to tackle
> those in follow-on patchsets.
> 
> My intention is to get this series included into linux-next soon, with
> an eye toward merging most of it during the v6.2 merge window. The last
> patch in the series is probably not suitable for merge as-is, at least
> until we sort out the semantics we want to present to userland for it.

Over the course of the series I struggled a bit - and sorry for losing
focus - with what i_version is supposed to represent for userspace. So I
would support not exposing it to userspace before that. But that
shouldn't affect your other changes iiuc.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 0/9] fs: clean up handling of i_version counter
  2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
@ 2022-10-19 12:18   ` Jeff Layton
  2022-10-19 15:45     ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-19 12:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Wed, 2022-10-19 at 13:13 +0200, Christian Brauner wrote:
> On Mon, Oct 17, 2022 at 06:57:00AM -0400, Jeff Layton wrote:
> > This patchset is intended to clean up the handling of the i_version
> > counter by nfsd. Most of the changes are to internal interfaces.
> > 
> > This set is not intended to address crash resilience, or the fact that
> > the counter is bumped before a change and not after. I intend to tackle
> > those in follow-on patchsets.
> > 
> > My intention is to get this series included into linux-next soon, with
> > an eye toward merging most of it during the v6.2 merge window. The last
> > patch in the series is probably not suitable for merge as-is, at least
> > until we sort out the semantics we want to present to userland for it.
> 
> Over the course of the series I struggled a bit - and sorry for losing
> focus - with what i_version is supposed to represent for userspace. So I
> would support not exposing it to userspace before that. But that
> shouldn't affect your other changes iiuc.

Thanks Christian,

It has been a real struggle to nail this down, and yeah I too am not
planning to expose this to userland until we have this much better
defined. Patch #9 is just to give you an idea of what this would
ultimately look like. I intend to re-post the first 8 patches with an
eye toward merge in v6.2, once we've settled on the naming. On that
note...

I believe you had mentioned that you didn't like STATX_CHANGE_ATTR for
the name, and suggested STATX_I_VERSION (or something similar), which I
later shortened to STATX_VERSION.

Dave C. objected to STATX_VERSION, as "version" fields in a struct
usually refer to the version of the struct itself rather than the
version of the thing it describes. It also sort of implies a monotonic
counter, and I'm not ready to require that just yet.

What about STATX_CHANGE for the name (with corresponding names for the
field and other flags)? That drops the redundant "_ATTR" postfix, while
being sufficiently vague to allow for alternative implementations in the
future.

Do you (or anyone else) have other suggestions for a name?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 0/9] fs: clean up handling of i_version counter
  2022-10-19 12:18   ` Jeff Layton
@ 2022-10-19 15:45     ` Darrick J. Wong
  2022-10-19 20:36       ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2022-10-19 15:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, tytso, adilger.kernel, david, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, bfields,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Wed, Oct 19, 2022 at 08:18:15AM -0400, Jeff Layton wrote:
> On Wed, 2022-10-19 at 13:13 +0200, Christian Brauner wrote:
> > On Mon, Oct 17, 2022 at 06:57:00AM -0400, Jeff Layton wrote:
> > > This patchset is intended to clean up the handling of the i_version
> > > counter by nfsd. Most of the changes are to internal interfaces.
> > > 
> > > This set is not intended to address crash resilience, or the fact that
> > > the counter is bumped before a change and not after. I intend to tackle
> > > those in follow-on patchsets.
> > > 
> > > My intention is to get this series included into linux-next soon, with
> > > an eye toward merging most of it during the v6.2 merge window. The last
> > > patch in the series is probably not suitable for merge as-is, at least
> > > until we sort out the semantics we want to present to userland for it.
> > 
> > Over the course of the series I struggled a bit - and sorry for losing
> > focus - with what i_version is supposed to represent for userspace. So I
> > would support not exposing it to userspace before that. But that
> > shouldn't affect your other changes iiuc.
> 
> Thanks Christian,
> 
> It has been a real struggle to nail this down, and yeah I too am not
> planning to expose this to userland until we have this much better
> defined. Patch #9 is just to give you an idea of what this would
> ultimately look like. I intend to re-post the first 8 patches with an
> eye toward merge in v6.2, once we've settled on the naming. On that
> note...
> 
> I believe you had mentioned that you didn't like STATX_CHANGE_ATTR for
> the name, and suggested STATX_I_VERSION (or something similar), which I
> later shortened to STATX_VERSION.
> 
> Dave C. objected to STATX_VERSION, as "version" fields in a struct
> usually refer to the version of the struct itself rather than the
> version of the thing it describes. It also sort of implies a monotonic
> counter, and I'm not ready to require that just yet.
> 
> What about STATX_CHANGE for the name (with corresponding names for the
> field and other flags)? That drops the redundant "_ATTR" postfix, while
> being sufficiently vague to allow for alternative implementations in the
> future.
> 
> Do you (or anyone else) have other suggestions for a name?

Welllll it's really a u32 whose value doesn't have any intrinsic meaning
other than "if (value_now != value_before) flush_cache();" right?
I think it really only tracks changes to file data, right?

STATX_CHANGE_COOKIE	(wait, does this cookie augment i_ctime?)

STATX_MOD_COOKIE	(...or just file modifications/i_mtime?)

STATX_MONITOR_COOKIE	(...what are we monitoring??)

STATX_MON_COOKIE

STATX_COOKIE_MON

STATX_COOKIE_MONSTER

There we go. ;)

In seriousness, I'd probably go with one of the first two.  I wouldn't
be opposed to the last one, either, but others may disagree. ;)

--D

> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-18 17:04             ` Jeff Layton
@ 2022-10-19 17:23               ` Jan Kara
  2022-10-19 18:47                 ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2022-10-19 17:23 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	neilb, viro, zohar, xiubli, chuck.lever, lczerner, bfields,
	brauner, fweimer, linux-btrfs, linux-fsdevel, linux-kernel,
	ceph-devel, linux-ext4, linux-nfs, linux-xfs

On Tue 18-10-22 13:04:34, Jeff Layton wrote:
> On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> > On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > > for this is for backup applications, and for those a counter that could
> > > > > > > go backward is worse than useless.
> > > > > > 
> > > > > > From the perspective of a backup program doing incremental backups,
> > > > > > an inode with a change counter that has a different value to the
> > > > > > current backup inventory means the file contains different
> > > > > > information than what the current backup inventory holds. Again,
> > > > > > snapshots, rollbacks, etc.
> > > > > > 
> > > > > > Therefore, regardless of whether the change counter has gone
> > > > > > forwards or backwards, the backup program needs to back up this
> > > > > > current version of the file in this backup because it is different
> > > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > > up, it will not be creating an exact backup of the user's data at
> > > > > > the point in time the backup is run...
> > > > > > 
> > > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > > programs - they really do have to be able to handle filesystems that
> > > > > > have modifications that move backwards in time as well as forwards...
> > > > > 
> > > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > > that after a crash, we can end up with a change attr seen before the
> > > > > crash that is now associated with a completely different inode state.
> > > > > 
> > > > > The scenario is something like:
> > > > > 
> > > > > - Change attr for an empty file starts at 1
> > > > > 
> > > > > - Write "A" to file, change attr goes to 2
> > > > > 
> > > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > > 
> > > > > - Crash (before last change is logged to disk)
> > > > > 
> > > > > - Machine reboots, inode is empty, change attr back to 1
> > > > > 
> > > > > - Write "B" to file, change attr goes to 2
> > > > > 
> > > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > > correct when it isn't (should be "B" not "A" now).
> > > > > 
> > > > > The real danger comes not from the thing going backward, but the fact
> > > > > that it can march forward again after going backward, and then the
> > > > > client can see two different inode states associated with the same
> > > > > change attr value. Jumping all the change attributes forward by a
> > > > > significant amount after a crash should avoid this issue.
> > > > 
> > > > As Dave pointed out, the problem with change attr having the same value for
> > > > a different inode state (after going backwards) holds not only for the
> > > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > > common case but the edge cases are awkward and there's no easy way to
> > > > detect you are in the edge case.
> > > > 
> > > 
> > > This is true. In fact in the snapshot case you can't even rely on doing
> > > anything at reboot since you won't necessarily need to reboot to make it
> > > roll backward.
> > > 
> > > Whether that obviates the use of this value altogether, I'm not sure.
> > > 
> > > > So I think any implementation caring about data integrity would have to
> > > > include something like ctime into the picture anyway. Or we could just
> > > > completely give up any idea of monotonicity and on each mount select random
> > > > prime P < 2^64 and instead of doing inc when advancing the change
> > > > attribute, we'd advance it by P. That makes collisions after restore /
> > > > crash fairly unlikely.
> > > 
> > > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > > invalidations.
> > > 
> > > If we just increment it by a particular offset on every reboot, then
> > > every time the server reboots, the clients will invalidate all of their
> > > cached inodes, and proceed to hammer the server with READ calls just as
> > > it's having to populate its own caches from disk.
> > 
> > Note that I didn't propose to increment by offset on every reboot or mount.
> > I have proposed that inode_maybe_inc_iversion() would not increment
> > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> > on filesystem mount.
> > 
> > This will not cause cache invalidation after a clean unmount + remount. It
> > will cause cache invalidation after a crash, snapshot rollback etc., only for
> > inodes where i_version changed. If P is suitably selected (e.g. as being a
> > prime), then the chances of collisions (even after a snapshot rollback) are
> > very low (on the order of 2^(-50) if my piece of envelope calculations are
> > right).
> >
> > So this should nicely deal with all the problems we've spotted so far. But
> > I may be missing something...
> 
> 
> Got it! That makes a lot more sense. Thinking about this some more...
> 
> What sort of range for P would be suitable?
> 
> Every increment would need to be by (shifted) P, so we can't choose too
> large a number. Queries are pretty rare vs. writes though, so that
> mitigates the issue somewhat.

Well, I agree that for large P the counter would wrap earlier. But is that
a problem? Note that if P is a prime (indivisible by 2 is enough), then the
counter would get to already used value still only after 2^63 steps. Thus if
we give up monotonicity and just treat the counter as an opaque cookie, we
do not have to care about wrapping.

Sure given different P is selected for each mount the wrapping argument
does not hold 100% but here comes the advantage of primes - if you have two
different primes P and Q, then a collision means that k*P mod 2^63 = l*Q
mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the
chances of early collision even after selecting a different prime on each
mount are *very* low.

So I think we should select from a relatively large set of primes so that
the chance of randomly selecting the same prime (and thus reissuing the
same change attr for different inode state sometime later) are small.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-19 17:23               ` Jan Kara
@ 2022-10-19 18:47                 ` Jeff Layton
  2022-10-20 10:39                   ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-19 18:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote:
> On Tue 18-10-22 13:04:34, Jeff Layton wrote:
> > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > > > for this is for backup applications, and for those a counter that could
> > > > > > > > go backward is worse than useless.
> > > > > > > 
> > > > > > > From the perspective of a backup program doing incremental backups,
> > > > > > > an inode with a change counter that has a different value to the
> > > > > > > current backup inventory means the file contains different
> > > > > > > information than what the current backup inventory holds. Again,
> > > > > > > snapshots, rollbacks, etc.
> > > > > > > 
> > > > > > > Therefore, regardless of whether the change counter has gone
> > > > > > > forwards or backwards, the backup program needs to back up this
> > > > > > > current version of the file in this backup because it is different
> > > > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > > > up, it will not be creating an exact backup of the user's data at
> > > > > > > the point in time the backup is run...
> > > > > > > 
> > > > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > > > programs - they really do have to be able to handle filesystems that
> > > > > > > have modifications that move backwards in time as well as forwards...
> > > > > > 
> > > > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > > > that after a crash, we can end up with a change attr seen before the
> > > > > > crash that is now associated with a completely different inode state.
> > > > > > 
> > > > > > The scenario is something like:
> > > > > > 
> > > > > > - Change attr for an empty file starts at 1
> > > > > > 
> > > > > > - Write "A" to file, change attr goes to 2
> > > > > > 
> > > > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > > > 
> > > > > > - Crash (before last change is logged to disk)
> > > > > > 
> > > > > > - Machine reboots, inode is empty, change attr back to 1
> > > > > > 
> > > > > > - Write "B" to file, change attr goes to 2
> > > > > > 
> > > > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > > > correct when it isn't (should be "B" not "A" now).
> > > > > > 
> > > > > > The real danger comes not from the thing going backward, but the fact
> > > > > > that it can march forward again after going backward, and then the
> > > > > > client can see two different inode states associated with the same
> > > > > > change attr value. Jumping all the change attributes forward by a
> > > > > > significant amount after a crash should avoid this issue.
> > > > > 
> > > > > As Dave pointed out, the problem with change attr having the same value for
> > > > > a different inode state (after going backwards) holds not only for the
> > > > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > > > common case but the edge cases are awkward and there's no easy way to
> > > > > detect you are in the edge case.
> > > > > 
> > > > 
> > > > This is true. In fact in the snapshot case you can't even rely on doing
> > > > anything at reboot since you won't necessarily need to reboot to make it
> > > > roll backward.
> > > > 
> > > > Whether that obviates the use of this value altogether, I'm not sure.
> > > > 
> > > > > So I think any implementation caring about data integrity would have to
> > > > > include something like ctime into the picture anyway. Or we could just
> > > > > completely give up any idea of monotonicity and on each mount select random
> > > > > prime P < 2^64 and instead of doing inc when advancing the change
> > > > > attribute, we'd advance it by P. That makes collisions after restore /
> > > > > crash fairly unlikely.
> > > > 
> > > > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > > > invalidations.
> > > > 
> > > > If we just increment it by a particular offset on every reboot, then
> > > > every time the server reboots, the clients will invalidate all of their
> > > > cached inodes, and proceed to hammer the server with READ calls just as
> > > > it's having to populate its own caches from disk.
> > > 
> > > Note that I didn't propose to increment by offset on every reboot or mount.
> > > I have proposed that inode_maybe_inc_iversion() would not increment
> > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> > > on filesystem mount.
> > > 
> > > This will not cause cache invalidation after a clean unmount + remount. It
> > > will cause cache invalidation after a crash, snapshot rollback etc., only for
> > > inodes where i_version changed. If P is suitably selected (e.g. as being a
> > > prime), then the chances of collisions (even after a snapshot rollback) are
> > > very low (on the order of 2^(-50) if my piece of envelope calculations are
> > > right).
> > > 
> > > So this should nicely deal with all the problems we've spotted so far. But
> > > I may be missing something...
> > 
> > 
> > Got it! That makes a lot more sense. Thinking about this some more...
> > 
> > What sort of range for P would be suitable?
> > 
> > Every increment would need to be by (shifted) P, so we can't choose too
> > large a number. Queries are pretty rare vs. writes though, so that
> > mitigates the issue somewhat.
> 
> Well, I agree that for large P the counter would wrap earlier. But is that
> a problem? Note that if P is a prime (indivisible by 2 is enough), then the
> counter would get to already used value still only after 2^63 steps. Thus if
> we give up monotonicity and just treat the counter as an opaque cookie, we
> do not have to care about wrapping.
> 
> Sure given different P is selected for each mount the wrapping argument
> does not hold 100% but here comes the advantage of primes - if you have two
> different primes P and Q, then a collision means that k*P mod 2^63 = l*Q
> mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the
> chances of early collision even after selecting a different prime on each
> mount are *very* low.
> 

I think we'll have to start avoiding 1 as a value for P if we do this,
but the rest makes sense.  I like this idea, Jan!
 
> So I think we should select from a relatively large set of primes so that
> the chance of randomly selecting the same prime (and thus reissuing the
> same change attr for different inode state sometime later) are small.
> 

Monotonicity allows you to discard "old" attr updates. For instance,
sometimes a NFS GETATTR response may be delayed for various reasons. If
the client sees a change attr that is provably older than one it has
already seen, it can discard the update. So, there is value in servers
advertising that property, and NFSv4.2 has a way to do that.

The Linux NFS client (at least) uses the same trick we do with jiffies
to handle wrapping for MONOTONIC values. We should be able to advertise
MONOTONIC as long as the client isn't comparing values that are more
than ~2^62 apart. 

Once we start talking about applications storing these values for
incremental backups, then the time between checks could be very long.

So, I think we don't want _too_ large a value for P. The big question is
how many individual change attr increments do we need to account for?

We have 64 bits total (it's an atomic64_t). We consume the lowest bit
for the QUERIED flag. That leaves us 63 bits of counter (currently).
When we increment by a larger value, we're effectively decreasing the
size of the counter.

Let's assume a worst case of one increment per microsecond, interleaved
by queries (so that they have to be real increments). 2^48 microseconds
is close to 9 years.

That leaves 15 bits for the P, which is primes from 3..32749. Is that a
large enough pool of prime numbers?

It looks like the kernel already has some infrastructure for handling
primes in lib/math/prime_numbers.c. We could just select a global P
value to use on every reboot, or just have filesystems set their own
(maybe in a new field in the superblock?)
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 0/9] fs: clean up handling of i_version counter
  2022-10-19 15:45     ` Darrick J. Wong
@ 2022-10-19 20:36       ` Jeff Layton
  2022-10-20  6:58         ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-10-19 20:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, tytso, adilger.kernel, david, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, bfields,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Wed, 2022-10-19 at 08:45 -0700, Darrick J. Wong wrote:
> On Wed, Oct 19, 2022 at 08:18:15AM -0400, Jeff Layton wrote:
> > On Wed, 2022-10-19 at 13:13 +0200, Christian Brauner wrote:
> > > On Mon, Oct 17, 2022 at 06:57:00AM -0400, Jeff Layton wrote:
> > > > This patchset is intended to clean up the handling of the i_version
> > > > counter by nfsd. Most of the changes are to internal interfaces.
> > > > 
> > > > This set is not intended to address crash resilience, or the fact that
> > > > the counter is bumped before a change and not after. I intend to tackle
> > > > those in follow-on patchsets.
> > > > 
> > > > My intention is to get this series included into linux-next soon, with
> > > > an eye toward merging most of it during the v6.2 merge window. The last
> > > > patch in the series is probably not suitable for merge as-is, at least
> > > > until we sort out the semantics we want to present to userland for it.
> > > 
> > > Over the course of the series I struggled a bit - and sorry for losing
> > > focus - with what i_version is supposed to represent for userspace. So I
> > > would support not exposing it to userspace before that. But that
> > > shouldn't affect your other changes iiuc.
> > 
> > Thanks Christian,
> > 
> > It has been a real struggle to nail this down, and yeah I too am not
> > planning to expose this to userland until we have this much better
> > defined. Patch #9 is just to give you an idea of what this would
> > ultimately look like. I intend to re-post the first 8 patches with an
> > eye toward merge in v6.2, once we've settled on the naming. On that
> > note...
> > 
> > I believe you had mentioned that you didn't like STATX_CHANGE_ATTR for
> > the name, and suggested STATX_I_VERSION (or something similar), which I
> > later shortened to STATX_VERSION.
> > 
> > Dave C. objected to STATX_VERSION, as "version" fields in a struct
> > usually refer to the version of the struct itself rather than the
> > version of the thing it describes. It also sort of implies a monotonic
> > counter, and I'm not ready to require that just yet.
> > 
> > What about STATX_CHANGE for the name (with corresponding names for the
> > field and other flags)? That drops the redundant "_ATTR" postfix, while
> > being sufficiently vague to allow for alternative implementations in the
> > future.
> > 
> > Do you (or anyone else) have other suggestions for a name?
> 
> Welllll it's really a u32 whose value doesn't have any intrinsic meaning
> other than "if (value_now != value_before) flush_cache();" right?
> I think it really only tracks changes to file data, right?
> 

It's a u64, but yeah, you're not supposed to assign any intrinsic
meaning to the value itself.

> STATX_CHANGE_COOKIE	(wait, does this cookie augment i_ctime?)
> 
> STATX_MOD_COOKIE	(...or just file modifications/i_mtime?)
> 
> STATX_MONITOR_COOKIE	(...what are we monitoring??)
> 
> STATX_MON_COOKIE
> 
> STATX_COOKIE_MON
> 
> STATX_COOKIE_MONSTER
> 
> There we go. ;)
> 
> In seriousness, I'd probably go with one of the first two.  I wouldn't
> be opposed to the last one, either, but others may disagree. ;)
> 
> --D
> 
> 

STATX_CHANGE_COOKIE is probably the best one. I'll plan to go with that
unless someone has a better idea. Thanks for the suggestions!

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 0/9] fs: clean up handling of i_version counter
  2022-10-19 20:36       ` Jeff Layton
@ 2022-10-20  6:58         ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2022-10-20  6:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Darrick J. Wong, tytso, adilger.kernel, david, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, bfields,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Wed, Oct 19, 2022 at 04:36:47PM -0400, Jeff Layton wrote:
> On Wed, 2022-10-19 at 08:45 -0700, Darrick J. Wong wrote:
> > On Wed, Oct 19, 2022 at 08:18:15AM -0400, Jeff Layton wrote:
> > > On Wed, 2022-10-19 at 13:13 +0200, Christian Brauner wrote:
> > > > On Mon, Oct 17, 2022 at 06:57:00AM -0400, Jeff Layton wrote:
> > > > > This patchset is intended to clean up the handling of the i_version
> > > > > counter by nfsd. Most of the changes are to internal interfaces.
> > > > > 
> > > > > This set is not intended to address crash resilience, or the fact that
> > > > > the counter is bumped before a change and not after. I intend to tackle
> > > > > those in follow-on patchsets.
> > > > > 
> > > > > My intention is to get this series included into linux-next soon, with
> > > > > an eye toward merging most of it during the v6.2 merge window. The last
> > > > > patch in the series is probably not suitable for merge as-is, at least
> > > > > until we sort out the semantics we want to present to userland for it.
> > > > 
> > > > Over the course of the series I struggled a bit - and sorry for losing
> > > > focus - with what i_version is supposed to represent for userspace. So I
> > > > would support not exposing it to userspace before that. But that
> > > > shouldn't affect your other changes iiuc.
> > > 
> > > Thanks Christian,
> > > 
> > > It has been a real struggle to nail this down, and yeah I too am not
> > > planning to expose this to userland until we have this much better
> > > defined. Patch #9 is just to give you an idea of what this would
> > > ultimately look like. I intend to re-post the first 8 patches with an
> > > eye toward merge in v6.2, once we've settled on the naming. On that
> > > note...
> > > 
> > > I believe you had mentioned that you didn't like STATX_CHANGE_ATTR for
> > > the name, and suggested STATX_I_VERSION (or something similar), which I
> > > later shortened to STATX_VERSION.
> > > 
> > > Dave C. objected to STATX_VERSION, as "version" fields in a struct
> > > usually refer to the version of the struct itself rather than the
> > > version of the thing it describes. It also sort of implies a monotonic
> > > counter, and I'm not ready to require that just yet.
> > > 
> > > What about STATX_CHANGE for the name (with corresponding names for the
> > > field and other flags)? That drops the redundant "_ATTR" postfix, while
> > > being sufficiently vague to allow for alternative implementations in the
> > > future.
> > > 
> > > Do you (or anyone else) have other suggestions for a name?
> > 
> > Welllll it's really a u32 whose value doesn't have any intrinsic meaning
> > other than "if (value_now != value_before) flush_cache();" right?
> > I think it really only tracks changes to file data, right?
> > 
> 
> It's a u64, but yeah, you're not supposed to assign any intrinsic
> meaning to the value itself.
> 
> > STATX_CHANGE_COOKIE	(wait, does this cookie augment i_ctime?)
> > 
> > STATX_MOD_COOKIE	(...or just file modifications/i_mtime?)
> > 
> > STATX_MONITOR_COOKIE	(...what are we monitoring??)
> > 
> > STATX_MON_COOKIE
> > 
> > STATX_COOKIE_MON
> > 
> > STATX_COOKIE_MONSTER
> > 
> > There we go. ;)
> > 
> > In seriousness, I'd probably go with one of the first two.  I wouldn't
> > be opposed to the last one, either, but others may disagree. ;)
> > 
> > --D
> > 
> > 
> 
> STATX_CHANGE_COOKIE is probably the best one. I'll plan to go with that
> unless someone has a better idea. Thanks for the suggestions!

Sounds fine to me.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-19 18:47                 ` Jeff Layton
@ 2022-10-20 10:39                   ` Jan Kara
  2022-10-21 10:08                     ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2022-10-20 10:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Dave Chinner, tytso, adilger.kernel, djwong, trondmy,
	neilb, viro, zohar, xiubli, chuck.lever, lczerner, bfields,
	brauner, fweimer, linux-btrfs, linux-fsdevel, linux-kernel,
	ceph-devel, linux-ext4, linux-nfs, linux-xfs

On Wed 19-10-22 14:47:48, Jeff Layton wrote:
> On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote:
> > On Tue 18-10-22 13:04:34, Jeff Layton wrote:
> > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > > > > for this is for backup applications, and for those a counter that could
> > > > > > > > > go backward is worse than useless.
> > > > > > > > 
> > > > > > > > From the perspective of a backup program doing incremental backups,
> > > > > > > > an inode with a change counter that has a different value to the
> > > > > > > > current backup inventory means the file contains different
> > > > > > > > information than what the current backup inventory holds. Again,
> > > > > > > > snapshots, rollbacks, etc.
> > > > > > > > 
> > > > > > > > Therefore, regardless of whether the change counter has gone
> > > > > > > > forwards or backwards, the backup program needs to back up this
> > > > > > > > current version of the file in this backup because it is different
> > > > > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > > > > up, it will not be creating an exact backup of the user's data at
> > > > > > > > the point in time the backup is run...
> > > > > > > > 
> > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > > > > programs - they really do have to be able to handle filesystems that
> > > > > > > > have modifications that move backwards in time as well as forwards...
> > > > > > > 
> > > > > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > > > > that after a crash, we can end up with a change attr seen before the
> > > > > > > crash that is now associated with a completely different inode state.
> > > > > > > 
> > > > > > > The scenario is something like:
> > > > > > > 
> > > > > > > - Change attr for an empty file starts at 1
> > > > > > > 
> > > > > > > - Write "A" to file, change attr goes to 2
> > > > > > > 
> > > > > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > > > > 
> > > > > > > - Crash (before last change is logged to disk)
> > > > > > > 
> > > > > > > - Machine reboots, inode is empty, change attr back to 1
> > > > > > > 
> > > > > > > - Write "B" to file, change attr goes to 2
> > > > > > > 
> > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > > > > correct when it isn't (should be "B" not "A" now).
> > > > > > > 
> > > > > > > The real danger comes not from the thing going backward, but the fact
> > > > > > > that it can march forward again after going backward, and then the
> > > > > > > client can see two different inode states associated with the same
> > > > > > > change attr value. Jumping all the change attributes forward by a
> > > > > > > significant amount after a crash should avoid this issue.
> > > > > > 
> > > > > > As Dave pointed out, the problem with change attr having the same value for
> > > > > > a different inode state (after going backwards) holds not only for the
> > > > > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > > > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > > > > common case but the edge cases are awkward and there's no easy way to
> > > > > > detect you are in the edge case.
> > > > > > 
> > > > > 
> > > > > This is true. In fact in the snapshot case you can't even rely on doing
> > > > > anything at reboot since you won't necessarily need to reboot to make it
> > > > > roll backward.
> > > > > 
> > > > > Whether that obviates the use of this value altogether, I'm not sure.
> > > > > 
> > > > > > So I think any implementation caring about data integrity would have to
> > > > > > include something like ctime into the picture anyway. Or we could just
> > > > > > completely give up any idea of monotonicity and on each mount select random
> > > > > > prime P < 2^64 and instead of doing inc when advancing the change
> > > > > > attribute, we'd advance it by P. That makes collisions after restore /
> > > > > > crash fairly unlikely.
> > > > > 
> > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > > > > invalidations.
> > > > > 
> > > > > If we just increment it by a particular offset on every reboot, then
> > > > > every time the server reboots, the clients will invalidate all of their
> > > > > cached inodes, and proceed to hammer the server with READ calls just as
> > > > > it's having to populate its own caches from disk.
> > > > 
> > > > Note that I didn't propose to increment by offset on every reboot or mount.
> > > > I have proposed that inode_maybe_inc_iversion() would not increment
> > > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> > > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> > > > on filesystem mount.
> > > > 
> > > > This will not cause cache invalidation after a clean unmount + remount. It
> > > > will cause cache invalidation after a crash, snapshot rollback etc., only for
> > > > inodes where i_version changed. If P is suitably selected (e.g. as being a
> > > > prime), then the chances of collisions (even after a snapshot rollback) are
> > > > very low (on the order of 2^(-50) if my piece of envelope calculations are
> > > > right).
> > > > 
> > > > So this should nicely deal with all the problems we've spotted so far. But
> > > > I may be missing something...
> > > 
> > > 
> > > Got it! That makes a lot more sense. Thinking about this some more...
> > > 
> > > What sort of range for P would be suitable?
> > > 
> > > Every increment would need to be by (shifted) P, so we can't choose too
> > > large a number. Queries are pretty rare vs. writes though, so that
> > > mitigates the issue somewhat.
> > 
> > Well, I agree that for large P the counter would wrap earlier. But is that
> > a problem? Note that if P is a prime (indivisible by 2 is enough), then the
> > counter would get to already used value still only after 2^63 steps. Thus if
> > we give up monotonicity and just treat the counter as an opaque cookie, we
> > do not have to care about wrapping.
> > 
> > Sure given different P is selected for each mount the wrapping argument
> > does not hold 100% but here comes the advantage of primes - if you have two
> > different primes P and Q, then a collision means that k*P mod 2^63 = l*Q
> > mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the
> > chances of early collision even after selecting a different prime on each
> > mount are *very* low.
> 
> I think we'll have to start avoiding 1 as a value for P if we do this,
> but the rest makes sense.  I like this idea, Jan!

Yes, 1 is kind of special so we should better avoid it in this scheme.
Especially if we're going to select only smaller primes.

> > So I think we should select from a relatively large set of primes so that
> > the chance of randomly selecting the same prime (and thus reissuing the
> > same change attr for different inode state sometime later) are small.
> > 
> 
> Monotonicity allows you to discard "old" attr updates. For instance,
> sometimes a NFS GETATTR response may be delayed for various reasons. If
> the client sees a change attr that is provably older than one it has
> already seen, it can discard the update. So, there is value in servers
> advertising that property, and NFSv4.2 has a way to do that.
> 
> The Linux NFS client (at least) uses the same trick we do with jiffies
> to handle wrapping for MONOTONIC values. We should be able to advertise
> MONOTONIC as long as the client isn't comparing values that are more
> than ~2^62 apart. 
> 
> Once we start talking about applications storing these values for
> incremental backups, then the time between checks could be very long.
> 
> So, I think we don't want _too_ large a value for P. The big question is
> how many individual change attr increments do we need to account for?
> 
> We have 64 bits total (it's an atomic64_t). We consume the lowest bit
> for the QUERIED flag. That leaves us 63 bits of counter (currently).
> When we increment by a larger value, we're effectively decreasing the
> size of the counter.

Yes, the larger value of P we take the sooner it will wrap which defeats
comparisons attempting to establish any ordering of change cookie values.

> Let's assume a worst case of one increment per microsecond, interleaved
> by queries (so that they have to be real increments). 2^48 microseconds
> is close to 9 years.
> 
> That leaves 15 bits for the P, which is primes from 3..32749. Is that a
> large enough pool of prime numbers?

Well, there are ~3000 primes in this range so that gives you a 1/3000
chance that after a crash, backup restore, snapshot rollback etc. you will
pick the same prime which results in collisions of change cookies and thus
possibility of data corruption. Is that low enough chance? The events I
mention above should be relatively rare but given the number of machines
running this code I would think the collision is bound to happen and the
consequences could be ... unpleasant. That's why I would prefer to pick
primes at least say upto 1m (there are ~78k of those). But that makes
wrapping more frequent (~100 days with 1us update period). Probably still
usable for NFS but not really for backup purposes. So I'm not sure we
should be advertising the values have any ordering.

If the last used value would be persisted (e.g. in the filesystem's
superblock), we could easily make sure the next selected P is different so
in that case we could get away with a smaller set of primes but it would
require filesystem on-disk format changes which has its own drawbacks. But
that would be at least some path forward for providing change cookies that
can be ordered on larger timescales.

> It looks like the kernel already has some infrastructure for handling
> primes in lib/math/prime_numbers.c. We could just select a global P
> value to use on every reboot, or just have filesystems set their own
> (maybe in a new field in the superblock?)

IMO P needs to be selected on each mount to reliably solve the "restore
from backup" and "snapshot rollback" scenarios. I agree it can be a new
field in the VFS part of the superblock so that it is accessible by the
iversion handling code.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
  2022-10-20 10:39                   ` Jan Kara
@ 2022-10-21 10:08                     ` Jeff Layton
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2022-10-21 10:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, tytso, adilger.kernel, djwong, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Thu, 2022-10-20 at 12:39 +0200, Jan Kara wrote:
> On Wed 19-10-22 14:47:48, Jeff Layton wrote:
> > On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 13:04:34, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> > > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > > > > > report STATX_VERSION at all.  His rationale is that one of the main uses
> > > > > > > > > > for this is for backup applications, and for those a counter that could
> > > > > > > > > > go backward is worse than useless.
> > > > > > > > > 
> > > > > > > > > From the perspective of a backup program doing incremental backups,
> > > > > > > > > an inode with a change counter that has a different value to the
> > > > > > > > > current backup inventory means the file contains different
> > > > > > > > > information than what the current backup inventory holds. Again,
> > > > > > > > > snapshots, rollbacks, etc.
> > > > > > > > > 
> > > > > > > > > Therefore, regardless of whether the change counter has gone
> > > > > > > > > forwards or backwards, the backup program needs to back up this
> > > > > > > > > current version of the file in this backup because it is different
> > > > > > > > > to the inventory copy.  Hence if the backup program fails to back it
> > > > > > > > > up, it will not be creating an exact backup of the user's data at
> > > > > > > > > the point in time the backup is run...
> > > > > > > > > 
> > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > > > > > programs - they really do have to be able to handle filesystems that
> > > > > > > > > have modifications that move backwards in time as well as forwards...
> > > > > > > > 
> > > > > > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > > > > > that after a crash, we can end up with a change attr seen before the
> > > > > > > > crash that is now associated with a completely different inode state.
> > > > > > > > 
> > > > > > > > The scenario is something like:
> > > > > > > > 
> > > > > > > > - Change attr for an empty file starts at 1
> > > > > > > > 
> > > > > > > > - Write "A" to file, change attr goes to 2
> > > > > > > > 
> > > > > > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > > > > > 
> > > > > > > > - Crash (before last change is logged to disk)
> > > > > > > > 
> > > > > > > > - Machine reboots, inode is empty, change attr back to 1
> > > > > > > > 
> > > > > > > > - Write "B" to file, change attr goes to 2
> > > > > > > > 
> > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > > > > > correct when it isn't (should be "B" not "A" now).
> > > > > > > > 
> > > > > > > > The real danger comes not from the thing going backward, but the fact
> > > > > > > > that it can march forward again after going backward, and then the
> > > > > > > > client can see two different inode states associated with the same
> > > > > > > > change attr value. Jumping all the change attributes forward by a
> > > > > > > > significant amount after a crash should avoid this issue.
> > > > > > > 
> > > > > > > As Dave pointed out, the problem with change attr having the same value for
> > > > > > > a different inode state (after going backwards) holds not only for the
> > > > > > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > > > > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > > > > > common case but the edge cases are awkward and there's no easy way to
> > > > > > > detect you are in the edge case.
> > > > > > > 
> > > > > > 
> > > > > > This is true. In fact in the snapshot case you can't even rely on doing
> > > > > > anything at reboot since you won't necessarily need to reboot to make it
> > > > > > roll backward.
> > > > > > 
> > > > > > Whether that obviates the use of this value altogether, I'm not sure.
> > > > > > 
> > > > > > > So I think any implementation caring about data integrity would have to
> > > > > > > include something like ctime into the picture anyway. Or we could just
> > > > > > > completely give up any idea of monotonicity and on each mount select random
> > > > > > > prime P < 2^64 and instead of doing inc when advancing the change
> > > > > > > attribute, we'd advance it by P. That makes collisions after restore /
> > > > > > > crash fairly unlikely.
> > > > > > 
> > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > > > > > invalidations.
> > > > > > 
> > > > > > If we just increment it by a particular offset on every reboot, then
> > > > > > every time the server reboots, the clients will invalidate all of their
> > > > > > cached inodes, and proceed to hammer the server with READ calls just as
> > > > > > it's having to populate its own caches from disk.
> > > > > 
> > > > > Note that I didn't propose to increment by offset on every reboot or mount.
> > > > > I have proposed that inode_maybe_inc_iversion() would not increment
> > > > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> > > > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> > > > > on filesystem mount.
> > > > > 
> > > > > This will not cause cache invalidation after a clean unmount + remount. It
> > > > > will cause cache invalidation after a crash, snapshot rollback etc., only for
> > > > > inodes where i_version changed. If P is suitably selected (e.g. as being a
> > > > > prime), then the chances of collisions (even after a snapshot rollback) are
> > > > > very low (on the order of 2^(-50) if my piece of envelope calculations are
> > > > > right).
> > > > > 
> > > > > So this should nicely deal with all the problems we've spotted so far. But
> > > > > I may be missing something...
> > > > 
> > > > 
> > > > Got it! That makes a lot more sense. Thinking about this some more...
> > > > 
> > > > What sort of range for P would be suitable?
> > > > 
> > > > Every increment would need to be by (shifted) P, so we can't choose too
> > > > large a number. Queries are pretty rare vs. writes though, so that
> > > > mitigates the issue somewhat.
> > > 
> > > Well, I agree that for large P the counter would wrap earlier. But is that
> > > a problem? Note that if P is a prime (indivisible by 2 is enough), then the
> > > counter would get to already used value still only after 2^63 steps. Thus if
> > > we give up monotonicity and just treat the counter as an opaque cookie, we
> > > do not have to care about wrapping.
> > > 
> > > Sure given different P is selected for each mount the wrapping argument
> > > does not hold 100% but here comes the advantage of primes - if you have two
> > > different primes P and Q, then a collision means that k*P mod 2^63 = l*Q
> > > mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the
> > > chances of early collision even after selecting a different prime on each
> > > mount are *very* low.
> > 
> > I think we'll have to start avoiding 1 as a value for P if we do this,
> > but the rest makes sense.  I like this idea, Jan!
> 
> Yes, 1 is kind of special so we should better avoid it in this scheme.
> Especially if we're going to select only smaller primes.
> 
> > > So I think we should select from a relatively large set of primes so that
> > > the chance of randomly selecting the same prime (and thus reissuing the
> > > same change attr for different inode state sometime later) are small.
> > > 
> > 
> > Monotonicity allows you to discard "old" attr updates. For instance,
> > sometimes a NFS GETATTR response may be delayed for various reasons. If
> > the client sees a change attr that is provably older than one it has
> > already seen, it can discard the update. So, there is value in servers
> > advertising that property, and NFSv4.2 has a way to do that.
> > 
> > The Linux NFS client (at least) uses the same trick we do with jiffies
> > to handle wrapping for MONOTONIC values. We should be able to advertise
> > MONOTONIC as long as the client isn't comparing values that are more
> > than ~2^62 apart. 
> > 
> > Once we start talking about applications storing these values for
> > incremental backups, then the time between checks could be very long.
> > 
> > So, I think we don't want _too_ large a value for P. The big question is
> > how many individual change attr increments do we need to account for?
> > 
> > We have 64 bits total (it's an atomic64_t). We consume the lowest bit
> > for the QUERIED flag. That leaves us 63 bits of counter (currently).
> > When we increment by a larger value, we're effectively decreasing the
> > size of the counter.
> 
> Yes, the larger value of P we take the sooner it will wrap which defeats
> comparisons attempting to establish any ordering of change cookie values.
> 
> > Let's assume a worst case of one increment per microsecond, interleaved
> > by queries (so that they have to be real increments). 2^48 microseconds
> > is close to 9 years.
> > 
> > That leaves 15 bits for the P, which is primes from 3..32749. Is that a
> > large enough pool of prime numbers?
> 
> Well, there are ~3000 primes in this range so that gives you a 1/3000
> chance that after a crash, backup restore, snapshot rollback etc. you will
> pick the same prime which results in collisions of change cookies and thus
> possibility of data corruption. Is that low enough chance? The events I
> mention above should be relatively rare but given the number of machines
> running this code I would think the collision is bound to happen and the
> consequences could be ... unpleasant. That's why I would prefer to pick
> primes at least say upto 1m (there are ~78k of those). But that makes
> wrapping more frequent (~100 days with 1us update period). Probably still
> usable for NFS but not really for backup purposes. So I'm not sure we
> should be advertising the values have any ordering.
> 

Ok. I'll aim for using values between 3 and 1M and see how that looks.
We should be able to tune this to some degree as well.

> If the last used value would be persisted (e.g. in the filesystem's
> superblock), we could easily make sure the next selected P is different so
> in that case we could get away with a smaller set of primes but it would
> require filesystem on-disk format changes which has its own drawbacks. But
> that would be at least some path forward for providing change cookies that
> can be ordered on larger timescales.
> 

Persisting just the last one might not be sufficient. If the machine
crashes several times then you could still end up re-using the same P
value. i_version is only incremented on changes, so if you're unlucky
and it's only incremented again when the duplicate value of P comes up,
you're back to the same problem. We might want to keep a record of the
last several P values?

OTOH, there is always going to be _some_ way to defeat this. At some
point we just have to decide that a scenario is unlikely enough that we
can ignore it.
 
> > It looks like the kernel already has some infrastructure for handling
> > primes in lib/math/prime_numbers.c. We could just select a global P
> > value to use on every reboot, or just have filesystems set their own
> > (maybe in a new field in the superblock?)
> 
> IMO P needs to be selected on each mount to reliably solve the "restore
> from backup" and "snapshot rollback" scenarios. I agree it can be a new
> field in the VFS part of the superblock so that it is accessible by the
> iversion handling code.
> 

Sounds good. FWIW, I think I'm going to have to approach this in at
least three patchsets:

1) clean up the presentation of the value, and plumb it through struct
kstat (aiming for v6.2 for this part).

2) start incrementing the value after a write in addition to, or instead
of before a write. (I have patches for tmpfs, ext4 and btrfs -- maybe
v6.3?)

3) change the increment to use a prime number we select at mount time to
ward off rollback issues. (still scoping out this part)

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c
  2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
@ 2022-11-03 15:42   ` Chuck Lever III
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever III @ 2022-11-03 15:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Ts'o, adilger.kernel, Darrick J. Wong, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, zohar, xiubli,
	Lukas Czerner, Jan Kara, Bruce Fields, Christian Brauner,
	fweimer, linux-btrfs, linux-fsdevel, LKML, ceph-devel,
	linux-ext4, Linux NFS Mailing List, linux-xfs



> On Oct 17, 2022, at 6:57 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> This is a pretty big function for inlining. Move it to being
> non-inlined.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Acked-by: Chuck Lever <chuck.lever@oracle.com>


> ---
> fs/nfsd/nfsfh.c | 27 +++++++++++++++++++++++++++
> fs/nfsd/nfsfh.h | 29 +----------------------------
> 2 files changed, 28 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index d73434200df9..7030d9209903 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -748,3 +748,30 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> 		return FSIDSOURCE_UUID;
> 	return FSIDSOURCE_DEV;
> }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However,
> + * i_version can go backwards after a reboot.  On its own that doesn't
> + * necessarily cause a problem, but if i_version goes backwards and then
> + * is incremented again it could reuse a value that was previously used
> + * before boot, and a client who queried the two values might
> + * incorrectly assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as
> + * long as time doesn't go backwards we never reuse an old value.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> +{
> +	if (inode->i_sb->s_export_op->fetch_iversion)
> +		return inode->i_sb->s_export_op->fetch_iversion(inode);
> +	else if (IS_I_VERSION(inode)) {
> +		u64 chattr;
> +
> +		chattr =  stat->ctime.tv_sec;
> +		chattr <<= 30;
> +		chattr += stat->ctime.tv_nsec;
> +		chattr += inode_query_iversion(inode);
> +		return chattr;
> +	} else
> +		return time_to_chattr(&stat->ctime);
> +}
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index c3ae6414fc5c..4c223a7a91d4 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
> 	fhp->fh_pre_saved = false;
> }
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	if (inode->i_sb->s_export_op->fetch_iversion)
> -		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> -		return time_to_chattr(&stat->ctime);
> -}
> -
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
> extern void fh_fill_pre_attrs(struct svc_fh *fhp);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
> extern void fh_fill_both_attrs(struct svc_fh *fhp);
> -- 
> 2.37.3
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version
  2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
@ 2022-11-03 15:42   ` Chuck Lever III
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever III @ 2022-11-03 15:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Ts'o, adilger.kernel, Darrick J. Wong, david,
	trondmy, Neil Brown, Al Viro, zohar, xiubli, lczerner, jack,
	Bruce Fields, brauner, fweimer, linux-btrfs, linux-fsdevel,
	linux-kernel, ceph-devel, linux-ext4, Linux NFS Mailing List,
	linux-xfs



> On Oct 17, 2022, at 6:57 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> Now that we can call into vfs_getattr to get the i_version field, use
> that facility to fetch it instead of doing it in nfsd4_change_attribute.
> 
> Neil also pointed out recently that IS_I_VERSION directory operations
> are always logged, and so we only need to mitigate the rollback problem
> on regular files. Also, we don't need to factor in the ctime when
> reexporting NFS or Ceph.
> 
> Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> with a v4 request. Then, instead of looking at IS_I_VERSION when
> generating the change attr, look at the result mask and only use it if
> STATX_VERSION is set.
> 
> Change nfsd4_change_attribute to only factor in the ctime if it's a regular
> file and the fs doesn't advertise STATX_ATTR_VERSION_MONOTONIC.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Acked-by: Chuck Lever <chuck.lever@oracle.com>


> ---
> fs/nfsd/nfs4xdr.c |  4 +++-
> fs/nfsd/nfsfh.c   | 53 +++++++++++++++++++++++++++++++----------------
> fs/nfsd/vfs.h     |  7 ++++++-
> 3 files changed, 44 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index bcfeb1a922c0..c19b6b00b620 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2906,7 +2906,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> 			goto out;
> 	}
> 
> -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> +	err = vfs_getattr(&path, &stat,
> +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> +			  AT_STATX_SYNC_AS_STAT);
> 	if (err)
> 		goto out_nfserr;
> 	if (!(stat.result_mask & STATX_BTIME))
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 7030d9209903..21b64ac97a06 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -628,6 +628,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> 		stat.mtime = inode->i_mtime;
> 		stat.ctime = inode->i_ctime;
> 		stat.size  = inode->i_size;
> +		if (v4 && IS_I_VERSION(inode)) {
> +			stat.version = inode_query_iversion(inode);
> +			stat.result_mask |= STATX_VERSION;
> +		}
> 	}
> 	if (v4)
> 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> @@ -659,6 +663,10 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> 	if (err) {
> 		fhp->fh_post_saved = false;
> 		fhp->fh_post_attr.ctime = inode->i_ctime;
> +		if (v4 && IS_I_VERSION(inode)) {
> +			fhp->fh_post_attr.version = inode_query_iversion(inode);
> +			fhp->fh_post_attr.result_mask |= STATX_VERSION;
> +		}
> 	} else
> 		fhp->fh_post_saved = true;
> 	if (v4)
> @@ -750,28 +758,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> }
> 
> /*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> + * We could use i_version alone as the change attribute.  However, i_version
> + * can go backwards on a regular file after an unclean shutdown.  On its own
> + * that doesn't necessarily cause a problem, but if i_version goes backwards
> + * and then is incremented again it could reuse a value that was previously
> + * used before boot, and a client who queried the two values might incorrectly
> + * assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as long as
> + * time doesn't go backwards we never reuse an old value. If the filesystem
> + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
>  *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> + * We only need to do this for regular files as well. For directories, we
> + * assume that the new change attr is always logged to stable storage in some
> + * fashion before the results can be seen.
>  */
> u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> {
> +	u64 chattr;
> +
> 	if (inode->i_sb->s_export_op->fetch_iversion)
> 		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> -		return time_to_chattr(&stat->ctime);
> +	if (stat->result_mask & STATX_VERSION) {
> +		chattr = stat->version;
> +
> +		if (S_ISREG(inode->i_mode) &&
> +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> +			chattr += (u64)stat->ctime.tv_sec << 30;
> +			chattr += stat->ctime.tv_nsec;
> +		}
> +	} else {
> +		chattr = time_to_chattr(&stat->ctime);
> +	}
> +	return chattr;
> }
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index 120521bc7b24..c98e13ec37b2 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
> 
> static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> {
> +	u32 request_mask = STATX_BASIC_STATS;
> 	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> 			 .dentry = fh->fh_dentry};
> -	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
> +
> +	if (fh->fh_maxsize == NFS4_FHSIZE)
> +		request_mask |= (STATX_BTIME | STATX_VERSION);
> +
> +	return nfserrno(vfs_getattr(&p, stat, request_mask,
> 				    AT_STATX_SYNC_AS_STAT));
> }
> 
> -- 
> 2.37.3
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 8/9] nfsd: remove fetch_iversion export operation
  2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
@ 2022-11-03 15:43   ` Chuck Lever III
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever III @ 2022-11-03 15:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Ts'o, adilger.kernel, Darrick J. Wong, david,
	trondmy, Neil Brown, Al Viro, zohar, xiubli, lczerner, jack,
	Bruce Fields, brauner, fweimer, linux-btrfs, linux-fsdevel,
	linux-kernel, ceph-devel, linux-ext4, Linux NFS Mailing List,
	linux-xfs



> On Oct 17, 2022, at 6:57 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> Now that the i_version counter is reported in struct kstat, there is no
> need for this export operation.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Acked-by: Chuck Lever <chuck.lever@oracle.com>


> ---
> fs/nfs/export.c          | 7 -------
> fs/nfsd/nfsfh.c          | 2 --
> include/linux/exportfs.h | 1 -
> 3 files changed, 10 deletions(-)
> 
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 01596f2d0a1e..1a9d5aa51dfb 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> 	return parent;
> }
> 
> -static u64 nfs_fetch_iversion(struct inode *inode)
> -{
> -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> -	return inode_peek_iversion_raw(inode);
> -}
> -
> const struct export_operations nfs_export_ops = {
> 	.encode_fh = nfs_encode_fh,
> 	.fh_to_dentry = nfs_fh_to_dentry,
> 	.get_parent = nfs_get_parent,
> -	.fetch_iversion = nfs_fetch_iversion,
> 	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> 		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
> 		EXPORT_OP_NOATOMIC_ATTR,
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 21b64ac97a06..9c1f697ffc72 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -777,8 +777,6 @@ u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> {
> 	u64 chattr;
> 
> -	if (inode->i_sb->s_export_op->fetch_iversion)
> -		return inode->i_sb->s_export_op->fetch_iversion(inode);
> 	if (stat->result_mask & STATX_VERSION) {
> 		chattr = stat->version;
> 
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index fe848901fcc3..9f4d4bcbf251 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,7 +213,6 @@ struct export_operations {
> 			  bool write, u32 *device_generation);
> 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> 			     int nr_iomaps, struct iattr *iattr);
> -	u64 (*fetch_iversion)(struct inode *);
> #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
> #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
> #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
> -- 
> 2.37.3
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-11-03 15:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
2022-10-17 10:57 ` [PATCH v7 2/9] fs: clarify when the i_version counter must be updated Jeff Layton
2022-10-17 10:57 ` [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
2022-10-17 10:57 ` [PATCH v7 4/9] nfs: report the inode version in getattr if requested Jeff Layton
2022-10-17 10:57 ` [PATCH v7 5/9] ceph: " Jeff Layton
2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
2022-11-03 15:42   ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
2022-11-03 15:42   ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
2022-11-03 15:43   ` Chuck Lever III
2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
2022-10-17 22:14   ` Dave Chinner
2022-10-18 10:35     ` Jeff Layton
2022-10-18 13:49       ` Jan Kara
2022-10-18 14:21         ` Jeff Layton
2022-10-18 15:17           ` Jan Kara
2022-10-18 17:04             ` Jeff Layton
2022-10-19 17:23               ` Jan Kara
2022-10-19 18:47                 ` Jeff Layton
2022-10-20 10:39                   ` Jan Kara
2022-10-21 10:08                     ` Jeff Layton
2022-10-18 14:56       ` Jeff Layton
2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
2022-10-19 12:18   ` Jeff Layton
2022-10-19 15:45     ` Darrick J. Wong
2022-10-19 20:36       ` Jeff Layton
2022-10-20  6:58         ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.