All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling
@ 2023-01-24 19:30 Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

I had inteded to send a PR for this for v6.2, but I got sidetracked
with different issues, and didn't get it together in time. This set has
been sitting in linux-next since October, and it seems to be behaving,
so I intend to send a PR when the v6.3 merge window opens.

Though nothing has really changed since last year, I'm resending now
in the hopes I can collect a few more Reviewed-bys (ones from Al, Trond
and Anna would be particularly welcome).

The main consumer of i_version field (knfsd) has to jump through a
number of hoops to fetch it, depending on what sort of inode it is.
Rather than do this, we want to offload the responsibility for
presenting this field to the filesystem's ->getattr operation, which is
a more natural way to deal with a field that may be implemented
differently.

The focus of this patchset is to clean up these internal interfaces.
This should also make it simple to present this attribute to userland in
the future, which should be possible once the semantics are a bit more
consistent across different backing filesystems.

Thanks!

Jeff Layton (8):
  fs: uninline inode_query_iversion
  fs: clarify when the i_version counter must be updated
  vfs: plumb i_version handling into struct kstat
  nfs: report the inode version in getattr if requested
  ceph: report the inode version in getattr if requested
  nfsd: move nfsd4_change_attribute to nfsfh.c
  nfsd: use the getattr operation to fetch i_version
  nfsd: remove fetch_iversion export operation

 fs/ceph/inode.c          | 16 +++++++----
 fs/libfs.c               | 36 ++++++++++++++++++++++++
 fs/nfs/export.c          |  7 -----
 fs/nfs/inode.c           | 16 ++++++++---
 fs/nfsd/nfs4xdr.c        |  4 ++-
 fs/nfsd/nfsfh.c          | 42 ++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h          | 29 +-------------------
 fs/nfsd/vfs.h            |  7 ++++-
 fs/stat.c                | 17 ++++++++++--
 include/linux/exportfs.h |  1 -
 include/linux/iversion.h | 59 ++++++++++++++--------------------------
 include/linux/stat.h     |  9 ++++++
 12 files changed, 156 insertions(+), 87 deletions(-)

-- 
2.39.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-25 13:44   ` Jan Kara
  2023-01-26  9:02   ` Christian Brauner
  2023-01-24 19:30 ` [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated Jeff Layton
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/iversion.h | 38 ++------------------------------------
 2 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index aada4e7c8713..17ecc47696e1 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1582,3 +1582,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
 	return true;
 }
 EXPORT_SYMBOL(inode_maybe_inc_iversion);
+
+/**
+ * inode_query_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison. This will guarantee
+ * that a later query of the i_version will result in a different value if
+ * anything has changed.
+ *
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
+ */
+u64 inode_query_iversion(struct inode *inode)
+{
+	u64 cur, new;
+
+	cur = inode_peek_iversion_raw(inode);
+	do {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED) {
+			/*
+			 * This barrier (and the implicit barrier in the
+			 * cmpxchg below) pairs with the barrier in
+			 * inode_maybe_inc_iversion().
+			 */
+			smp_mb();
+			break;
+		}
+
+		new = cur | I_VERSION_QUERIED;
+	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
+	return cur >> I_VERSION_QUERIED_SHIFT;
+}
+EXPORT_SYMBOL(inode_query_iversion);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e27bd4f55d84..6755d8b4f20b 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
 	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
-/**
- * inode_query_iversion - read i_version for later use
- * @inode: inode from which i_version should be read
- *
- * Read the inode i_version counter. This should be used by callers that wish
- * to store the returned i_version for later comparison. This will guarantee
- * that a later query of the i_version will result in a different value if
- * anything has changed.
- *
- * In this implementation, we fetch the current value, set the QUERIED flag and
- * then try to swap it into place with a cmpxchg, if it wasn't already set. If
- * that fails, we try again with the newly fetched value from the cmpxchg.
- */
-static inline u64
-inode_query_iversion(struct inode *inode)
-{
-	u64 cur, new;
-
-	cur = inode_peek_iversion_raw(inode);
-	do {
-		/* If flag is already set, then no need to swap */
-		if (cur & I_VERSION_QUERIED) {
-			/*
-			 * This barrier (and the implicit barrier in the
-			 * cmpxchg below) pairs with the barrier in
-			 * inode_maybe_inc_iversion().
-			 */
-			smp_mb();
-			break;
-		}
-
-		new = cur | I_VERSION_QUERIED;
-	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
-	return cur >> I_VERSION_QUERIED_SHIFT;
-}
-
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime:
@@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
 	return chattr;
 }
 
+u64 inode_query_iversion(struct inode *inode);
+
 /**
  * inode_eq_iversion_raw - check whether the raw i_version counter has changed
  * @inode: inode to check
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-25 16:06   ` Jan Kara
  2023-01-26  9:01   ` Christian Brauner
  2023-01-24 19:30 ` [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat Jeff Layton
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

The i_version field in the kernel has had different semantics over
the decades, but NFSv4 has certain expectations. Update the comments
in iversion.h to describe when the i_version must change.

Cc: Colin Walters <walters@verbum.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 6755d8b4f20b..fced8115a5f4 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,25 @@
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear larger to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
+ *
+ * An explicit change is one that would ordinarily result in a change to the
+ * inode status change time (aka ctime). i_version must appear to change, even
+ * if the ctime does not (since the whole point is to avoid missing updates due
+ * to timestamp granularity). If POSIX or other relevant spec mandates that the
+ * ctime must change due to an operation, then the i_version counter must be
+ * incremented as well.
+ *
+ * Making the i_version update completely atomic with the operation itself would
+ * be prohibitively expensive. Traditionally the kernel has updated the times on
+ * directories after an operation that changes its contents. For regular files,
+ * the ctime is usually updated before the data is copied into the cache for a
+ * write. This means that there is a window of time when an observer can
+ * associate a new timestamp with old file contents. Since the purpose of the
+ * i_version is to allow for better cache coherency, the i_version must always
+ * be updated after the results of the operation are visible. Updating it before
+ * and after a change is also permitted.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-25 15:50   ` Christian Brauner
  2023-01-25 16:20   ` Jan Kara
  2023-01-24 19:30 ` [PATCH v8 RESEND 4/8] nfs: report the inode version in getattr if requested Jeff Layton
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

The NFS server has a lot of special handling for different types of
change attribute access, depending on the underlying filesystem. In
most cases, it's doing a getattr anyway and then fetching that value
after the fact.

Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a
kernel-only symbol (for now). If requested and getattr can implement it,
it can fill out this field. For IS_I_VERSION inodes, add a generic
implementation in vfs_getattr_nosec. Take care to mask
STATX_CHANGE_COOKIE off in requests from userland and in the result
mask.

Since not all filesystems can give the same guarantees of monotonicity,
claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to
indicate that they offer an i_version value that can never go backward.

Eventually if we decide to make the i_version available to userland, we
can just designate a field for it in struct statx, and move the
STATX_CHANGE_COOKIE definition to the uapi header.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c            | 17 +++++++++++++++--
 include/linux/stat.h |  9 +++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index d6cc74ca8486..f43afe0081fe 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -18,6 +18,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/compat.h>
+#include <linux/iversion.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -122,6 +123,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
 				  STATX_ATTR_DAX);
 
+	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
+		stat->result_mask |= STATX_CHANGE_COOKIE;
+		stat->change_cookie = inode_query_iversion(inode);
+	}
+
 	mnt_userns = mnt_user_ns(path->mnt);
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(mnt_userns, path, stat,
@@ -602,9 +608,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 
 	memset(&tmp, 0, sizeof(tmp));
 
-	tmp.stx_mask = stat->result_mask;
+	/* STATX_CHANGE_COOKIE is kernel-only for now */
+	tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;
 	tmp.stx_blksize = stat->blksize;
-	tmp.stx_attributes = stat->attributes;
+	/* STATX_ATTR_CHANGE_MONOTONIC is kernel-only for now */
+	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_CHANGE_MONOTONIC;
 	tmp.stx_nlink = stat->nlink;
 	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
 	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -643,6 +651,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
 	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
 		return -EINVAL;
 
+	/* STATX_CHANGE_COOKIE is kernel-only for now. Ignore requests
+	 * from userland.
+	 */
+	mask &= ~STATX_CHANGE_COOKIE;
+
 	error = vfs_statx(dfd, filename, flags, &stat, mask);
 	if (error)
 		return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index ff277ced50e9..52150570d37a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -52,6 +52,15 @@ struct kstat {
 	u64		mnt_id;
 	u32		dio_mem_align;
 	u32		dio_offset_align;
+	u64		change_cookie;
 };
 
+/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
+
+/* mask values */
+#define STATX_CHANGE_COOKIE		0x40000000U	/* Want/got stx_change_attr */
+
+/* file attribute values */
+#define STATX_ATTR_CHANGE_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
+
 #endif
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 4/8] nfs: report the inode version in getattr if requested
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
                   ` (2 preceding siblings ...)
  2023-01-24 19:30 ` [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 5/8] ceph: " Jeff Layton
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Allow NFS to report the i_version in getattr requests. Since the cost to
fetch it is relatively cheap, do it unconditionally and just set the
flag if it looks like it's valid. Also, conditionally enable the
MONOTONIC flag when the server reports its change attr type as such.

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/inode.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e98ee7599eeb..756ad4fa6f2a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -825,6 +825,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
 		reply_mask |= STATX_UID | STATX_GID;
 	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
 		reply_mask |= STATX_BLOCKS;
+	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
+		reply_mask |= STATX_CHANGE_COOKIE;
 	return reply_mask;
 }
 
@@ -843,7 +845,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
 			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
-			STATX_INO | STATX_SIZE | STATX_BLOCKS;
+			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_BTIME |
+			STATX_CHANGE_COOKIE;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
 		if (readdirplus_enabled)
@@ -851,8 +854,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		goto out_no_revalidate;
 	}
 
-	/* Flush out writes to the server in order to update c/mtime.  */
-	if ((request_mask & (STATX_CTIME | STATX_MTIME)) &&
+	/* Flush out writes to the server in order to update c/mtime/version.  */
+	if ((request_mask & (STATX_CTIME | STATX_MTIME | STATX_CHANGE_COOKIE)) &&
 	    S_ISREG(inode->i_mode))
 		filemap_write_and_wait(inode->i_mapping);
 
@@ -872,7 +875,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	/* Is the user requesting attributes that might need revalidation? */
 	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
 					STATX_MTIME|STATX_UID|STATX_GID|
-					STATX_SIZE|STATX_BLOCKS)))
+					STATX_SIZE|STATX_BLOCKS|
+					STATX_CHANGE_COOKIE)))
 		goto out_no_revalidate;
 
 	/* Check whether the cached attributes are stale */
@@ -910,6 +914,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+	stat->change_cookie = inode_peek_iversion_raw(inode);
+	stat->attributes_mask |= STATX_ATTR_CHANGE_MONOTONIC;
+	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
+		stat->attributes |= STATX_ATTR_CHANGE_MONOTONIC;
 	if (S_ISDIR(inode->i_mode))
 		stat->blksize = NFS_SERVER(inode)->dtsize;
 out:
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 5/8] ceph: report the inode version in getattr if requested
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
                   ` (3 preceding siblings ...)
  2023-01-24 19:30 ` [PATCH v8 RESEND 4/8] nfs: report the inode version in getattr if requested Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 6/8] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

When getattr requests the STX_CHANGE_COOKIE, request the full gamut of
caps (similarly to how ctime is handled). When the change attribute
seems to be valid, return it in the change_cookie field and set the flag
in the reply mask. Also, unconditionally enable
STATX_ATTR_CHANGE_MONOTONIC.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 23d05ec87fcc..e22e631cb115 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2417,10 +2417,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 {
 	int mask = 0;
 
-	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME))
+	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME|STATX_CHANGE_COOKIE))
 		mask |= CEPH_CAP_AUTH_SHARED;
 
-	if (want & (STATX_NLINK|STATX_CTIME)) {
+	if (want & (STATX_NLINK|STATX_CTIME|STATX_CHANGE_COOKIE)) {
 		/*
 		 * The link count for directories depends on inode->i_subdirs,
 		 * and that is only updated when Fs caps are held.
@@ -2431,11 +2431,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 			mask |= CEPH_CAP_LINK_SHARED;
 	}
 
-	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
-		    STATX_BLOCKS))
+	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|STATX_BLOCKS|STATX_CHANGE_COOKIE))
 		mask |= CEPH_CAP_FILE_SHARED;
 
-	if (want & (STATX_CTIME))
+	if (want & (STATX_CTIME|STATX_CHANGE_COOKIE))
 		mask |= CEPH_CAP_XATTR_SHARED;
 
 	return mask;
@@ -2478,6 +2477,11 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		valid_mask |= STATX_BTIME;
 	}
 
+	if (request_mask & STATX_CHANGE_COOKIE) {
+		stat->change_cookie = inode_peek_iversion_raw(inode);
+		valid_mask |= STATX_CHANGE_COOKIE;
+	}
+
 	if (ceph_snap(inode) == CEPH_NOSNAP)
 		stat->dev = sb->s_dev;
 	else
@@ -2519,6 +2523,8 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 			stat->nlink = 1 + 1 + ci->i_subdirs;
 	}
 
+	stat->attributes_mask |= STATX_ATTR_CHANGE_MONOTONIC;
+	stat->attributes |= STATX_ATTR_CHANGE_MONOTONIC;
 	stat->result_mask = request_mask & valid_mask;
 	return err;
 }
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 6/8] nfsd: move nfsd4_change_attribute to nfsfh.c
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
                   ` (4 preceding siblings ...)
  2023-01-24 19:30 ` [PATCH v8 RESEND 5/8] ceph: " Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 7/8] nfsd: use the getattr operation to fetch i_version Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 8/8] nfsd: remove fetch_iversion export operation Jeff Layton
  7 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

This is a pretty big function for inlining. Move it to being
non-inlined.

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfsfh.c | 27 +++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h | 29 +----------------------------
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 8c52b6c9d31a..ac89e25e7733 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -748,3 +748,30 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
 		return FSIDSOURCE_UUID;
 	return FSIDSOURCE_DEV;
 }
+
+/*
+ * We could use i_version alone as the change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ */
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
+{
+	if (inode->i_sb->s_export_op->fetch_iversion)
+		return inode->i_sb->s_export_op->fetch_iversion(inode);
+	else if (IS_I_VERSION(inode)) {
+		u64 chattr;
+
+		chattr =  stat->ctime.tv_sec;
+		chattr <<= 30;
+		chattr += stat->ctime.tv_nsec;
+		chattr += inode_query_iversion(inode);
+		return chattr;
+	} else
+		return time_to_chattr(&stat->ctime);
+}
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 513e028b0bbe..4e0ecf0ae2cf 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -293,34 +293,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
-static inline u64 nfsd4_change_attribute(struct kstat *stat,
-					 struct inode *inode)
-{
-	if (inode->i_sb->s_export_op->fetch_iversion)
-		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
-		return time_to_chattr(&stat->ctime);
-}
-
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
 extern void fh_fill_pre_attrs(struct svc_fh *fhp);
 extern void fh_fill_post_attrs(struct svc_fh *fhp);
 extern void fh_fill_both_attrs(struct svc_fh *fhp);
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 7/8] nfsd: use the getattr operation to fetch i_version
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
                   ` (5 preceding siblings ...)
  2023-01-24 19:30 ` [PATCH v8 RESEND 6/8] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  2023-01-24 19:30 ` [PATCH v8 RESEND 8/8] nfsd: remove fetch_iversion export operation Jeff Layton
  7 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Now that we can call into vfs_getattr to get the i_version field, use
that facility to fetch it instead of doing it in nfsd4_change_attribute.

Neil also pointed out recently that IS_I_VERSION directory operations
are always logged, and so we only need to mitigate the rollback problem
on regular files. Also, we don't need to factor in the ctime when
reexporting NFS or Ceph.

Set the STATX_CHANGE_COOKIE (and BTIME) bits in the request when we're
dealing with a v4 request. Then, instead of looking at IS_I_VERSION when
generating the change attr, look at the result mask and only use it if
STATX_CHANGE_COOKIE is set.

Change nfsd4_change_attribute to only factor in the ctime if it's a
regular file and the fs doesn't advertise STATX_ATTR_CHANGE_MONOTONIC.

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/nfs4xdr.c |  4 +++-
 fs/nfsd/nfsfh.c   | 54 +++++++++++++++++++++++++++++++----------------
 fs/nfsd/vfs.h     |  7 +++++-
 3 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 97edb32be77f..e12e5a4ad502 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2965,7 +2965,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			goto out;
 	}
 
-	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
+	err = vfs_getattr(&path, &stat,
+			  STATX_BASIC_STATS | STATX_BTIME | STATX_CHANGE_COOKIE,
+			  AT_STATX_SYNC_AS_STAT);
 	if (err)
 		goto out_nfserr;
 	if (!(stat.result_mask & STATX_BTIME))
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index ac89e25e7733..3a01c8601712 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -628,6 +628,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
 		stat.mtime = inode->i_mtime;
 		stat.ctime = inode->i_ctime;
 		stat.size  = inode->i_size;
+		if (v4 && IS_I_VERSION(inode)) {
+			stat.change_cookie = inode_query_iversion(inode);
+			stat.result_mask |= STATX_CHANGE_COOKIE;
+		}
 	}
 	if (v4)
 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
@@ -659,6 +663,10 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
 	if (err) {
 		fhp->fh_post_saved = false;
 		fhp->fh_post_attr.ctime = inode->i_ctime;
+		if (v4 && IS_I_VERSION(inode)) {
+			fhp->fh_post_attr.change_cookie = inode_query_iversion(inode);
+			fhp->fh_post_attr.result_mask |= STATX_CHANGE_COOKIE;
+		}
 	} else
 		fhp->fh_post_saved = true;
 	if (v4)
@@ -750,28 +758,38 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
 }
 
 /*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
+ * We could use i_version alone as the change attribute.  However, i_version
+ * can go backwards on a regular file after an unclean shutdown.  On its own
+ * that doesn't necessarily cause a problem, but if i_version goes backwards
+ * and then is incremented again it could reuse a value that was previously
+ * used before boot, and a client who queried the two values might incorrectly
+ * assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as long as
+ * time doesn't go backwards we never reuse an old value. If the filesystem
+ * advertises STATX_ATTR_CHANGE_MONOTONIC, then this mitigation is not
+ * needed.
  *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
+ * We only need to do this for regular files as well. For directories, we
+ * assume that the new change attr is always logged to stable storage in some
+ * fashion before the results can be seen.
  */
 u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
 {
+	u64 chattr;
+
 	if (inode->i_sb->s_export_op->fetch_iversion)
 		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
-		return time_to_chattr(&stat->ctime);
+	if (stat->result_mask & STATX_CHANGE_COOKIE) {
+		chattr = stat->change_cookie;
+
+		if (S_ISREG(inode->i_mode) &&
+		    !(stat->attributes & STATX_ATTR_CHANGE_MONOTONIC)) {
+			chattr += (u64)stat->ctime.tv_sec << 30;
+			chattr += stat->ctime.tv_nsec;
+		}
+	} else {
+		chattr = time_to_chattr(&stat->ctime);
+	}
+	return chattr;
 }
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index dbdfef7ae85b..43fb57a301d3 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -170,9 +170,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
 
 static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 {
+	u32 request_mask = STATX_BASIC_STATS;
 	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
 			 .dentry = fh->fh_dentry};
-	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
+
+	if (fh->fh_maxsize == NFS4_FHSIZE)
+		request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
+
+	return nfserrno(vfs_getattr(&p, stat, request_mask,
 				    AT_STATX_SYNC_AS_STAT));
 }
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 RESEND 8/8] nfsd: remove fetch_iversion export operation
  2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
                   ` (6 preceding siblings ...)
  2023-01-24 19:30 ` [PATCH v8 RESEND 7/8] nfsd: use the getattr operation to fetch i_version Jeff Layton
@ 2023-01-24 19:30 ` Jeff Layton
  7 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-24 19:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Now that the i_version counter is reported in struct kstat, there is no
need for this export operation.

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/export.c          | 7 -------
 fs/nfsd/nfsfh.c          | 3 ---
 include/linux/exportfs.h | 1 -
 3 files changed, 11 deletions(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 01596f2d0a1e..1a9d5aa51dfb 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
 	return parent;
 }
 
-static u64 nfs_fetch_iversion(struct inode *inode)
-{
-	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
-	return inode_peek_iversion_raw(inode);
-}
-
 const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.fetch_iversion = nfs_fetch_iversion,
 	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
 		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
 		EXPORT_OP_NOATOMIC_ATTR,
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 3a01c8601712..76ea268dc420 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -778,11 +778,8 @@ u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
 {
 	u64 chattr;
 
-	if (inode->i_sb->s_export_op->fetch_iversion)
-		return inode->i_sb->s_export_op->fetch_iversion(inode);
 	if (stat->result_mask & STATX_CHANGE_COOKIE) {
 		chattr = stat->change_cookie;
-
 		if (S_ISREG(inode->i_mode) &&
 		    !(stat->attributes & STATX_ATTR_CHANGE_MONOTONIC)) {
 			chattr += (u64)stat->ctime.tv_sec << 30;
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index fe848901fcc3..9f4d4bcbf251 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,7 +213,6 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
-	u64 (*fetch_iversion)(struct inode *);
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
 #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion
  2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
@ 2023-01-25 13:44   ` Jan Kara
  2023-01-26  9:02   ` Christian Brauner
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2023-01-25 13:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Tue 24-01-23 14:30:18, Jeff Layton wrote:
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Reasonable. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/iversion.h | 38 ++------------------------------------
>  2 files changed, 38 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index aada4e7c8713..17ecc47696e1 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1582,3 +1582,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
>  	return true;
>  }
>  EXPORT_SYMBOL(inode_maybe_inc_iversion);
> +
> +/**
> + * inode_query_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison. This will guarantee
> + * that a later query of the i_version will result in a different value if
> + * anything has changed.
> + *
> + * In this implementation, we fetch the current value, set the QUERIED flag and
> + * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> + * that fails, we try again with the newly fetched value from the cmpxchg.
> + */
> +u64 inode_query_iversion(struct inode *inode)
> +{
> +	u64 cur, new;
> +
> +	cur = inode_peek_iversion_raw(inode);
> +	do {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED) {
> +			/*
> +			 * This barrier (and the implicit barrier in the
> +			 * cmpxchg below) pairs with the barrier in
> +			 * inode_maybe_inc_iversion().
> +			 */
> +			smp_mb();
> +			break;
> +		}
> +
> +		new = cur | I_VERSION_QUERIED;
> +	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> +	return cur >> I_VERSION_QUERIED_SHIFT;
> +}
> +EXPORT_SYMBOL(inode_query_iversion);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e27bd4f55d84..6755d8b4f20b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
>  	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> -/**
> - * inode_query_iversion - read i_version for later use
> - * @inode: inode from which i_version should be read
> - *
> - * Read the inode i_version counter. This should be used by callers that wish
> - * to store the returned i_version for later comparison. This will guarantee
> - * that a later query of the i_version will result in a different value if
> - * anything has changed.
> - *
> - * In this implementation, we fetch the current value, set the QUERIED flag and
> - * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> - * that fails, we try again with the newly fetched value from the cmpxchg.
> - */
> -static inline u64
> -inode_query_iversion(struct inode *inode)
> -{
> -	u64 cur, new;
> -
> -	cur = inode_peek_iversion_raw(inode);
> -	do {
> -		/* If flag is already set, then no need to swap */
> -		if (cur & I_VERSION_QUERIED) {
> -			/*
> -			 * This barrier (and the implicit barrier in the
> -			 * cmpxchg below) pairs with the barrier in
> -			 * inode_maybe_inc_iversion().
> -			 */
> -			smp_mb();
> -			break;
> -		}
> -
> -		new = cur | I_VERSION_QUERIED;
> -	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> -	return cur >> I_VERSION_QUERIED_SHIFT;
> -}
> -
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:
> @@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
>  	return chattr;
>  }
>  
> +u64 inode_query_iversion(struct inode *inode);
> +
>  /**
>   * inode_eq_iversion_raw - check whether the raw i_version counter has changed
>   * @inode: inode to check
> -- 
> 2.39.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat
  2023-01-24 19:30 ` [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat Jeff Layton
@ 2023-01-25 15:50   ` Christian Brauner
  2023-01-25 18:30     ` Jeff Layton
  2023-01-25 16:20   ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2023-01-25 15:50 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, Jan 24, 2023 at 02:30:20PM -0500, Jeff Layton wrote:
> The NFS server has a lot of special handling for different types of
> change attribute access, depending on the underlying filesystem. In
> most cases, it's doing a getattr anyway and then fetching that value
> after the fact.
> 
> Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a
> kernel-only symbol (for now). If requested and getattr can implement it,
> it can fill out this field. For IS_I_VERSION inodes, add a generic
> implementation in vfs_getattr_nosec. Take care to mask
> STATX_CHANGE_COOKIE off in requests from userland and in the result
> mask.
> 
> Since not all filesystems can give the same guarantees of monotonicity,
> claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to
> indicate that they offer an i_version value that can never go backward.
> 
> Eventually if we decide to make the i_version available to userland, we
> can just designate a field for it in struct statx, and move the
> STATX_CHANGE_COOKIE definition to the uapi header.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/stat.c            | 17 +++++++++++++++--
>  include/linux/stat.h |  9 +++++++++
>  2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index d6cc74ca8486..f43afe0081fe 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -18,6 +18,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/compat.h>
> +#include <linux/iversion.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
> @@ -122,6 +123,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
>  				  STATX_ATTR_DAX);
>  
> +	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> +		stat->result_mask |= STATX_CHANGE_COOKIE;
> +		stat->change_cookie = inode_query_iversion(inode);
> +	}
> +
>  	mnt_userns = mnt_user_ns(path->mnt);
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(mnt_userns, path, stat,
> @@ -602,9 +608,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  
>  	memset(&tmp, 0, sizeof(tmp));
>  
> -	tmp.stx_mask = stat->result_mask;
> +	/* STATX_CHANGE_COOKIE is kernel-only for now */
> +	tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;
>  	tmp.stx_blksize = stat->blksize;
> -	tmp.stx_attributes = stat->attributes;
> +	/* STATX_ATTR_CHANGE_MONOTONIC is kernel-only for now */
> +	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_CHANGE_MONOTONIC;
>  	tmp.stx_nlink = stat->nlink;
>  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
>  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> @@ -643,6 +651,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
>  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
>  		return -EINVAL;
>  
> +	/* STATX_CHANGE_COOKIE is kernel-only for now. Ignore requests
> +	 * from userland.
> +	 */
> +	mask &= ~STATX_CHANGE_COOKIE;
> +
>  	error = vfs_statx(dfd, filename, flags, &stat, mask);
>  	if (error)
>  		return error;
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index ff277ced50e9..52150570d37a 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h

Sorry being late to the party once again...

> @@ -52,6 +52,15 @@ struct kstat {
>  	u64		mnt_id;
>  	u32		dio_mem_align;
>  	u32		dio_offset_align;
> +	u64		change_cookie;
>  };
>  
> +/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> +
> +/* mask values */
> +#define STATX_CHANGE_COOKIE		0x40000000U	/* Want/got stx_change_attr */
> +
> +/* file attribute values */
> +#define STATX_ATTR_CHANGE_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */

maybe it would be better to copy what we do for SB_* vs SB_I_* flags and
at least rename them to:

STATX_I_CHANGE_COOKIE
STATX_I_ATTR_CHANGE_MONOTONIC
i_change_cookie

to visually distinguish internal and external flags.

And also if possible it might be useful to move STATX_I_* flags to the
higher 32 bits and then one can use upper_32_bits to retrieve kernel
internal flags and lower_32_bits for userspace flags in tiny wrappers.

(I did something similar for clone3() a few years ago but there to
distinguish between flags available both in clone() and clone3() and
such that are only available in clone3().)

But just a thought. I mostly worry about accidently leaking this to
userspace so ideally we'd even have separate fields in struct kstat for
internal and external attributes but that might bump kstat size, though
I don't think struct kstat is actually ever really allocated all that
much.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-24 19:30 ` [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated Jeff Layton
@ 2023-01-25 16:06   ` Jan Kara
  2023-01-26 10:54     ` Jeff Layton
  2023-01-26  9:01   ` Christian Brauner
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2023-01-25 16:06 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. But one note below:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 6755d8b4f20b..fced8115a5f4 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,25 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear larger to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). i_version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> + * ctime must change due to an operation, then the i_version counter must be
> + * incremented as well.
> + *
> + * Making the i_version update completely atomic with the operation itself would
> + * be prohibitively expensive. Traditionally the kernel has updated the times on
> + * directories after an operation that changes its contents. For regular files,
> + * the ctime is usually updated before the data is copied into the cache for a
> + * write. This means that there is a window of time when an observer can
> + * associate a new timestamp with old file contents. Since the purpose of the
> + * i_version is to allow for better cache coherency, the i_version must always
> + * be updated after the results of the operation are visible. Updating it before
> + * and after a change is also permitted.

This sounds good but it is not the case for any of the current filesystems, is
it? Perhaps the documentation should mention this so that people are not
confused?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat
  2023-01-24 19:30 ` [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat Jeff Layton
  2023-01-25 15:50   ` Christian Brauner
@ 2023-01-25 16:20   ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2023-01-25 16:20 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Tue 24-01-23 14:30:20, Jeff Layton wrote:
> The NFS server has a lot of special handling for different types of
> change attribute access, depending on the underlying filesystem. In
> most cases, it's doing a getattr anyway and then fetching that value
> after the fact.
> 
> Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a
> kernel-only symbol (for now). If requested and getattr can implement it,
> it can fill out this field. For IS_I_VERSION inodes, add a generic
> implementation in vfs_getattr_nosec. Take care to mask
> STATX_CHANGE_COOKIE off in requests from userland and in the result
> mask.
> 
> Since not all filesystems can give the same guarantees of monotonicity,
> claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to
> indicate that they offer an i_version value that can never go backward.
> 
> Eventually if we decide to make the i_version available to userland, we
> can just designate a field for it in struct statx, and move the
> STATX_CHANGE_COOKIE definition to the uapi header.
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/stat.c            | 17 +++++++++++++++--
>  include/linux/stat.h |  9 +++++++++
>  2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index d6cc74ca8486..f43afe0081fe 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -18,6 +18,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/compat.h>
> +#include <linux/iversion.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
> @@ -122,6 +123,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
>  				  STATX_ATTR_DAX);
>  
> +	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> +		stat->result_mask |= STATX_CHANGE_COOKIE;
> +		stat->change_cookie = inode_query_iversion(inode);
> +	}
> +
>  	mnt_userns = mnt_user_ns(path->mnt);
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(mnt_userns, path, stat,
> @@ -602,9 +608,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  
>  	memset(&tmp, 0, sizeof(tmp));
>  
> -	tmp.stx_mask = stat->result_mask;
> +	/* STATX_CHANGE_COOKIE is kernel-only for now */
> +	tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;
>  	tmp.stx_blksize = stat->blksize;
> -	tmp.stx_attributes = stat->attributes;
> +	/* STATX_ATTR_CHANGE_MONOTONIC is kernel-only for now */
> +	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_CHANGE_MONOTONIC;
>  	tmp.stx_nlink = stat->nlink;
>  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
>  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> @@ -643,6 +651,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
>  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
>  		return -EINVAL;
>  
> +	/* STATX_CHANGE_COOKIE is kernel-only for now. Ignore requests
> +	 * from userland.
> +	 */
> +	mask &= ~STATX_CHANGE_COOKIE;
> +
>  	error = vfs_statx(dfd, filename, flags, &stat, mask);
>  	if (error)
>  		return error;
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index ff277ced50e9..52150570d37a 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -52,6 +52,15 @@ struct kstat {
>  	u64		mnt_id;
>  	u32		dio_mem_align;
>  	u32		dio_offset_align;
> +	u64		change_cookie;
>  };
>  
> +/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> +
> +/* mask values */
> +#define STATX_CHANGE_COOKIE		0x40000000U	/* Want/got stx_change_attr */
> +
> +/* file attribute values */
> +#define STATX_ATTR_CHANGE_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> +
>  #endif
> -- 
> 2.39.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat
  2023-01-25 15:50   ` Christian Brauner
@ 2023-01-25 18:30     ` Jeff Layton
  2023-01-26  8:23       ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2023-01-25 18:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Wed, 2023-01-25 at 16:50 +0100, Christian Brauner wrote:
> On Tue, Jan 24, 2023 at 02:30:20PM -0500, Jeff Layton wrote:
> > The NFS server has a lot of special handling for different types of
> > change attribute access, depending on the underlying filesystem. In
> > most cases, it's doing a getattr anyway and then fetching that value
> > after the fact.
> > 
> > Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a
> > kernel-only symbol (for now). If requested and getattr can implement it,
> > it can fill out this field. For IS_I_VERSION inodes, add a generic
> > implementation in vfs_getattr_nosec. Take care to mask
> > STATX_CHANGE_COOKIE off in requests from userland and in the result
> > mask.
> > 
> > Since not all filesystems can give the same guarantees of monotonicity,
> > claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to
> > indicate that they offer an i_version value that can never go backward.
> > 
> > Eventually if we decide to make the i_version available to userland, we
> > can just designate a field for it in struct statx, and move the
> > STATX_CHANGE_COOKIE definition to the uapi header.
> > 
> > Reviewed-by: NeilBrown <neilb@suse.de>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/stat.c            | 17 +++++++++++++++--
> >  include/linux/stat.h |  9 +++++++++
> >  2 files changed, 24 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/stat.c b/fs/stat.c
> > index d6cc74ca8486..f43afe0081fe 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/compat.h>
> > +#include <linux/iversion.h>
> >  
> >  #include <linux/uaccess.h>
> >  #include <asm/unistd.h>
> > @@ -122,6 +123,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
> >  				  STATX_ATTR_DAX);
> >  
> > +	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> > +		stat->result_mask |= STATX_CHANGE_COOKIE;
> > +		stat->change_cookie = inode_query_iversion(inode);
> > +	}
> > +
> >  	mnt_userns = mnt_user_ns(path->mnt);
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(mnt_userns, path, stat,
> > @@ -602,9 +608,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> >  
> >  	memset(&tmp, 0, sizeof(tmp));
> >  
> > -	tmp.stx_mask = stat->result_mask;
> > +	/* STATX_CHANGE_COOKIE is kernel-only for now */
> > +	tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;
> >  	tmp.stx_blksize = stat->blksize;
> > -	tmp.stx_attributes = stat->attributes;
> > +	/* STATX_ATTR_CHANGE_MONOTONIC is kernel-only for now */
> > +	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_CHANGE_MONOTONIC;
> >  	tmp.stx_nlink = stat->nlink;
> >  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
> >  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> > @@ -643,6 +651,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
> >  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
> >  		return -EINVAL;
> >  
> > +	/* STATX_CHANGE_COOKIE is kernel-only for now. Ignore requests
> > +	 * from userland.
> > +	 */
> > +	mask &= ~STATX_CHANGE_COOKIE;
> > +
> >  	error = vfs_statx(dfd, filename, flags, &stat, mask);
> >  	if (error)
> >  		return error;
> > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > index ff277ced50e9..52150570d37a 100644
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> 
> Sorry being late to the party once again...
> 
> > @@ -52,6 +52,15 @@ struct kstat {
> >  	u64		mnt_id;
> >  	u32		dio_mem_align;
> >  	u32		dio_offset_align;
> > +	u64		change_cookie;
> >  };
> >  
> > +/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> > +
> > +/* mask values */
> > +#define STATX_CHANGE_COOKIE		0x40000000U	/* Want/got stx_change_attr */
> > +
> > +/* file attribute values */
> > +#define STATX_ATTR_CHANGE_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> 
> maybe it would be better to copy what we do for SB_* vs SB_I_* flags and
> at least rename them to:
> 
> STATX_I_CHANGE_COOKIE
> STATX_I_ATTR_CHANGE_MONOTONIC
> i_change_cookie

An "i_"/"I_" prefix says "inode" to me. Maybe I've been at this too
long. ;)

> 
> to visually distinguish internal and external flags.
> 
> And also if possible it might be useful to move STATX_I_* flags to the
> higher 32 bits and then one can use upper_32_bits to retrieve kernel
> internal flags and lower_32_bits for userspace flags in tiny wrappers.
> 
> (I did something similar for clone3() a few years ago but there to
> distinguish between flags available both in clone() and clone3() and
> such that are only available in clone3().)
> 
> But just a thought. I mostly worry about accidently leaking this to
> userspace so ideally we'd even have separate fields in struct kstat for
> internal and external attributes but that might bump kstat size, though
> I don't think struct kstat is actually ever really allocated all that
> much.

I'm not sure that the internal/external distinction matters much for
filesystem providers or consumers of it. The place that it matters is at
the userland interface level, where statx or something similar is
called.

At some point we may want to make STATX_CHANGE_COOKIE queryable via
statx, at which point we'll have to have a big flag day where we do
s/STATX_I_CHANGE_COOKIE/STATX_CHANGE_COOKIE/.

I don't think it's worth it here.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat
  2023-01-25 18:30     ` Jeff Layton
@ 2023-01-26  8:23       ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2023-01-26  8:23 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Wed, Jan 25, 2023 at 01:30:07PM -0500, Jeff Layton wrote:
> On Wed, 2023-01-25 at 16:50 +0100, Christian Brauner wrote:
> > On Tue, Jan 24, 2023 at 02:30:20PM -0500, Jeff Layton wrote:
> > > The NFS server has a lot of special handling for different types of
> > > change attribute access, depending on the underlying filesystem. In
> > > most cases, it's doing a getattr anyway and then fetching that value
> > > after the fact.
> > > 
> > > Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a
> > > kernel-only symbol (for now). If requested and getattr can implement it,
> > > it can fill out this field. For IS_I_VERSION inodes, add a generic
> > > implementation in vfs_getattr_nosec. Take care to mask
> > > STATX_CHANGE_COOKIE off in requests from userland and in the result
> > > mask.
> > > 
> > > Since not all filesystems can give the same guarantees of monotonicity,
> > > claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to
> > > indicate that they offer an i_version value that can never go backward.
> > > 
> > > Eventually if we decide to make the i_version available to userland, we
> > > can just designate a field for it in struct statx, and move the
> > > STATX_CHANGE_COOKIE definition to the uapi header.
> > > 
> > > Reviewed-by: NeilBrown <neilb@suse.de>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/stat.c            | 17 +++++++++++++++--
> > >  include/linux/stat.h |  9 +++++++++
> > >  2 files changed, 24 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index d6cc74ca8486..f43afe0081fe 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/syscalls.h>
> > >  #include <linux/pagemap.h>
> > >  #include <linux/compat.h>
> > > +#include <linux/iversion.h>
> > >  
> > >  #include <linux/uaccess.h>
> > >  #include <asm/unistd.h>
> > > @@ -122,6 +123,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> > >  	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
> > >  				  STATX_ATTR_DAX);
> > >  
> > > +	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> > > +		stat->result_mask |= STATX_CHANGE_COOKIE;
> > > +		stat->change_cookie = inode_query_iversion(inode);
> > > +	}
> > > +
> > >  	mnt_userns = mnt_user_ns(path->mnt);
> > >  	if (inode->i_op->getattr)
> > >  		return inode->i_op->getattr(mnt_userns, path, stat,
> > > @@ -602,9 +608,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> > >  
> > >  	memset(&tmp, 0, sizeof(tmp));
> > >  
> > > -	tmp.stx_mask = stat->result_mask;
> > > +	/* STATX_CHANGE_COOKIE is kernel-only for now */
> > > +	tmp.stx_mask = stat->result_mask & ~STATX_CHANGE_COOKIE;
> > >  	tmp.stx_blksize = stat->blksize;
> > > -	tmp.stx_attributes = stat->attributes;
> > > +	/* STATX_ATTR_CHANGE_MONOTONIC is kernel-only for now */
> > > +	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_CHANGE_MONOTONIC;
> > >  	tmp.stx_nlink = stat->nlink;
> > >  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
> > >  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> > > @@ -643,6 +651,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
> > >  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
> > >  		return -EINVAL;
> > >  
> > > +	/* STATX_CHANGE_COOKIE is kernel-only for now. Ignore requests
> > > +	 * from userland.
> > > +	 */
> > > +	mask &= ~STATX_CHANGE_COOKIE;
> > > +
> > >  	error = vfs_statx(dfd, filename, flags, &stat, mask);
> > >  	if (error)
> > >  		return error;
> > > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > > index ff277ced50e9..52150570d37a 100644
> > > --- a/include/linux/stat.h
> > > +++ b/include/linux/stat.h
> > 
> > Sorry being late to the party once again...
> > 
> > > @@ -52,6 +52,15 @@ struct kstat {
> > >  	u64		mnt_id;
> > >  	u32		dio_mem_align;
> > >  	u32		dio_offset_align;
> > > +	u64		change_cookie;
> > >  };
> > >  
> > > +/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> > > +
> > > +/* mask values */
> > > +#define STATX_CHANGE_COOKIE		0x40000000U	/* Want/got stx_change_attr */
> > > +
> > > +/* file attribute values */
> > > +#define STATX_ATTR_CHANGE_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> > 
> > maybe it would be better to copy what we do for SB_* vs SB_I_* flags and
> > at least rename them to:
> > 
> > STATX_I_CHANGE_COOKIE
> > STATX_I_ATTR_CHANGE_MONOTONIC
> > i_change_cookie
> 
> An "i_"/"I_" prefix says "inode" to me. Maybe I've been at this too
> long. ;)
> 
> > 
> > to visually distinguish internal and external flags.
> > 
> > And also if possible it might be useful to move STATX_I_* flags to the
> > higher 32 bits and then one can use upper_32_bits to retrieve kernel
> > internal flags and lower_32_bits for userspace flags in tiny wrappers.
> > 
> > (I did something similar for clone3() a few years ago but there to
> > distinguish between flags available both in clone() and clone3() and
> > such that are only available in clone3().)
> > 
> > But just a thought. I mostly worry about accidently leaking this to
> > userspace so ideally we'd even have separate fields in struct kstat for
> > internal and external attributes but that might bump kstat size, though
> > I don't think struct kstat is actually ever really allocated all that
> > much.
> 
> I'm not sure that the internal/external distinction matters much for
> filesystem providers or consumers of it. The place that it matters is at
> the userland interface level, where statx or something similar is
> called.
> 
> At some point we may want to make STATX_CHANGE_COOKIE queryable via
> statx, at which point we'll have to have a big flag day where we do
> s/STATX_I_CHANGE_COOKIE/STATX_CHANGE_COOKIE/.
> 
> I don't think it's worth it here.

I'm not fond of internal and external STATX_* flags having the exact
same name but fine. :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-24 19:30 ` [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated Jeff Layton
  2023-01-25 16:06   ` Jan Kara
@ 2023-01-26  9:01   ` Christian Brauner
  1 sibling, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2023-01-26  9:01 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Tue, Jan 24, 2023 at 02:30:19PM -0500, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

Looks good to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion
  2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
  2023-01-25 13:44   ` Jan Kara
@ 2023-01-26  9:02   ` Christian Brauner
  1 sibling, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2023-01-26  9:02 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, Jan 24, 2023 at 02:30:18PM -0500, Jeff Layton wrote:
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

Looks good to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-25 16:06   ` Jan Kara
@ 2023-01-26 10:54     ` Jeff Layton
  2023-01-26 11:36       ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2023-01-26 10:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but NFSv4 has certain expectations. Update the comments
> > in iversion.h to describe when the i_version must change.
> > 
> > Cc: Colin Walters <walters@verbum.org>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Looks good to me. But one note below:
> 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 6755d8b4f20b..fced8115a5f4 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,25 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear larger to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). i_version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > + * ctime must change due to an operation, then the i_version counter must be
> > + * incremented as well.
> > + *
> > + * Making the i_version update completely atomic with the operation itself would
> > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > + * directories after an operation that changes its contents. For regular files,
> > + * the ctime is usually updated before the data is copied into the cache for a
> > + * write. This means that there is a window of time when an observer can
> > + * associate a new timestamp with old file contents. Since the purpose of the
> > + * i_version is to allow for better cache coherency, the i_version must always
> > + * be updated after the results of the operation are visible. Updating it before
> > + * and after a change is also permitted.
> 
> This sounds good but it is not the case for any of the current filesystems, is
> it? Perhaps the documentation should mention this so that people are not
> confused?
> 
> 								Honza

Correct. Currently, all filesystems change the times and version before
a write instead of after. I'm hoping that situation will change soon
though, as I've been working on a patchset to fix this for tmpfs, ext4
and btrfs.

If you still want to see something for this though, what would you
suggest for verbiage?

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-26 10:54     ` Jeff Layton
@ 2023-01-26 11:36       ` Jan Kara
  2023-01-26 12:02         ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2023-01-26 11:36 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, tytso, adilger.kernel, djwong, david, trondmy, neilb,
	viro, zohar, xiubli, chuck.lever, lczerner, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs, Colin Walters

On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but NFSv4 has certain expectations. Update the comments
> > > in iversion.h to describe when the i_version must change.
> > > 
> > > Cc: Colin Walters <walters@verbum.org>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > Looks good to me. But one note below:
> > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 6755d8b4f20b..fced8115a5f4 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,25 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear larger to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > + * ctime must change due to an operation, then the i_version counter must be
> > > + * incremented as well.
> > > + *
> > > + * Making the i_version update completely atomic with the operation itself would
> > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > + * directories after an operation that changes its contents. For regular files,
> > > + * the ctime is usually updated before the data is copied into the cache for a
> > > + * write. This means that there is a window of time when an observer can
> > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > + * i_version is to allow for better cache coherency, the i_version must always
> > > + * be updated after the results of the operation are visible. Updating it before
> > > + * and after a change is also permitted.
> > 
> > This sounds good but it is not the case for any of the current filesystems, is
> > it? Perhaps the documentation should mention this so that people are not
> > confused?
> 
> Correct. Currently, all filesystems change the times and version before
> a write instead of after. I'm hoping that situation will change soon
> though, as I've been working on a patchset to fix this for tmpfs, ext4
> and btrfs.

That is good but we'll see how long it takes to get merged. AFAIR it is not
a complete nobrainer ;)

> If you still want to see something for this though, what would you
> suggest for verbiage?

Sure:

... the i_version must a be updated after the results of the operation are
visible (note that none of the filesystems currently do this, it is a work
in progress to fix this).

And once your patches are merged, you can also delete this note :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated
  2023-01-26 11:36       ` Jan Kara
@ 2023-01-26 12:02         ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2023-01-26 12:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Thu, 2023-01-26 at 12:36 +0100, Jan Kara wrote:
> On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> > On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but NFSv4 has certain expectations. Update the comments
> > > > in iversion.h to describe when the i_version must change.
> > > > 
> > > > Cc: Colin Walters <walters@verbum.org>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > 
> > > Looks good to me. But one note below:
> > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 6755d8b4f20b..fced8115a5f4 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,25 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear larger to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > > + * ctime must change due to an operation, then the i_version counter must be
> > > > + * incremented as well.
> > > > + *
> > > > + * Making the i_version update completely atomic with the operation itself would
> > > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > > + * directories after an operation that changes its contents. For regular files,
> > > > + * the ctime is usually updated before the data is copied into the cache for a
> > > > + * write. This means that there is a window of time when an observer can
> > > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > > + * i_version is to allow for better cache coherency, the i_version must always
> > > > + * be updated after the results of the operation are visible. Updating it before
> > > > + * and after a change is also permitted.
> > > 
> > > This sounds good but it is not the case for any of the current filesystems, is
> > > it? Perhaps the documentation should mention this so that people are not
> > > confused?
> > 
> > Correct. Currently, all filesystems change the times and version before
> > a write instead of after. I'm hoping that situation will change soon
> > though, as I've been working on a patchset to fix this for tmpfs, ext4
> > and btrfs.
> 
> That is good but we'll see how long it takes to get merged. AFAIR it is not
> a complete nobrainer ;)
> 
> > If you still want to see something for this though, what would you
> > suggest for verbiage?
> 
> Sure:
> 
> ... the i_version must a be updated after the results of the operation are
> visible (note that none of the filesystems currently do this, it is a work
> in progress to fix this).
> 
> And once your patches are merged, you can also delete this note :).
> 
> 								Honza

Sounds good, I folded something similar to that into the patch and
pushed it into the branch I'm feeding into linux-next. I won't bother
re-posting for just that though:

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index fced8115a5f4..f174ff1b59ee 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -27,7 +27,8 @@
  * associate a new timestamp with old file contents. Since the purpose of the
  * i_version is to allow for better cache coherency, the i_version must always
  * be updated after the results of the operation are visible. Updating it before
- * and after a change is also permitted.
+ * and after a change is also permitted. (Note that no filesystems currently do
+ * this. Fixing that is a work-in-progress).
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-01-26 12:02 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-24 19:30 [PATCH v8 RESEND 0/8] fs: clean up internal i_version handling Jeff Layton
2023-01-24 19:30 ` [PATCH v8 RESEND 1/8] fs: uninline inode_query_iversion Jeff Layton
2023-01-25 13:44   ` Jan Kara
2023-01-26  9:02   ` Christian Brauner
2023-01-24 19:30 ` [PATCH v8 RESEND 2/8] fs: clarify when the i_version counter must be updated Jeff Layton
2023-01-25 16:06   ` Jan Kara
2023-01-26 10:54     ` Jeff Layton
2023-01-26 11:36       ` Jan Kara
2023-01-26 12:02         ` Jeff Layton
2023-01-26  9:01   ` Christian Brauner
2023-01-24 19:30 ` [PATCH v8 RESEND 3/8] vfs: plumb i_version handling into struct kstat Jeff Layton
2023-01-25 15:50   ` Christian Brauner
2023-01-25 18:30     ` Jeff Layton
2023-01-26  8:23       ` Christian Brauner
2023-01-25 16:20   ` Jan Kara
2023-01-24 19:30 ` [PATCH v8 RESEND 4/8] nfs: report the inode version in getattr if requested Jeff Layton
2023-01-24 19:30 ` [PATCH v8 RESEND 5/8] ceph: " Jeff Layton
2023-01-24 19:30 ` [PATCH v8 RESEND 6/8] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
2023-01-24 19:30 ` [PATCH v8 RESEND 7/8] nfsd: use the getattr operation to fetch i_version Jeff Layton
2023-01-24 19:30 ` [PATCH v8 RESEND 8/8] nfsd: remove fetch_iversion export operation Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.