ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter
@ 2022-09-30 11:18 Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
                   ` (8 more replies)
  0 siblings, 9 replies; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

v6: add support for STATX_ATTR_VERSION_MONOTONIC
    add patch to expose i_version counter to userland
    patches to update times/version after copying write data

An updated set of i_version handling changes! I've dropped the earlier
ext4 patches since Ted has picked up the relevant ext4 ones.

This set is based on linux-next, to make sure we don't collide with the
statx DIO alignment patches, and some other i_version cleanups that are
in flight.  I'm hoping those land in 6.1.

There are a few changes since v5, mostly centered around adding
STATX_ATTR_VERSION_MONOTONIC. I've also re-added the patch to expose
STATX_VERSION to userland via statx. What I'm proposing should now
(mostly) conform to the semantics I layed out in the manpage patch I
sent recently [1].

Finally, I've added two patches to make __generic_file_write_iter and
ext4 update the c/mtime after copying file data instead of before, which
Neil pointed out makes for better cache-coherency handling. Those should
take care of ext4 and tmpfs. xfs and btrfs will need to make the same
changes.

One thing I'm not sure of is what we should do if update_times fails
after an otherwise successful write. Should we just ignore that and move
on (and maybe WARN)? Return an error? Set a writeback error? What's the
right recourse there?

I'd like to go ahead and get the first 6 patches from this series into
linux-next fairly soon, so if anyone has objections, please speak up!

[1]: https://lore.kernel.org/linux-nfs/20220928134200.28741-1-jlayton@kernel.org/T/#u

Jeff Layton (9):
  iversion: move inode_query_iversion to libfs.c
  iversion: clarify when the i_version counter must be updated
  vfs: plumb i_version handling into struct kstat
  nfs: report the inode version in getattr if requested
  ceph: report the inode version in getattr if requested
  nfsd: use the getattr operation to fetch i_version
  vfs: expose STATX_VERSION to userland
  vfs: update times after copying data in __generic_file_write_iter
  ext4: update times after I/O in write codepaths

 fs/ceph/inode.c           | 16 +++++++++----
 fs/ext4/file.c            | 20 +++++++++++++---
 fs/libfs.c                | 36 +++++++++++++++++++++++++++++
 fs/nfs/export.c           |  7 ------
 fs/nfs/inode.c            | 10 ++++++--
 fs/nfsd/nfs4xdr.c         |  4 +++-
 fs/nfsd/nfsfh.c           | 40 ++++++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h           | 29 +----------------------
 fs/nfsd/vfs.h             |  7 +++++-
 fs/stat.c                 |  7 ++++++
 include/linux/exportfs.h  |  1 -
 include/linux/iversion.h  | 48 ++++++++-------------------------------
 include/linux/stat.h      |  2 +-
 include/uapi/linux/stat.h |  6 +++--
 mm/filemap.c              | 17 ++++++++++----
 samples/vfs/test-statx.c  |  8 +++++--
 16 files changed, 163 insertions(+), 95 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-03 23:05   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated Jeff Layton
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

There's no need to have such a large function forcibly inlined.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/iversion.h | 38 ++------------------------------------
 2 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 682d56345a1c..5ae81466a422 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1566,3 +1566,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
 	return true;
 }
 EXPORT_SYMBOL(inode_maybe_inc_iversion);
+
+/**
+ * inode_query_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison. This will guarantee
+ * that a later query of the i_version will result in a different value if
+ * anything has changed.
+ *
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
+ */
+u64 inode_query_iversion(struct inode *inode)
+{
+	u64 cur, new;
+
+	cur = inode_peek_iversion_raw(inode);
+	do {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED) {
+			/*
+			 * This barrier (and the implicit barrier in the
+			 * cmpxchg below) pairs with the barrier in
+			 * inode_maybe_inc_iversion().
+			 */
+			smp_mb();
+			break;
+		}
+
+		new = cur | I_VERSION_QUERIED;
+	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
+	return cur >> I_VERSION_QUERIED_SHIFT;
+}
+EXPORT_SYMBOL(inode_query_iversion);
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index e27bd4f55d84..6755d8b4f20b 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
 	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
-/**
- * inode_query_iversion - read i_version for later use
- * @inode: inode from which i_version should be read
- *
- * Read the inode i_version counter. This should be used by callers that wish
- * to store the returned i_version for later comparison. This will guarantee
- * that a later query of the i_version will result in a different value if
- * anything has changed.
- *
- * In this implementation, we fetch the current value, set the QUERIED flag and
- * then try to swap it into place with a cmpxchg, if it wasn't already set. If
- * that fails, we try again with the newly fetched value from the cmpxchg.
- */
-static inline u64
-inode_query_iversion(struct inode *inode)
-{
-	u64 cur, new;
-
-	cur = inode_peek_iversion_raw(inode);
-	do {
-		/* If flag is already set, then no need to swap */
-		if (cur & I_VERSION_QUERIED) {
-			/*
-			 * This barrier (and the implicit barrier in the
-			 * cmpxchg below) pairs with the barrier in
-			 * inode_maybe_inc_iversion().
-			 */
-			smp_mb();
-			break;
-		}
-
-		new = cur | I_VERSION_QUERIED;
-	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
-	return cur >> I_VERSION_QUERIED_SHIFT;
-}
-
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime:
@@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
 	return chattr;
 }
 
+u64 inode_query_iversion(struct inode *inode);
+
 /**
  * inode_eq_iversion_raw - check whether the raw i_version counter has changed
  * @inode: inode to check
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-03 23:10   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

The i_version field in the kernel has had different semantics over
the decades, but NFSv4 has certain expectations. Update the comments
in iversion.h to describe when the i_version must change.

Cc: Colin Walters <walters@verbum.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 6755d8b4f20b..9925cac1fa94 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,14 @@
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear larger to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
+ *
+ * An explicit change is one that would ordinarily result in a change to the
+ * inode status change time (aka ctime). i_version must appear to change, even
+ * if the ctime does not (since the whole point is to avoid missing updates due
+ * to timestamp granularity). If POSIX mandates that the ctime must change due
+ * to an operation, then the i_version counter must be incremented as well.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-03 23:14   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 4/9] nfs: report the inode version in getattr if requested Jeff Layton
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

From: Jeff Layton <jlayton@redhat.com>

The NFS server has a lot of special handling for different types of
change attribute access, depending on what sort of inode we have. In
most cases, it's doing a getattr anyway and then fetching that value
after the fact.

Rather that do that, add a new STATX_VERSION flag that is a kernel-only
symbol (for now). If requested and getattr can implement it, it can fill
out this field. For IS_I_VERSION inodes, add a generic implementation in
vfs_getattr_nosec. Take care to mask STATX_VERSION off in requests from
userland and in the result mask.

Eventually if we decide to make this available to userland, we can just
designate a field for it in struct statx, and move the STATX_VERSION
definition to the uapi header.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c            | 17 +++++++++++++++--
 include/linux/stat.h |  9 +++++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index a7930d744483..e7f8cd4b24e1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -17,6 +17,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/compat.h>
+#include <linux/iversion.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -118,6 +119,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
 				  STATX_ATTR_DAX);
 
+	if ((request_mask & STATX_VERSION) && IS_I_VERSION(inode)) {
+		stat->result_mask |= STATX_VERSION;
+		stat->version = inode_query_iversion(inode);
+	}
+
 	mnt_userns = mnt_user_ns(path->mnt);
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(mnt_userns, path, stat,
@@ -587,9 +593,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 
 	memset(&tmp, 0, sizeof(tmp));
 
-	tmp.stx_mask = stat->result_mask;
+	/* STATX_VERSION is kernel-only for now */
+	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
 	tmp.stx_blksize = stat->blksize;
-	tmp.stx_attributes = stat->attributes;
+	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
+	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
 	tmp.stx_nlink = stat->nlink;
 	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
 	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -628,6 +636,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
 	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
 		return -EINVAL;
 
+	/* STATX_VERSION is kernel-only for now. Ignore requests
+	 * from userland.
+	 */
+	mask &= ~STATX_VERSION;
+
 	error = vfs_statx(dfd, filename, flags, &stat, mask);
 	if (error)
 		return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index ff277ced50e9..4e9428d86a3a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -52,6 +52,15 @@ struct kstat {
 	u64		mnt_id;
 	u32		dio_mem_align;
 	u32		dio_offset_align;
+	u64		version;
 };
 
+/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
+
+/* mask values */
+#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
+
+/* file attribute values */
+#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
+
 #endif
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 4/9] nfs: report the inode version in getattr if requested
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (2 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-03 23:29   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 5/9] ceph: " Jeff Layton
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Allow NFS to report the i_version in getattr requests. Since the cost to
fetch it is relatively cheap, do it unconditionally and just set the
flag if it looks like it's valid. Also, conditionally enable the
MONOTONIC flag when the server reports its change attr type as such.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/inode.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bea7c005119c..5cb7017e5089 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
 		reply_mask |= STATX_UID | STATX_GID;
 	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
 		reply_mask |= STATX_BLOCKS;
+	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
+		reply_mask |= STATX_VERSION;
 	return reply_mask;
 }
 
@@ -848,7 +850,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
 			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
-			STATX_INO | STATX_SIZE | STATX_BLOCKS;
+			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_VERSION;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
 		if (readdirplus_enabled)
@@ -877,7 +879,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	/* Is the user requesting attributes that might need revalidation? */
 	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
 					STATX_MTIME|STATX_UID|STATX_GID|
-					STATX_SIZE|STATX_BLOCKS)))
+					STATX_SIZE|STATX_BLOCKS|STATX_VERSION)))
 		goto out_no_revalidate;
 
 	/* Check whether the cached attributes are stale */
@@ -915,6 +917,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+	stat->version = inode_peek_iversion_raw(inode);
+	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
+	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
+		stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
 	if (S_ISDIR(inode->i_mode))
 		stat->blksize = NFS_SERVER(inode)->dtsize;
 out:
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 5/9] ceph: report the inode version in getattr if requested
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (3 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 4/9] nfs: report the inode version in getattr if requested Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

When getattr requests the STX_VERSION, request the full gamut of caps
(similarly to how ctime is handled). When the change attribute seems to
be valid, return it in the ino_version field and set the flag in the
reply mask. Also, unconditionally enable STATX_ATTR_VERSION_MONOTONIC.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42351d7a0dd6..bcab855bf1ae 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2415,10 +2415,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 {
 	int mask = 0;
 
-	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME))
+	if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME|STATX_VERSION))
 		mask |= CEPH_CAP_AUTH_SHARED;
 
-	if (want & (STATX_NLINK|STATX_CTIME)) {
+	if (want & (STATX_NLINK|STATX_CTIME|STATX_VERSION)) {
 		/*
 		 * The link count for directories depends on inode->i_subdirs,
 		 * and that is only updated when Fs caps are held.
@@ -2429,11 +2429,10 @@ static int statx_to_caps(u32 want, umode_t mode)
 			mask |= CEPH_CAP_LINK_SHARED;
 	}
 
-	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|
-		    STATX_BLOCKS))
+	if (want & (STATX_ATIME|STATX_MTIME|STATX_CTIME|STATX_SIZE|STATX_BLOCKS|STATX_VERSION))
 		mask |= CEPH_CAP_FILE_SHARED;
 
-	if (want & (STATX_CTIME))
+	if (want & (STATX_CTIME|STATX_VERSION))
 		mask |= CEPH_CAP_XATTR_SHARED;
 
 	return mask;
@@ -2475,6 +2474,11 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		valid_mask |= STATX_BTIME;
 	}
 
+	if (request_mask & STATX_VERSION) {
+		stat->version = inode_peek_iversion_raw(inode);
+		valid_mask |= STATX_VERSION;
+	}
+
 	if (ceph_snap(inode) == CEPH_NOSNAP)
 		stat->dev = inode->i_sb->s_dev;
 	else
@@ -2498,6 +2502,8 @@ int ceph_getattr(struct user_namespace *mnt_userns, const struct path *path,
 			stat->nlink = 1 + 1 + ci->i_subdirs;
 	}
 
+	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
+	stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
 	stat->result_mask = request_mask & valid_mask;
 	return err;
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (4 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 5/9] ceph: " Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-09-30 14:34   ` Chuck Lever III
  2022-10-03 23:39   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 7/9] vfs: expose STATX_VERSION to userland Jeff Layton
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

Now that we can call into vfs_getattr to get the i_version field, use
that facility to fetch it instead of doing it in nfsd4_change_attribute.

Neil also pointed out recently that IS_I_VERSION directory operations
are always logged, and so we only need to mitigate the rollback problem
on regular files. Also, we don't need to factor in the ctime when
reexporting NFS or Ceph.

Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
with a v4 request. Then, instead of looking at IS_I_VERSION when
generating the change attr, look at the result mask and only use it if
STATX_VERSION is set. With this change, we can drop the fetch_iversion
export operation as well.

Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
in the ctime if it's a regular file and the fs doesn't advertise
STATX_ATTR_VERSION_MONOTONIC.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfs/export.c          |  7 -------
 fs/nfsd/nfs4xdr.c        |  4 +++-
 fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h          | 29 +----------------------------
 fs/nfsd/vfs.h            |  7 ++++++-
 include/linux/exportfs.h |  1 -
 6 files changed, 50 insertions(+), 38 deletions(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 01596f2d0a1e..1a9d5aa51dfb 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
 	return parent;
 }
 
-static u64 nfs_fetch_iversion(struct inode *inode)
-{
-	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
-	return inode_peek_iversion_raw(inode);
-}
-
 const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.fetch_iversion = nfs_fetch_iversion,
 	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
 		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
 		EXPORT_OP_NOATOMIC_ATTR,
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 1e9690a061ec..779c009314c6 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			goto out;
 	}
 
-	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
+	err = vfs_getattr(&path, &stat,
+			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
+			  AT_STATX_SYNC_AS_STAT);
 	if (err)
 		goto out_nfserr;
 	if (!(stat.result_mask & STATX_BTIME))
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index a5b71526cee0..9168bc657378 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
 		stat.mtime = inode->i_mtime;
 		stat.ctime = inode->i_ctime;
 		stat.size  = inode->i_size;
+		if (v4 && IS_I_VERSION(inode)) {
+			stat.version = inode_query_iversion(inode);
+			stat.result_mask |= STATX_VERSION;
+		}
 	}
 	if (v4)
 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
@@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
 	if (err) {
 		fhp->fh_post_saved = false;
 		fhp->fh_post_attr.ctime = inode->i_ctime;
+		if (v4 && IS_I_VERSION(inode))
+			fhp->fh_post_attr.version = inode_query_iversion(inode);
 	} else
 		fhp->fh_post_saved = true;
 	if (v4)
@@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
 		return FSIDSOURCE_UUID;
 	return FSIDSOURCE_DEV;
 }
+
+/*
+ * We could use i_version alone as the change attribute.  However, i_version
+ * can go backwards on a regular file after an unclean shutdown.  On its own
+ * that doesn't necessarily cause a problem, but if i_version goes backwards
+ * and then is incremented again it could reuse a value that was previously
+ * used before boot, and a client who queried the two values might incorrectly
+ * assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as long as
+ * time doesn't go backwards we never reuse an old value. If the filesystem
+ * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
+ *
+ * We only need to do this for regular files as well. For directories, we
+ * assume that the new change attr is always logged to stable storage in some
+ * fashion before the results can be seen.
+ */
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
+{
+	u64 chattr;
+
+	if (stat->result_mask & STATX_VERSION) {
+		chattr = stat->version;
+
+		if (S_ISREG(inode->i_mode) &&
+		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
+			chattr += (u64)stat->ctime.tv_sec << 30;
+			chattr += stat->ctime.tv_nsec;
+		}
+	} else {
+		chattr = time_to_chattr(&stat->ctime);
+	}
+	return chattr;
+}
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index c3ae6414fc5c..4c223a7a91d4 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
-static inline u64 nfsd4_change_attribute(struct kstat *stat,
-					 struct inode *inode)
-{
-	if (inode->i_sb->s_export_op->fetch_iversion)
-		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
-		return time_to_chattr(&stat->ctime);
-}
-
+u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
 extern void fh_fill_pre_attrs(struct svc_fh *fhp);
 extern void fh_fill_post_attrs(struct svc_fh *fhp);
 extern void fh_fill_both_attrs(struct svc_fh *fhp);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index c95cd414b4bb..a905f59481ee 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
 
 static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 {
+	u32 request_mask = STATX_BASIC_STATS;
 	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
 			 .dentry = fh->fh_dentry};
-	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
+
+	if (fh->fh_maxsize == NFS4_FHSIZE)
+		request_mask |= (STATX_BTIME | STATX_VERSION);
+
+	return nfserrno(vfs_getattr(&p, stat, request_mask,
 				    AT_STATX_SYNC_AS_STAT));
 }
 
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index fe848901fcc3..9f4d4bcbf251 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,7 +213,6 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
-	u64 (*fetch_iversion)(struct inode *);
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
 #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 7/9] vfs: expose STATX_VERSION to userland
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (5 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-03 23:42   ` NeilBrown
  2022-09-30 11:18 ` [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter Jeff Layton
  2022-09-30 11:18 ` [PATCH v6 9/9] ext4: update times after I/O in write codepaths Jeff Layton
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

From: Jeff Layton <jlayton@redhat.com>

Claim one of the spare fields in struct statx to hold a 64-bit inode
version attribute. When userland requests STATX_VERSION, copy the
value from the kstat struct there, and stop masking off
STATX_ATTR_VERSION_MONOTONIC.

Update the test-statx sample program to output the change attr and
MountId.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/stat.c                 | 12 +++---------
 include/linux/stat.h      |  9 ---------
 include/uapi/linux/stat.h |  6 ++++--
 samples/vfs/test-statx.c  |  8 ++++++--
 4 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index e7f8cd4b24e1..8396c372022f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 
 	memset(&tmp, 0, sizeof(tmp));
 
-	/* STATX_VERSION is kernel-only for now */
-	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
+	tmp.stx_mask = stat->result_mask;
 	tmp.stx_blksize = stat->blksize;
-	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
-	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
+	tmp.stx_attributes = stat->attributes;
 	tmp.stx_nlink = stat->nlink;
 	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
 	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
@@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_mnt_id = stat->mnt_id;
 	tmp.stx_dio_mem_align = stat->dio_mem_align;
 	tmp.stx_dio_offset_align = stat->dio_offset_align;
+	tmp.stx_version = stat->version;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
@@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
 	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
 		return -EINVAL;
 
-	/* STATX_VERSION is kernel-only for now. Ignore requests
-	 * from userland.
-	 */
-	mask &= ~STATX_VERSION;
-
 	error = vfs_statx(dfd, filename, flags, &stat, mask);
 	if (error)
 		return error;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 4e9428d86a3a..69c79e4fd1b1 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -54,13 +54,4 @@ struct kstat {
 	u32		dio_offset_align;
 	u64		version;
 };
-
-/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
-
-/* mask values */
-#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
-
-/* file attribute values */
-#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
-
 #endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..4a0a1f27c059 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,8 @@ struct statx {
 	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u64	stx_version; /* Inode change attribute */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -154,6 +155,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_VERSION		0x00004000U	/* Want/got stx_version */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -189,6 +191,6 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
-
+#define STATX_ATTR_VERSION_MONOTONIC	0x00400000 /* stx_version increases w/ every change */
 
 #endif /* _UAPI_LINUX_STAT_H */
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..bdbc371c9774 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
 	printf("Device: %-15s", buffer);
 	if (stx->stx_mask & STATX_INO)
 		printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
+	if (stx->stx_mask & STATX_MNT_ID)
+		printf(" MountId: %llx", stx->stx_mnt_id);
 	if (stx->stx_mask & STATX_NLINK)
 		printf(" Links: %-5u", stx->stx_nlink);
 	if (stx->stx_mask & STATX_TYPE) {
@@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
 	if (stx->stx_mask & STATX_CTIME)
 		print_time("Change: ", &stx->stx_ctime);
 	if (stx->stx_mask & STATX_BTIME)
-		print_time(" Birth: ", &stx->stx_btime);
+		print_time("Birth: ", &stx->stx_btime);
+	if (stx->stx_mask & STATX_VERSION)
+		printf("Inode Version: 0x%llx\n", stx->stx_version);
 
 	if (stx->stx_attributes_mask) {
 		unsigned char bits, mbits;
@@ -218,7 +222,7 @@ int main(int argc, char **argv)
 	struct statx stx;
 	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
 
-	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
 
 	for (argv++; *argv; argv++) {
 		if (strcmp(*argv, "-F") == 0) {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (6 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 7/9] vfs: expose STATX_VERSION to userland Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  2022-10-02  7:08   ` Amir Goldstein
  2022-09-30 11:18 ` [PATCH v6 9/9] ext4: update times after I/O in write codepaths Jeff Layton
  8 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

The c/mtime and i_version currently get updated before the data is
copied (or a DIO write is issued), which is problematic for NFS.

READ+GETATTR can race with a write (even a local one) in such a way as
to make the client associate the state of the file with the wrong change
attribute. That association can persist indefinitely if the file sees no
further changes.

Move the setting of times to the bottom of the function in
__generic_file_write_iter and only update it if something was
successfully written.

If the time update fails, log a warning once, but don't fail the write.
All of the existing callers use update_time functions that don't fail,
so we should never trip this.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 mm/filemap.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 15800334147b..72c0ceb75176 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (err)
 		goto out;
 
-	err = file_update_time(file);
-	if (err)
-		goto out;
-
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		loff_t pos, endbyte;
 
@@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			iocb->ki_pos += written;
 	}
 out:
+	if (written > 0) {
+		err = file_update_time(file);
+		/*
+		 * There isn't much we can do at this point if updating the
+		 * times fails after a successful write. The times and i_version
+		 * should still be updated in the inode, and it should still be
+		 * marked dirty, so hopefully the next inode update will catch it.
+		 * Log a warning once so we have a record that something untoward
+		 * has occurred.
+		 */
+		WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
+	}
+
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v6 9/9] ext4: update times after I/O in write codepaths
  2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
                   ` (7 preceding siblings ...)
  2022-09-30 11:18 ` [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter Jeff Layton
@ 2022-09-30 11:18 ` Jeff Layton
  8 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-09-30 11:18 UTC (permalink / raw)
  To: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer
  Cc: linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

The times currently get updated before the data is copied (or the DIO is
issued) which is problematic for NFSv4. A READ+GETATTR could race with a
write in such a way to make the client associate the state of the file
with the wrong change attribute, and that association could persist
indefinitely if the file sees no further changes.

For this reason, it's better to bump the times and change attribute
after the data has been copied or the DIO write issued.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ext4/file.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 109d07629f81..1fa8e0239856 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -246,7 +246,7 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (count <= 0)
 		return count;
 
-	ret = file_modified(iocb->ki_filp);
+	ret = file_remove_privs(iocb->ki_filp);
 	if (ret)
 		return ret;
 	return count;
@@ -269,7 +269,11 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = generic_perform_write(iocb, from);
 	current->backing_dev_info = NULL;
-
+	if (ret > 0) {
+		ssize_t ret2 = file_update_time(iocb->ki_filp);
+		if (ret2)
+			ret = ret2;
+	}
 out:
 	inode_unlock(inode);
 	if (likely(ret > 0)) {
@@ -455,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 		goto restart;
 	}
 
-	ret = file_modified(file);
+	ret = file_remove_privs(file);
 	if (ret < 0)
 		goto out;
 
@@ -572,6 +576,11 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
 
+	if (ret > 0) {
+		ssize_t ret2 = file_update_time(iocb->ki_filp);
+		if (ret2)
+			ret = ret2;
+	}
 out:
 	if (ilock_shared)
 		inode_unlock_shared(inode);
@@ -653,6 +662,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+	if (ret > 0) {
+		ssize_t ret2 = file_update_time(iocb->ki_filp);
+		if (ret2)
+			ret = ret2;
+	}
 out:
 	inode_unlock(inode);
 	if (ret > 0)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
@ 2022-09-30 14:34   ` Chuck Lever III
  2022-09-30 22:32     ` Dave Chinner
  2022-10-03 23:39   ` NeilBrown
  1 sibling, 1 reply; 35+ messages in thread
From: Chuck Lever III @ 2022-09-30 14:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Theodore Ts'o, adilger.kernel, Darrick J. Wong, Dave Chinner,
	Trond Myklebust, Neil Brown, Al Viro, zohar, xiubli,
	Lukas Czerner, Jan Kara, Bruce Fields, Christian Brauner,
	fweimer, linux-btrfs, linux-fsdevel, Linux Kernel Mailing List,
	ceph-devel, linux-ext4, Linux NFS Mailing List, linux-xfs



> On Sep 30, 2022, at 7:18 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> Now that we can call into vfs_getattr to get the i_version field, use
> that facility to fetch it instead of doing it in nfsd4_change_attribute.
> 
> Neil also pointed out recently that IS_I_VERSION directory operations
> are always logged,

^logged^synchronous maybe?

> and so we only need to mitigate the rollback problem
> on regular files. Also, we don't need to factor in the ctime when
> reexporting NFS or Ceph.
> 
> Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> with a v4 request. Then, instead of looking at IS_I_VERSION when
> generating the change attr, look at the result mask and only use it if
> STATX_VERSION is set. With this change, we can drop the fetch_iversion
> export operation as well.
> 
> Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> in the ctime if it's a regular file and the fs doesn't advertise
> STATX_ATTR_VERSION_MONOTONIC.

This patch is doing some heavy lifting. I'd prefer to see it split
up to improve bisectability:

1. Move nfsd4_change_attribute() to fs/nfsd/nfsfh.c -- no functional changes

2. Change nfsd4_change_attribute() and vfs_getattr() call sites

3. Eliminate .fetch_iversion


> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> fs/nfs/export.c          |  7 -------
> fs/nfsd/nfs4xdr.c        |  4 +++-
> fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
> fs/nfsd/nfsfh.h          | 29 +----------------------------
> fs/nfsd/vfs.h            |  7 ++++++-
> include/linux/exportfs.h |  1 -
> 6 files changed, 50 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 01596f2d0a1e..1a9d5aa51dfb 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> 	return parent;
> }
> 
> -static u64 nfs_fetch_iversion(struct inode *inode)
> -{
> -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> -	return inode_peek_iversion_raw(inode);
> -}
> -
> const struct export_operations nfs_export_ops = {
> 	.encode_fh = nfs_encode_fh,
> 	.fh_to_dentry = nfs_fh_to_dentry,
> 	.get_parent = nfs_get_parent,
> -	.fetch_iversion = nfs_fetch_iversion,
> 	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> 		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
> 		EXPORT_OP_NOATOMIC_ATTR,
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 1e9690a061ec..779c009314c6 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> 			goto out;
> 	}
> 
> -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> +	err = vfs_getattr(&path, &stat,
> +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> +			  AT_STATX_SYNC_AS_STAT);
> 	if (err)
> 		goto out_nfserr;
> 	if (!(stat.result_mask & STATX_BTIME))
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index a5b71526cee0..9168bc657378 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> 		stat.mtime = inode->i_mtime;
> 		stat.ctime = inode->i_ctime;
> 		stat.size  = inode->i_size;
> +		if (v4 && IS_I_VERSION(inode)) {
> +			stat.version = inode_query_iversion(inode);
> +			stat.result_mask |= STATX_VERSION;
> +		}
> 	}
> 	if (v4)
> 		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> 	if (err) {
> 		fhp->fh_post_saved = false;
> 		fhp->fh_post_attr.ctime = inode->i_ctime;
> +		if (v4 && IS_I_VERSION(inode))
> +			fhp->fh_post_attr.version = inode_query_iversion(inode);

Since fh_fill_post_attrs() calls nfsd4_change_attribute() as its
next step, don't you need to set STATX_VERSION here too?

I kind of hate the way we're handling both NFSv3 and NFSv4 post_attr
here, it's pretty difficult to reason about.


> 	} else
> 		fhp->fh_post_saved = true;
> 	if (v4)
> @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> 		return FSIDSOURCE_UUID;
> 	return FSIDSOURCE_DEV;
> }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However, i_version
> + * can go backwards on a regular file after an unclean shutdown.  On its own
> + * that doesn't necessarily cause a problem, but if i_version goes backwards
> + * and then is incremented again it could reuse a value that was previously
> + * used before boot, and a client who queried the two values might incorrectly
> + * assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as long as
> + * time doesn't go backwards we never reuse an old value. If the filesystem
> + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> + *
> + * We only need to do this for regular files as well. For directories, we
> + * assume that the new change attr is always logged to stable storage in some
> + * fashion before the results can be seen.
> + */

Let's make this a kdoc-style comment. Thanks!


> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)

I'd prefer to use "const" pointers for both function parameters.


> +{
> +	u64 chattr;
> +
> +	if (stat->result_mask & STATX_VERSION) {
> +		chattr = stat->version;
> +
> +		if (S_ISREG(inode->i_mode) &&
> +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> +			chattr += (u64)stat->ctime.tv_sec << 30;
> +			chattr += stat->ctime.tv_nsec;
> +		}
> +	} else {
> +		chattr = time_to_chattr(&stat->ctime);
> +	}
> +	return chattr;
> +}
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index c3ae6414fc5c..4c223a7a91d4 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
> 	fhp->fh_pre_saved = false;
> }
> 
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	if (inode->i_sb->s_export_op->fetch_iversion)
> -		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> -		return time_to_chattr(&stat->ctime);
> -}
> -
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
> extern void fh_fill_pre_attrs(struct svc_fh *fhp);
> extern void fh_fill_post_attrs(struct svc_fh *fhp);
> extern void fh_fill_both_attrs(struct svc_fh *fhp);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index c95cd414b4bb..a905f59481ee 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
> 
> static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> {
> +	u32 request_mask = STATX_BASIC_STATS;

Reverse christmas tree, please.


> 	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> 			 .dentry = fh->fh_dentry};
> -	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
> +
> +	if (fh->fh_maxsize == NFS4_FHSIZE)

Would it hurt to set BTIME and VERSION unconditionally?


> +		request_mask |= (STATX_BTIME | STATX_VERSION);
> +
> +	return nfserrno(vfs_getattr(&p, stat, request_mask,
> 				    AT_STATX_SYNC_AS_STAT));
> }
> 
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index fe848901fcc3..9f4d4bcbf251 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,7 +213,6 @@ struct export_operations {
> 			  bool write, u32 *device_generation);
> 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> 			     int nr_iomaps, struct iattr *iattr);
> -	u64 (*fetch_iversion)(struct inode *);
> #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
> #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
> #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
> -- 
> 2.37.3
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-09-30 14:34   ` Chuck Lever III
@ 2022-09-30 22:32     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2022-09-30 22:32 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jeff Layton, Theodore Ts'o, adilger.kernel, Darrick J. Wong,
	Trond Myklebust, Neil Brown, Al Viro, zohar, xiubli,
	Lukas Czerner, Jan Kara, Bruce Fields, Christian Brauner,
	fweimer, linux-btrfs, linux-fsdevel, Linux Kernel Mailing List,
	ceph-devel, linux-ext4, Linux NFS Mailing List, linux-xfs

On Fri, Sep 30, 2022 at 02:34:51PM +0000, Chuck Lever III wrote:
> 
> 
> > On Sep 30, 2022, at 7:18 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > Now that we can call into vfs_getattr to get the i_version field, use
> > that facility to fetch it instead of doing it in nfsd4_change_attribute.
> > 
> > Neil also pointed out recently that IS_I_VERSION directory operations
> > are always logged,
> 
> ^logged^synchronous maybe?

A pedantic note, but I think necessary because so many people still
get this wrong when it comes to filesystems and IO: synchronous !=
persistent.

Ext4 and XFS both use *asynchronous journalling* - they journal
changes first to memory buffers, and only make those recorded
changes persistent when they hit internal checkpoint thresholds or
something external requires persistence to be guaranteed.

->commit_metadata is the operation filesystems provide the NFS
server to *guarantee persistence*. This allows filesystems to use
asynchronous journalling for most operations, right up to the point
the NFS server requires a change to be persistent. "synchronous
operation" is a side effect of guaranteeing persistence on some
filesytems and storage media, whereas "synchronous operation"
does not provide any guarantee of persistence...

IOWs, please talk about persistence guarantees the NFS server
application requires and implements, not about the operations (or
the nature of the operations) that may be performed by the
underlying filesystems to provide that persistence guarantee.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-09-30 11:18 ` [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter Jeff Layton
@ 2022-10-02  7:08   ` Amir Goldstein
  2022-10-03 13:01     ` Jeff Layton
  0 siblings, 1 reply; 35+ messages in thread
From: Amir Goldstein @ 2022-10-02  7:08 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> The c/mtime and i_version currently get updated before the data is
> copied (or a DIO write is issued), which is problematic for NFS.
>
> READ+GETATTR can race with a write (even a local one) in such a way as
> to make the client associate the state of the file with the wrong change
> attribute. That association can persist indefinitely if the file sees no
> further changes.
>
> Move the setting of times to the bottom of the function in
> __generic_file_write_iter and only update it if something was
> successfully written.
>

This solution is wrong for several reasons:

1. There is still file_update_time() in ->page_mkwrite() so you haven't
    solved the problem completely
2. The other side of the coin is that post crash state is more likely to end
    up data changes without mtime/ctime change

If I read the problem description correctly, then a solution that invalidates
the NFS cache before AND after the write would be acceptable. Right?
Would an extra i_version bump after the write solve the race?

> If the time update fails, log a warning once, but don't fail the write.
> All of the existing callers use update_time functions that don't fail,
> so we should never trip this.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  mm/filemap.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 15800334147b..72c0ceb75176 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>         if (err)
>                 goto out;
>
> -       err = file_update_time(file);
> -       if (err)
> -               goto out;
> -
>         if (iocb->ki_flags & IOCB_DIRECT) {
>                 loff_t pos, endbyte;
>
> @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                         iocb->ki_pos += written;
>         }
>  out:
> +       if (written > 0) {
> +               err = file_update_time(file);
> +               /*
> +                * There isn't much we can do at this point if updating the
> +                * times fails after a successful write. The times and i_version
> +                * should still be updated in the inode, and it should still be
> +                * marked dirty, so hopefully the next inode update will catch it.
> +                * Log a warning once so we have a record that something untoward
> +                * has occurred.
> +                */
> +               WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);

pr_warn_once() please - this is not a programming assertion.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-10-02  7:08   ` Amir Goldstein
@ 2022-10-03 13:01     ` Jeff Layton
  2022-10-03 13:52       ` Amir Goldstein
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-10-03 13:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > The c/mtime and i_version currently get updated before the data is
> > copied (or a DIO write is issued), which is problematic for NFS.
> > 
> > READ+GETATTR can race with a write (even a local one) in such a way as
> > to make the client associate the state of the file with the wrong change
> > attribute. That association can persist indefinitely if the file sees no
> > further changes.
> > 
> > Move the setting of times to the bottom of the function in
> > __generic_file_write_iter and only update it if something was
> > successfully written.
> > 
> 
> This solution is wrong for several reasons:
> 
> 1. There is still file_update_time() in ->page_mkwrite() so you haven't
>     solved the problem completely

Right. I don't think there is a way to solve the problem vs. mmap.
Userland can write to a writeable mmap'ed page at any time and we'd
never know. We have to specifically carve out mmap as an exception here.
I'll plan to add something to the manpage patch for this.

> 2. The other side of the coin is that post crash state is more likely to end
>     up data changes without mtime/ctime change
> 

Is this really something filesystems rely on? I suppose the danger is
that some cached data gets written to disk before the write returns and
the inode on disk never gets updated.

But...isn't that a danger now? Some of the cached data could get written
out and the updated inode just never makes it to disk before a crash
(AFAIU). I'm not sure that this increases our exposure to that problem.


> If I read the problem description correctly, then a solution that invalidates
> the NFS cache before AND after the write would be acceptable. Right?
> Would an extra i_version bump after the write solve the race?
> 

I based this patch on Neil's assertion that updating the time before an
operation was pointless if we were going to do it afterward. The NFS
client only really cares about seeing it change after a write.

Doing both would be fine from a correctness standpoint, and in most
cases, the second would be a no-op anyway since a query would have to
race in between the two for that to happen.

FWIW, I think we should update the m/ctime and version at the same time.
If the version changes, then there is always the potential that a timer
tick has occurred. So, that would translate to a second call to
file_update_time in here.

The downside of bumping the times/version both before and after is that
these are hot codepaths, and we'd be adding extra operations there. Even
in the case where nothing has changed, we'd have to call
inode_needs_update_time a second time for every write. Is that worth the
cost?

> > If the time update fails, log a warning once, but don't fail the write.
> > All of the existing callers use update_time functions that don't fail,
> > so we should never trip this.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  mm/filemap.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 15800334147b..72c0ceb75176 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3812,10 +3812,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >         if (err)
> >                 goto out;
> > 
> > -       err = file_update_time(file);
> > -       if (err)
> > -               goto out;
> > -
> >         if (iocb->ki_flags & IOCB_DIRECT) {
> >                 loff_t pos, endbyte;
> > 
> > @@ -3868,6 +3864,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >                         iocb->ki_pos += written;
> >         }
> >  out:
> > +       if (written > 0) {
> > +               err = file_update_time(file);
> > +               /*
> > +                * There isn't much we can do at this point if updating the
> > +                * times fails after a successful write. The times and i_version
> > +                * should still be updated in the inode, and it should still be
> > +                * marked dirty, so hopefully the next inode update will catch it.
> > +                * Log a warning once so we have a record that something untoward
> > +                * has occurred.
> > +                */
> > +               WARN_ONCE(err, "Failed to update m/ctime after write: %ld\n", err);
> 
> pr_warn_once() please - this is not a programming assertion.
> 

ACK. I'll change that.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-10-03 13:01     ` Jeff Layton
@ 2022-10-03 13:52       ` Amir Goldstein
  2022-10-03 22:56         ` NeilBrown
  0 siblings, 1 reply; 35+ messages in thread
From: Amir Goldstein @ 2022-10-03 13:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, neilb, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > The c/mtime and i_version currently get updated before the data is
> > > copied (or a DIO write is issued), which is problematic for NFS.
> > >
> > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > to make the client associate the state of the file with the wrong change
> > > attribute. That association can persist indefinitely if the file sees no
> > > further changes.
> > >
> > > Move the setting of times to the bottom of the function in
> > > __generic_file_write_iter and only update it if something was
> > > successfully written.
> > >
> >
> > This solution is wrong for several reasons:
> >
> > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> >     solved the problem completely
>
> Right. I don't think there is a way to solve the problem vs. mmap.
> Userland can write to a writeable mmap'ed page at any time and we'd
> never know. We have to specifically carve out mmap as an exception here.
> I'll plan to add something to the manpage patch for this.
>
> > 2. The other side of the coin is that post crash state is more likely to end
> >     up data changes without mtime/ctime change
> >
>
> Is this really something filesystems rely on? I suppose the danger is
> that some cached data gets written to disk before the write returns and
> the inode on disk never gets updated.
>
> But...isn't that a danger now? Some of the cached data could get written
> out and the updated inode just never makes it to disk before a crash
> (AFAIU). I'm not sure that this increases our exposure to that problem.
>
>

You are correct that that danger exists, but it only exists for overwriting
to allocated blocks.

For writing to new blocks, mtime change is recorded in transaction
before the block mapping is recorded in transaction so there is no
danger in this case (before your patch).

Also, observing size change without observing mtime change
after crash seems like a very bad outcome that may be possible
after your change.

These are just a few cases that I could think of, they may be filesystem
dependent, but my gut feeling is that if you remove the time update before
the operation, that has been like that forever, a lot of s#!t is going to float
for various filesystems and applications.

And it is not one of those things that are discovered  during rc or even
stable kernel testing - they are discovered much later when users start to
realize their applications got bogged up after crash, so it feels like to me
like playing with fire.

> > If I read the problem description correctly, then a solution that invalidates
> > the NFS cache before AND after the write would be acceptable. Right?
> > Would an extra i_version bump after the write solve the race?
> >
>
> I based this patch on Neil's assertion that updating the time before an
> operation was pointless if we were going to do it afterward. The NFS
> client only really cares about seeing it change after a write.
>

Pointless to NFS client maybe.
Whether or not this is not changing user behavior for other applications
is up to you to prove and I doubt that you can prove it because I doubt
that it is true.

> Doing both would be fine from a correctness standpoint, and in most
> cases, the second would be a no-op anyway since a query would have to
> race in between the two for that to happen.
>
> FWIW, I think we should update the m/ctime and version at the same time.
> If the version changes, then there is always the potential that a timer
> tick has occurred. So, that would translate to a second call to
> file_update_time in here.
>
> The downside of bumping the times/version both before and after is that
> these are hot codepaths, and we'd be adding extra operations there. Even
> in the case where nothing has changed, we'd have to call
> inode_needs_update_time a second time for every write. Is that worth the
> cost?

Is there a practical cost for iversion bump AFTER write as I suggested?
If you NEED m/ctime update AFTER write and iversion update is not enough
then I did not understand from your commit message why that is.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-10-03 13:52       ` Amir Goldstein
@ 2022-10-03 22:56         ` NeilBrown
  2022-10-05 16:40           ` Jeff Layton
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2022-10-03 22:56 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jeff Layton, tytso, adilger.kernel, djwong, david, trondmy, viro,
	zohar, xiubli, chuck.lever, lczerner, jack, bfields, brauner,
	fweimer, linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel,
	linux-ext4, linux-nfs, linux-xfs

On Tue, 04 Oct 2022, Amir Goldstein wrote:
> On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > >
> > > > The c/mtime and i_version currently get updated before the data is
> > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > >
> > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > to make the client associate the state of the file with the wrong change
> > > > attribute. That association can persist indefinitely if the file sees no
> > > > further changes.
> > > >
> > > > Move the setting of times to the bottom of the function in
> > > > __generic_file_write_iter and only update it if something was
> > > > successfully written.
> > > >
> > >
> > > This solution is wrong for several reasons:
> > >
> > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > >     solved the problem completely
> >
> > Right. I don't think there is a way to solve the problem vs. mmap.
> > Userland can write to a writeable mmap'ed page at any time and we'd
> > never know. We have to specifically carve out mmap as an exception here.
> > I'll plan to add something to the manpage patch for this.
> >
> > > 2. The other side of the coin is that post crash state is more likely to end
> > >     up data changes without mtime/ctime change
> > >
> >
> > Is this really something filesystems rely on? I suppose the danger is
> > that some cached data gets written to disk before the write returns and
> > the inode on disk never gets updated.
> >
> > But...isn't that a danger now? Some of the cached data could get written
> > out and the updated inode just never makes it to disk before a crash
> > (AFAIU). I'm not sure that this increases our exposure to that problem.
> >
> >
> 
> You are correct that that danger exists, but it only exists for overwriting
> to allocated blocks.
> 
> For writing to new blocks, mtime change is recorded in transaction
> before the block mapping is recorded in transaction so there is no
> danger in this case (before your patch).
> 
> Also, observing size change without observing mtime change
> after crash seems like a very bad outcome that may be possible
> after your change.
> 
> These are just a few cases that I could think of, they may be filesystem
> dependent, but my gut feeling is that if you remove the time update before
> the operation, that has been like that forever, a lot of s#!t is going to float
> for various filesystems and applications.
> 
> And it is not one of those things that are discovered  during rc or even
> stable kernel testing - they are discovered much later when users start to
> realize their applications got bogged up after crash, so it feels like to me
> like playing with fire.
> 
> > > If I read the problem description correctly, then a solution that invalidates
> > > the NFS cache before AND after the write would be acceptable. Right?
> > > Would an extra i_version bump after the write solve the race?
> > >
> >
> > I based this patch on Neil's assertion that updating the time before an
> > operation was pointless if we were going to do it afterward. The NFS
> > client only really cares about seeing it change after a write.
> >
> 
> Pointless to NFS client maybe.
> Whether or not this is not changing user behavior for other applications
> is up to you to prove and I doubt that you can prove it because I doubt
> that it is true.
> 
> > Doing both would be fine from a correctness standpoint, and in most
> > cases, the second would be a no-op anyway since a query would have to
> > race in between the two for that to happen.
> >
> > FWIW, I think we should update the m/ctime and version at the same time.
> > If the version changes, then there is always the potential that a timer
> > tick has occurred. So, that would translate to a second call to
> > file_update_time in here.
> >
> > The downside of bumping the times/version both before and after is that
> > these are hot codepaths, and we'd be adding extra operations there. Even
> > in the case where nothing has changed, we'd have to call
> > inode_needs_update_time a second time for every write. Is that worth the
> > cost?
> 
> Is there a practical cost for iversion bump AFTER write as I suggested?
> If you NEED m/ctime update AFTER write and iversion update is not enough
> then I did not understand from your commit message why that is.
> 
> Thanks,
> Amir.
> 

Maybe we should split i_version updates from ctime updates.

While it isn't true that ctime updates have happened before the write
"forever" it has been true since 2.3.43[1] which is close to forever.

For ctime there doesn't appear to be a strong specification of when the
change happens, so history provides a good case for leaving it before.
For i_version we want to provide clear and unambiguous semantics.
Performing 2 updates makes the specification muddy.

So I would prefer a single update for i_version, performed after the
change becomes visible.  If that means it has to be separate from ctime,
then so be it.

NeilBrown


[1]:  https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c
  2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
@ 2022-10-03 23:05   ` NeilBrown
  0 siblings, 0 replies; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:05 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Fri, 30 Sep 2022, Jeff Layton wrote:
> There's no need to have such a large function forcibly inlined.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Reviewed-by: NeilBrown <neilb@suse.de>

you could possible make the "I_VERSION_QUERIED already set" case inline
and the "needs to be set" case out-of-line, but it probably isn't worth
it. 

Thanks,
NeilBrown


> ---
>  fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/iversion.h | 38 ++------------------------------------
>  2 files changed, 38 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 682d56345a1c..5ae81466a422 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1566,3 +1566,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
>  	return true;
>  }
>  EXPORT_SYMBOL(inode_maybe_inc_iversion);
> +
> +/**
> + * inode_query_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison. This will guarantee
> + * that a later query of the i_version will result in a different value if
> + * anything has changed.
> + *
> + * In this implementation, we fetch the current value, set the QUERIED flag and
> + * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> + * that fails, we try again with the newly fetched value from the cmpxchg.
> + */
> +u64 inode_query_iversion(struct inode *inode)
> +{
> +	u64 cur, new;
> +
> +	cur = inode_peek_iversion_raw(inode);
> +	do {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED) {
> +			/*
> +			 * This barrier (and the implicit barrier in the
> +			 * cmpxchg below) pairs with the barrier in
> +			 * inode_maybe_inc_iversion().
> +			 */
> +			smp_mb();
> +			break;
> +		}
> +
> +		new = cur | I_VERSION_QUERIED;
> +	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> +	return cur >> I_VERSION_QUERIED_SHIFT;
> +}
> +EXPORT_SYMBOL(inode_query_iversion);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e27bd4f55d84..6755d8b4f20b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
>  	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> -/**
> - * inode_query_iversion - read i_version for later use
> - * @inode: inode from which i_version should be read
> - *
> - * Read the inode i_version counter. This should be used by callers that wish
> - * to store the returned i_version for later comparison. This will guarantee
> - * that a later query of the i_version will result in a different value if
> - * anything has changed.
> - *
> - * In this implementation, we fetch the current value, set the QUERIED flag and
> - * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> - * that fails, we try again with the newly fetched value from the cmpxchg.
> - */
> -static inline u64
> -inode_query_iversion(struct inode *inode)
> -{
> -	u64 cur, new;
> -
> -	cur = inode_peek_iversion_raw(inode);
> -	do {
> -		/* If flag is already set, then no need to swap */
> -		if (cur & I_VERSION_QUERIED) {
> -			/*
> -			 * This barrier (and the implicit barrier in the
> -			 * cmpxchg below) pairs with the barrier in
> -			 * inode_maybe_inc_iversion().
> -			 */
> -			smp_mb();
> -			break;
> -		}
> -
> -		new = cur | I_VERSION_QUERIED;
> -	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> -	return cur >> I_VERSION_QUERIED_SHIFT;
> -}
> -
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:
> @@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
>  	return chattr;
>  }
>  
> +u64 inode_query_iversion(struct inode *inode);
> +
>  /**
>   * inode_eq_iversion_raw - check whether the raw i_version counter has changed
>   * @inode: inode to check
> -- 
> 2.37.3
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated
  2022-09-30 11:18 ` [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated Jeff Layton
@ 2022-10-03 23:10   ` NeilBrown
  2022-10-04  9:53     ` Jeff Layton
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:10 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Fri, 30 Sep 2022, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/iversion.h | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 6755d8b4f20b..9925cac1fa94 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,14 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear larger to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). i_version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX mandates that the ctime must change due
> + * to an operation, then the i_version counter must be incremented as well.

POSIX doesn't (that I can see) describe when the ctime changes w.r.t
when the file changes.  For i_version we do want to specify that
i_version change is not visible before the file change is visible.
So this goes beyond the POSIX mandate.  I might be worth making that
explicit.
But this patch is nonetheless an improvement, so:

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown


>   *
>   * Observers see the i_version as a 64-bit number that never decreases. If it
>   * remains the same since it was last checked, then nothing has changed in the
> -- 
> 2.37.3
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat
  2022-09-30 11:18 ` [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
@ 2022-10-03 23:14   ` NeilBrown
  0 siblings, 0 replies; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

On Fri, 30 Sep 2022, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> The NFS server has a lot of special handling for different types of
> change attribute access, depending on what sort of inode we have. In
> most cases, it's doing a getattr anyway and then fetching that value
> after the fact.
> 
> Rather that do that, add a new STATX_VERSION flag that is a kernel-only
> symbol (for now). If requested and getattr can implement it, it can fill
> out this field. For IS_I_VERSION inodes, add a generic implementation in
> vfs_getattr_nosec. Take care to mask STATX_VERSION off in requests from
> userland and in the result mask.
> 
> Eventually if we decide to make this available to userland, we can just
> designate a field for it in struct statx, and move the STATX_VERSION
> definition to the uapi header.

Above does not mention STATX_ATTR_VERSION_MONOTONIC, but it appears in
that patch - which is confusing.
But the patch is good, so

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown


> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/stat.c            | 17 +++++++++++++++--
>  include/linux/stat.h |  9 +++++++++
>  2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index a7930d744483..e7f8cd4b24e1 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -17,6 +17,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/compat.h>
> +#include <linux/iversion.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
> @@ -118,6 +119,11 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
>  				  STATX_ATTR_DAX);
>  
> +	if ((request_mask & STATX_VERSION) && IS_I_VERSION(inode)) {
> +		stat->result_mask |= STATX_VERSION;
> +		stat->version = inode_query_iversion(inode);
> +	}
> +
>  	mnt_userns = mnt_user_ns(path->mnt);
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(mnt_userns, path, stat,
> @@ -587,9 +593,11 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  
>  	memset(&tmp, 0, sizeof(tmp));
>  
> -	tmp.stx_mask = stat->result_mask;
> +	/* STATX_VERSION is kernel-only for now */
> +	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
>  	tmp.stx_blksize = stat->blksize;
> -	tmp.stx_attributes = stat->attributes;
> +	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
> +	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
>  	tmp.stx_nlink = stat->nlink;
>  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
>  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> @@ -628,6 +636,11 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
>  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
>  		return -EINVAL;
>  
> +	/* STATX_VERSION is kernel-only for now. Ignore requests
> +	 * from userland.
> +	 */
> +	mask &= ~STATX_VERSION;
> +
>  	error = vfs_statx(dfd, filename, flags, &stat, mask);
>  	if (error)
>  		return error;
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index ff277ced50e9..4e9428d86a3a 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -52,6 +52,15 @@ struct kstat {
>  	u64		mnt_id;
>  	u32		dio_mem_align;
>  	u32		dio_offset_align;
> +	u64		version;
>  };
>  
> +/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> +
> +/* mask values */
> +#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
> +
> +/* file attribute values */
> +#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> +
>  #endif
> -- 
> 2.37.3
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 4/9] nfs: report the inode version in getattr if requested
  2022-09-30 11:18 ` [PATCH v6 4/9] nfs: report the inode version in getattr if requested Jeff Layton
@ 2022-10-03 23:29   ` NeilBrown
  2022-10-04  9:43     ` Jeff Layton
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:29 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Fri, 30 Sep 2022, Jeff Layton wrote:
> Allow NFS to report the i_version in getattr requests. Since the cost to
> fetch it is relatively cheap, do it unconditionally and just set the
> flag if it looks like it's valid. Also, conditionally enable the
> MONOTONIC flag when the server reports its change attr type as such.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfs/inode.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index bea7c005119c..5cb7017e5089 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
>  		reply_mask |= STATX_UID | STATX_GID;
>  	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
>  		reply_mask |= STATX_BLOCKS;
> +	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
> +		reply_mask |= STATX_VERSION;
>  	return reply_mask;
>  }
>  
> @@ -848,7 +850,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
>  
>  	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
>  			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
> -			STATX_INO | STATX_SIZE | STATX_BLOCKS;
> +			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_VERSION;
>  
>  	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
>  		if (readdirplus_enabled)
> @@ -877,7 +879,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
>  	/* Is the user requesting attributes that might need revalidation? */
>  	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
>  					STATX_MTIME|STATX_UID|STATX_GID|
> -					STATX_SIZE|STATX_BLOCKS)))
> +					STATX_SIZE|STATX_BLOCKS|STATX_VERSION)))
>  		goto out_no_revalidate;
>  
>  	/* Check whether the cached attributes are stale */
> @@ -915,6 +917,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
>  
>  	generic_fillattr(&init_user_ns, inode, stat);
>  	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> +	stat->version = inode_peek_iversion_raw(inode);

This looks wrong.
1/ it includes the I_VERSION_QUERIED bit, which should be hidden.
2/ it doesn't set that bit.

I understand that the bit was already set when the generic code called
inode_query_iversion(), but it might have changed if we needed to
refresh the attrs.

I'm beginning to think I shouldn't have approved the 3/9 patch.  The
stat->version shouldn't be set in vfs_getattr_nosec() - maybe in
generic_fillattr(), but not a lot of point.

> +	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
> +	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
> +		stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;

So if the server tells us that the change attrs is based on time
metadata, we accept that it will be monotonic (and RFC7862 encourages
this), even though we seem to worry about timestamps going backwards
(which we know that can)...  Interesting.

Thanks,
NeilBrown


>  	if (S_ISDIR(inode->i_mode))
>  		stat->blksize = NFS_SERVER(inode)->dtsize;
>  out:
> -- 
> 2.37.3
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
  2022-09-30 14:34   ` Chuck Lever III
@ 2022-10-03 23:39   ` NeilBrown
  2022-10-05 10:06     ` Jeff Layton
  2022-10-06 21:17     ` Dave Chinner
  1 sibling, 2 replies; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:39 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Fri, 30 Sep 2022, Jeff Layton wrote:
> Now that we can call into vfs_getattr to get the i_version field, use
> that facility to fetch it instead of doing it in nfsd4_change_attribute.
> 
> Neil also pointed out recently that IS_I_VERSION directory operations
> are always logged, and so we only need to mitigate the rollback problem
> on regular files. Also, we don't need to factor in the ctime when
> reexporting NFS or Ceph.
> 
> Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> with a v4 request. Then, instead of looking at IS_I_VERSION when
> generating the change attr, look at the result mask and only use it if
> STATX_VERSION is set. With this change, we can drop the fetch_iversion
> export operation as well.
> 
> Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> in the ctime if it's a regular file and the fs doesn't advertise
> STATX_ATTR_VERSION_MONOTONIC.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfs/export.c          |  7 -------
>  fs/nfsd/nfs4xdr.c        |  4 +++-
>  fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/nfsfh.h          | 29 +----------------------------
>  fs/nfsd/vfs.h            |  7 ++++++-
>  include/linux/exportfs.h |  1 -
>  6 files changed, 50 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 01596f2d0a1e..1a9d5aa51dfb 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
>  	return parent;
>  }
>  
> -static u64 nfs_fetch_iversion(struct inode *inode)
> -{
> -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> -	return inode_peek_iversion_raw(inode);
> -}
> -
>  const struct export_operations nfs_export_ops = {
>  	.encode_fh = nfs_encode_fh,
>  	.fh_to_dentry = nfs_fh_to_dentry,
>  	.get_parent = nfs_get_parent,
> -	.fetch_iversion = nfs_fetch_iversion,
>  	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
>  		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
>  		EXPORT_OP_NOATOMIC_ATTR,
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 1e9690a061ec..779c009314c6 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  			goto out;
>  	}
>  
> -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> +	err = vfs_getattr(&path, &stat,
> +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> +			  AT_STATX_SYNC_AS_STAT);
>  	if (err)
>  		goto out_nfserr;
>  	if (!(stat.result_mask & STATX_BTIME))
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index a5b71526cee0..9168bc657378 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
>  		stat.mtime = inode->i_mtime;
>  		stat.ctime = inode->i_ctime;
>  		stat.size  = inode->i_size;
> +		if (v4 && IS_I_VERSION(inode)) {
> +			stat.version = inode_query_iversion(inode);
> +			stat.result_mask |= STATX_VERSION;
> +		}

This is increasingly ugly.  I wonder if it is justified at all...

>  	}
>  	if (v4)
>  		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
>  	if (err) {
>  		fhp->fh_post_saved = false;
>  		fhp->fh_post_attr.ctime = inode->i_ctime;
> +		if (v4 && IS_I_VERSION(inode))
> +			fhp->fh_post_attr.version = inode_query_iversion(inode);

... ditto ...

>  	} else
>  		fhp->fh_post_saved = true;
>  	if (v4)
> @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
>  		return FSIDSOURCE_UUID;
>  	return FSIDSOURCE_DEV;
>  }
> +
> +/*
> + * We could use i_version alone as the change attribute.  However, i_version
> + * can go backwards on a regular file after an unclean shutdown.  On its own
> + * that doesn't necessarily cause a problem, but if i_version goes backwards
> + * and then is incremented again it could reuse a value that was previously
> + * used before boot, and a client who queried the two values might incorrectly
> + * assume nothing changed.
> + *
> + * By using both ctime and the i_version counter we guarantee that as long as
> + * time doesn't go backwards we never reuse an old value. If the filesystem
> + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> + *
> + * We only need to do this for regular files as well. For directories, we
> + * assume that the new change attr is always logged to stable storage in some
> + * fashion before the results can be seen.
> + */
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	if (stat->result_mask & STATX_VERSION) {
> +		chattr = stat->version;
> +
> +		if (S_ISREG(inode->i_mode) &&
> +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {

I would really rather that the fs got to make this decision.
If it can guarantee that the i_version is monotonic even over a crash
(which is probably can for directory, and might need changes to do for
files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
completely.
If it cannot, then it doesn't set the flag.
i.e. the S_ISREG() test should be in the filesystem, not in nfsd.


> +			chattr += (u64)stat->ctime.tv_sec << 30;
> +			chattr += stat->ctime.tv_nsec;
> +		}
> +	} else {
> +		chattr = time_to_chattr(&stat->ctime);
> +	}
> +	return chattr;
> +}
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index c3ae6414fc5c..4c223a7a91d4 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
>  	fhp->fh_pre_saved = false;
>  }
>  
> -/*
> - * We could use i_version alone as the change attribute.  However,
> - * i_version can go backwards after a reboot.  On its own that doesn't
> - * necessarily cause a problem, but if i_version goes backwards and then
> - * is incremented again it could reuse a value that was previously used
> - * before boot, and a client who queried the two values might
> - * incorrectly assume nothing changed.
> - *
> - * By using both ctime and the i_version counter we guarantee that as
> - * long as time doesn't go backwards we never reuse an old value.
> - */
> -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> -					 struct inode *inode)
> -{
> -	if (inode->i_sb->s_export_op->fetch_iversion)
> -		return inode->i_sb->s_export_op->fetch_iversion(inode);
> -	else if (IS_I_VERSION(inode)) {
> -		u64 chattr;
> -
> -		chattr =  stat->ctime.tv_sec;
> -		chattr <<= 30;
> -		chattr += stat->ctime.tv_nsec;
> -		chattr += inode_query_iversion(inode);
> -		return chattr;
> -	} else
> -		return time_to_chattr(&stat->ctime);
> -}
> -
> +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
>  extern void fh_fill_pre_attrs(struct svc_fh *fhp);
>  extern void fh_fill_post_attrs(struct svc_fh *fhp);
>  extern void fh_fill_both_attrs(struct svc_fh *fhp);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index c95cd414b4bb..a905f59481ee 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
>  
>  static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
>  {
> +	u32 request_mask = STATX_BASIC_STATS;
>  	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
>  			 .dentry = fh->fh_dentry};
> -	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
> +
> +	if (fh->fh_maxsize == NFS4_FHSIZE)
> +		request_mask |= (STATX_BTIME | STATX_VERSION);
> +
> +	return nfserrno(vfs_getattr(&p, stat, request_mask,
>  				    AT_STATX_SYNC_AS_STAT));
>  }
>  
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index fe848901fcc3..9f4d4bcbf251 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,7 +213,6 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> -	u64 (*fetch_iversion)(struct inode *);
>  #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
>  #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
>  #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
> -- 
> 2.37.3
> 
> 

Definitely more to like than to dislike here, so

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 7/9] vfs: expose STATX_VERSION to userland
  2022-09-30 11:18 ` [PATCH v6 7/9] vfs: expose STATX_VERSION to userland Jeff Layton
@ 2022-10-03 23:42   ` NeilBrown
  2022-10-05 10:08     ` Jeff Layton
  0 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2022-10-03 23:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Jeff Layton

On Fri, 30 Sep 2022, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Claim one of the spare fields in struct statx to hold a 64-bit inode
> version attribute. When userland requests STATX_VERSION, copy the
> value from the kstat struct there, and stop masking off
> STATX_ATTR_VERSION_MONOTONIC.
> 
> Update the test-statx sample program to output the change attr and
> MountId.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/stat.c                 | 12 +++---------
>  include/linux/stat.h      |  9 ---------
>  include/uapi/linux/stat.h |  6 ++++--
>  samples/vfs/test-statx.c  |  8 ++++++--
>  4 files changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index e7f8cd4b24e1..8396c372022f 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  
>  	memset(&tmp, 0, sizeof(tmp));
>  
> -	/* STATX_VERSION is kernel-only for now */
> -	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
> +	tmp.stx_mask = stat->result_mask;
>  	tmp.stx_blksize = stat->blksize;
> -	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
> -	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
> +	tmp.stx_attributes = stat->attributes;
>  	tmp.stx_nlink = stat->nlink;
>  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
>  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> @@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>  	tmp.stx_mnt_id = stat->mnt_id;
>  	tmp.stx_dio_mem_align = stat->dio_mem_align;
>  	tmp.stx_dio_offset_align = stat->dio_offset_align;
> +	tmp.stx_version = stat->version;
>  
>  	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>  }
> @@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
>  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
>  		return -EINVAL;
>  
> -	/* STATX_VERSION is kernel-only for now. Ignore requests
> -	 * from userland.
> -	 */
> -	mask &= ~STATX_VERSION;
> -
>  	error = vfs_statx(dfd, filename, flags, &stat, mask);
>  	if (error)
>  		return error;
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 4e9428d86a3a..69c79e4fd1b1 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -54,13 +54,4 @@ struct kstat {
>  	u32		dio_offset_align;
>  	u64		version;
>  };
> -
> -/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> -
> -/* mask values */
> -#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
> -
> -/* file attribute values */
> -#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> -
>  #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7cab2c65d3d7..4a0a1f27c059 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -127,7 +127,8 @@ struct statx {
>  	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
>  	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>  	/* 0xa0 */
> -	__u64	__spare3[12];	/* Spare space for future expansion */
> +	__u64	stx_version; /* Inode change attribute */
> +	__u64	__spare3[11];	/* Spare space for future expansion */
>  	/* 0x100 */
>  };
>  
> @@ -154,6 +155,7 @@ struct statx {
>  #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
>  #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
>  #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
> +#define STATX_VERSION		0x00004000U	/* Want/got stx_version */
>  
>  #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
>  
> @@ -189,6 +191,6 @@ struct statx {
>  #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
>  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
>  #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
> -
> +#define STATX_ATTR_VERSION_MONOTONIC	0x00400000 /* stx_version increases w/ every change */
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
> index 49c7a46cee07..bdbc371c9774 100644
> --- a/samples/vfs/test-statx.c
> +++ b/samples/vfs/test-statx.c
> @@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
>  	printf("Device: %-15s", buffer);
>  	if (stx->stx_mask & STATX_INO)
>  		printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> +	if (stx->stx_mask & STATX_MNT_ID)
> +		printf(" MountId: %llx", stx->stx_mnt_id);
>  	if (stx->stx_mask & STATX_NLINK)
>  		printf(" Links: %-5u", stx->stx_nlink);
>  	if (stx->stx_mask & STATX_TYPE) {
> @@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
>  	if (stx->stx_mask & STATX_CTIME)
>  		print_time("Change: ", &stx->stx_ctime);
>  	if (stx->stx_mask & STATX_BTIME)
> -		print_time(" Birth: ", &stx->stx_btime);
> +		print_time("Birth: ", &stx->stx_btime);
> +	if (stx->stx_mask & STATX_VERSION)
> +		printf("Inode Version: 0x%llx\n", stx->stx_version);

Why hex? not decimal?  I don't really care but it seems like an odd choice.

>  
>  	if (stx->stx_attributes_mask) {
>  		unsigned char bits, mbits;
> @@ -218,7 +222,7 @@ int main(int argc, char **argv)
>  	struct statx stx;
>  	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
>  
> -	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> +	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
>  
>  	for (argv++; *argv; argv++) {
>  		if (strcmp(*argv, "-F") == 0) {
> -- 
> 2.37.3
> 
> 

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 4/9] nfs: report the inode version in getattr if requested
  2022-10-03 23:29   ` NeilBrown
@ 2022-10-04  9:43     ` Jeff Layton
  2022-10-04 22:27       ` NeilBrown
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-10-04  9:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-04 at 10:29 +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > Allow NFS to report the i_version in getattr requests. Since the cost to
> > fetch it is relatively cheap, do it unconditionally and just set the
> > flag if it looks like it's valid. Also, conditionally enable the
> > MONOTONIC flag when the server reports its change attr type as such.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/nfs/inode.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > index bea7c005119c..5cb7017e5089 100644
> > --- a/fs/nfs/inode.c
> > +++ b/fs/nfs/inode.c
> > @@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
> >  		reply_mask |= STATX_UID | STATX_GID;
> >  	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
> >  		reply_mask |= STATX_BLOCKS;
> > +	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
> > +		reply_mask |= STATX_VERSION;
> >  	return reply_mask;
> >  }
> >  
> > @@ -848,7 +850,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> >  
> >  	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
> >  			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
> > -			STATX_INO | STATX_SIZE | STATX_BLOCKS;
> > +			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_VERSION;
> >  
> >  	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
> >  		if (readdirplus_enabled)
> > @@ -877,7 +879,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> >  	/* Is the user requesting attributes that might need revalidation? */
> >  	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
> >  					STATX_MTIME|STATX_UID|STATX_GID|
> > -					STATX_SIZE|STATX_BLOCKS)))
> > +					STATX_SIZE|STATX_BLOCKS|STATX_VERSION)))
> >  		goto out_no_revalidate;
> >  
> >  	/* Check whether the cached attributes are stale */
> > @@ -915,6 +917,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> >  
> >  	generic_fillattr(&init_user_ns, inode, stat);
> >  	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> > +	stat->version = inode_peek_iversion_raw(inode);
> 
> This looks wrong.
> 1/ it includes the I_VERSION_QUERIED bit, which should be hidden.
> 2/ it doesn't set that bit.
> 
> I understand that the bit was already set when the generic code called
> inode_query_iversion(), but it might have changed if we needed to
> refresh the attrs.
> 
> I'm beginning to think I shouldn't have approved the 3/9 patch.  The
> stat->version shouldn't be set in vfs_getattr_nosec() - maybe in
> generic_fillattr(), but not a lot of point.
> 

NFS (and Ceph), do not set the SB_I_VERSION flag and they don't use the
QUERIED bit. These are "server managed" implementations of i_version.
The server is responsible for incrementing the value, and we just store
the result in the i_version field and present it when needed. That's why
the patch for NFS is using the "raw" API.

> > +	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
> > +	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
> > +		stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
> 
> So if the server tells us that the change attrs is based on time
> metadata, we accept that it will be monotonic (and RFC7862 encourages
> this), even though we seem to worry about timestamps going backwards
> (which we know that can)...  Interesting.
> 
> 

I followed suit from nfs_inode_attrs_cmp(). It seems to treat any value
that isn't UNDEFINED as MONOTONIC, though it does use a less strict
comparator for NFS4_CHANGE_TYPE_IS_TIME_METADATA. It may make sense to
carve that out as an exception.

This is probably an indicator that we need a more strict definition for
STATX_ATTR_VERSION_MONOTONIC.


> 
> >  	if (S_ISDIR(inode->i_mode))
> >  		stat->blksize = NFS_SERVER(inode)->dtsize;
> >  out:
> > -- 
> > 2.37.3
> > 
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated
  2022-10-03 23:10   ` NeilBrown
@ 2022-10-04  9:53     ` Jeff Layton
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-10-04  9:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs, Colin Walters

On Tue, 2022-10-04 at 10:10 +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but NFSv4 has certain expectations. Update the comments
> > in iversion.h to describe when the i_version must change.
> > 
> > Cc: Colin Walters <walters@verbum.org>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  include/linux/iversion.h | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 6755d8b4f20b..9925cac1fa94 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,14 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear larger to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). i_version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > + * to an operation, then the i_version counter must be incremented as well.
> 
> POSIX doesn't (that I can see) describe when the ctime changes w.r.t
> when the file changes.  For i_version we do want to specify that
> i_version change is not visible before the file change is visible.
> So this goes beyond the POSIX mandate.  I might be worth making that
> explicit.
> But this patch is nonetheless an improvement, so:
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> 
> 

Thanks,

Now that we're looking at setting the ctime and i_version separately,
I'll plan to add something that makes this explict.

> >   *
> >   * Observers see the i_version as a 64-bit number that never decreases. If it
> >   * remains the same since it was last checked, then nothing has changed in the
> > -- 
> > 2.37.3
> > 
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 4/9] nfs: report the inode version in getattr if requested
  2022-10-04  9:43     ` Jeff Layton
@ 2022-10-04 22:27       ` NeilBrown
  0 siblings, 0 replies; 35+ messages in thread
From: NeilBrown @ 2022-10-04 22:27 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 04 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 10:29 +1100, NeilBrown wrote:
> > On Fri, 30 Sep 2022, Jeff Layton wrote:
> > > Allow NFS to report the i_version in getattr requests. Since the cost to
> > > fetch it is relatively cheap, do it unconditionally and just set the
> > > flag if it looks like it's valid. Also, conditionally enable the
> > > MONOTONIC flag when the server reports its change attr type as such.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/nfs/inode.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > > index bea7c005119c..5cb7017e5089 100644
> > > --- a/fs/nfs/inode.c
> > > +++ b/fs/nfs/inode.c
> > > @@ -830,6 +830,8 @@ static u32 nfs_get_valid_attrmask(struct inode *inode)
> > >  		reply_mask |= STATX_UID | STATX_GID;
> > >  	if (!(cache_validity & NFS_INO_INVALID_BLOCKS))
> > >  		reply_mask |= STATX_BLOCKS;
> > > +	if (!(cache_validity & NFS_INO_INVALID_CHANGE))
> > > +		reply_mask |= STATX_VERSION;
> > >  	return reply_mask;
> > >  }
> > >  
> > > @@ -848,7 +850,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> > >  
> > >  	request_mask &= STATX_TYPE | STATX_MODE | STATX_NLINK | STATX_UID |
> > >  			STATX_GID | STATX_ATIME | STATX_MTIME | STATX_CTIME |
> > > -			STATX_INO | STATX_SIZE | STATX_BLOCKS;
> > > +			STATX_INO | STATX_SIZE | STATX_BLOCKS | STATX_VERSION;
> > >  
> > >  	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
> > >  		if (readdirplus_enabled)
> > > @@ -877,7 +879,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> > >  	/* Is the user requesting attributes that might need revalidation? */
> > >  	if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
> > >  					STATX_MTIME|STATX_UID|STATX_GID|
> > > -					STATX_SIZE|STATX_BLOCKS)))
> > > +					STATX_SIZE|STATX_BLOCKS|STATX_VERSION)))
> > >  		goto out_no_revalidate;
> > >  
> > >  	/* Check whether the cached attributes are stale */
> > > @@ -915,6 +917,10 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
> > >  
> > >  	generic_fillattr(&init_user_ns, inode, stat);
> > >  	stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> > > +	stat->version = inode_peek_iversion_raw(inode);
> > 
> > This looks wrong.
> > 1/ it includes the I_VERSION_QUERIED bit, which should be hidden.
> > 2/ it doesn't set that bit.
> > 
> > I understand that the bit was already set when the generic code called
> > inode_query_iversion(), but it might have changed if we needed to
> > refresh the attrs.
> > 
> > I'm beginning to think I shouldn't have approved the 3/9 patch.  The
> > stat->version shouldn't be set in vfs_getattr_nosec() - maybe in
> > generic_fillattr(), but not a lot of point.
> > 
> 
> NFS (and Ceph), do not set the SB_I_VERSION flag and they don't use the
> QUERIED bit. These are "server managed" implementations of i_version.
> The server is responsible for incrementing the value, and we just store
> the result in the i_version field and present it when needed. That's why
> the patch for NFS is using the "raw" API.

Ahh - of course.  I got confused because the "raw" api is used by code
(in iversion.h) that wants to access the QUERIED bit.  Maybe having
different names would help.  Or maybe me re-familiarising myself with
the interfaces would help...

Reviewed-by: NeilBrown <neilb@suse.de>


> 
> > > +	stat->attributes_mask |= STATX_ATTR_VERSION_MONOTONIC;
> > > +	if (server->change_attr_type != NFS4_CHANGE_TYPE_IS_UNDEFINED)
> > > +		stat->attributes |= STATX_ATTR_VERSION_MONOTONIC;
> > 
> > So if the server tells us that the change attrs is based on time
> > metadata, we accept that it will be monotonic (and RFC7862 encourages
> > this), even though we seem to worry about timestamps going backwards
> > (which we know that can)...  Interesting.
> > 
> > 
> 
> I followed suit from nfs_inode_attrs_cmp(). It seems to treat any value
> that isn't UNDEFINED as MONOTONIC, though it does use a less strict
> comparator for NFS4_CHANGE_TYPE_IS_TIME_METADATA. It may make sense to
> carve that out as an exception.
> 
> This is probably an indicator that we need a more strict definition for
> STATX_ATTR_VERSION_MONOTONIC.

Maybe.  Or maybe we decide that if the system time goes backwards and
things break, then you get to keep both halves.
The pedant in me want to handle that properly.  The pragmatist doesn't
think it is worth it.

Thanks,
NeilBrown


> 
> 
> > 
> > >  	if (S_ISDIR(inode->i_mode))
> > >  		stat->blksize = NFS_SERVER(inode)->dtsize;
> > >  out:
> > > -- 
> > > 2.37.3
> > > 
> > > 
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-03 23:39   ` NeilBrown
@ 2022-10-05 10:06     ` Jeff Layton
  2022-10-05 13:33       ` Chuck Lever III
                         ` (2 more replies)
  2022-10-06 21:17     ` Dave Chinner
  1 sibling, 3 replies; 35+ messages in thread
From: Jeff Layton @ 2022-10-05 10:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > Now that we can call into vfs_getattr to get the i_version field, use
> > that facility to fetch it instead of doing it in nfsd4_change_attribute.
> > 
> > Neil also pointed out recently that IS_I_VERSION directory operations
> > are always logged, and so we only need to mitigate the rollback problem
> > on regular files. Also, we don't need to factor in the ctime when
> > reexporting NFS or Ceph.
> > 
> > Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > generating the change attr, look at the result mask and only use it if
> > STATX_VERSION is set. With this change, we can drop the fetch_iversion
> > export operation as well.
> > 
> > Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> > in the ctime if it's a regular file and the fs doesn't advertise
> > STATX_ATTR_VERSION_MONOTONIC.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/nfs/export.c          |  7 -------
> >  fs/nfsd/nfs4xdr.c        |  4 +++-
> >  fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/nfsfh.h          | 29 +----------------------------
> >  fs/nfsd/vfs.h            |  7 ++++++-
> >  include/linux/exportfs.h |  1 -
> >  6 files changed, 50 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > index 01596f2d0a1e..1a9d5aa51dfb 100644
> > --- a/fs/nfs/export.c
> > +++ b/fs/nfs/export.c
> > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> >  	return parent;
> >  }
> >  
> > -static u64 nfs_fetch_iversion(struct inode *inode)
> > -{
> > -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> > -	return inode_peek_iversion_raw(inode);
> > -}
> > -
> >  const struct export_operations nfs_export_ops = {
> >  	.encode_fh = nfs_encode_fh,
> >  	.fh_to_dentry = nfs_fh_to_dentry,
> >  	.get_parent = nfs_get_parent,
> > -	.fetch_iversion = nfs_fetch_iversion,
> >  	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> >  		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
> >  		EXPORT_OP_NOATOMIC_ATTR,
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 1e9690a061ec..779c009314c6 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> >  			goto out;
> >  	}
> >  
> > -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> > +	err = vfs_getattr(&path, &stat,
> > +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> > +			  AT_STATX_SYNC_AS_STAT);
> >  	if (err)
> >  		goto out_nfserr;
> >  	if (!(stat.result_mask & STATX_BTIME))
> > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > index a5b71526cee0..9168bc657378 100644
> > --- a/fs/nfsd/nfsfh.c
> > +++ b/fs/nfsd/nfsfh.c
> > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> >  		stat.mtime = inode->i_mtime;
> >  		stat.ctime = inode->i_ctime;
> >  		stat.size  = inode->i_size;
> > +		if (v4 && IS_I_VERSION(inode)) {
> > +			stat.version = inode_query_iversion(inode);
> > +			stat.result_mask |= STATX_VERSION;
> > +		}
> 
> This is increasingly ugly.  I wonder if it is justified at all...
> 

I'm fine with dropping that. So if the getattrs fail, we should just not
offer up pre/post attrs?

> >  	}
> >  	if (v4)
> >  		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> >  	if (err) {
> >  		fhp->fh_post_saved = false;
> >  		fhp->fh_post_attr.ctime = inode->i_ctime;
> > +		if (v4 && IS_I_VERSION(inode))
> > +			fhp->fh_post_attr.version = inode_query_iversion(inode);
> 
> ... ditto ...
> 
> >  	} else
> >  		fhp->fh_post_saved = true;
> >  	if (v4)
> > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> >  		return FSIDSOURCE_UUID;
> >  	return FSIDSOURCE_DEV;
> >  }
> > +
> > +/*
> > + * We could use i_version alone as the change attribute.  However, i_version
> > + * can go backwards on a regular file after an unclean shutdown.  On its own
> > + * that doesn't necessarily cause a problem, but if i_version goes backwards
> > + * and then is incremented again it could reuse a value that was previously
> > + * used before boot, and a client who queried the two values might incorrectly
> > + * assume nothing changed.
> > + *
> > + * By using both ctime and the i_version counter we guarantee that as long as
> > + * time doesn't go backwards we never reuse an old value. If the filesystem
> > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> > + *
> > + * We only need to do this for regular files as well. For directories, we
> > + * assume that the new change attr is always logged to stable storage in some
> > + * fashion before the results can be seen.
> > + */
> > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> > +{
> > +	u64 chattr;
> > +
> > +	if (stat->result_mask & STATX_VERSION) {
> > +		chattr = stat->version;
> > +
> > +		if (S_ISREG(inode->i_mode) &&
> > +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> 
> I would really rather that the fs got to make this decision.
> If it can guarantee that the i_version is monotonic even over a crash
> (which is probably can for directory, and might need changes to do for
> files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> completely.
> If it cannot, then it doesn't set the flag.
> i.e. the S_ISREG() test should be in the filesystem, not in nfsd.
> 

This sounds reasonable, but for one thing.

From RFC 7862:

   While Section 5.4 of [RFC5661] discusses
   per-file system attributes, it is expected that the value of
   change_attr_type will not depend on the value of "homogeneous" and
   will only change in the event of a migration.

The change_attr_type4 must be the same for all filehandles under a
particular filesystem.

If we do what you suggest though, then it's easily possible for the fs
to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If we
later want to allow nfsd to advertise a change_attr_type4, we won't be
able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to fill
that out.

Maybe that's ok. I suppose we could add a new field to the export
options that filesystems can set to advertise what sort of change attr
they offer?

> 
> > +			chattr += (u64)stat->ctime.tv_sec << 30;
> > +			chattr += stat->ctime.tv_nsec;
> > +		}
> > +	} else {
> > +		chattr = time_to_chattr(&stat->ctime);
> > +	}
> > +	return chattr;
> > +}
> > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> > index c3ae6414fc5c..4c223a7a91d4 100644
> > --- a/fs/nfsd/nfsfh.h
> > +++ b/fs/nfsd/nfsfh.h
> > @@ -291,34 +291,7 @@ static inline void fh_clear_pre_post_attrs(struct svc_fh *fhp)
> >  	fhp->fh_pre_saved = false;
> >  }
> >  
> > -/*
> > - * We could use i_version alone as the change attribute.  However,
> > - * i_version can go backwards after a reboot.  On its own that doesn't
> > - * necessarily cause a problem, but if i_version goes backwards and then
> > - * is incremented again it could reuse a value that was previously used
> > - * before boot, and a client who queried the two values might
> > - * incorrectly assume nothing changed.
> > - *
> > - * By using both ctime and the i_version counter we guarantee that as
> > - * long as time doesn't go backwards we never reuse an old value.
> > - */
> > -static inline u64 nfsd4_change_attribute(struct kstat *stat,
> > -					 struct inode *inode)
> > -{
> > -	if (inode->i_sb->s_export_op->fetch_iversion)
> > -		return inode->i_sb->s_export_op->fetch_iversion(inode);
> > -	else if (IS_I_VERSION(inode)) {
> > -		u64 chattr;
> > -
> > -		chattr =  stat->ctime.tv_sec;
> > -		chattr <<= 30;
> > -		chattr += stat->ctime.tv_nsec;
> > -		chattr += inode_query_iversion(inode);
> > -		return chattr;
> > -	} else
> > -		return time_to_chattr(&stat->ctime);
> > -}
> > -
> > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode);
> >  extern void fh_fill_pre_attrs(struct svc_fh *fhp);
> >  extern void fh_fill_post_attrs(struct svc_fh *fhp);
> >  extern void fh_fill_both_attrs(struct svc_fh *fhp);
> > diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> > index c95cd414b4bb..a905f59481ee 100644
> > --- a/fs/nfsd/vfs.h
> > +++ b/fs/nfsd/vfs.h
> > @@ -168,9 +168,14 @@ static inline void fh_drop_write(struct svc_fh *fh)
> >  
> >  static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> >  {
> > +	u32 request_mask = STATX_BASIC_STATS;
> >  	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> >  			 .dentry = fh->fh_dentry};
> > -	return nfserrno(vfs_getattr(&p, stat, STATX_BASIC_STATS,
> > +
> > +	if (fh->fh_maxsize == NFS4_FHSIZE)
> > +		request_mask |= (STATX_BTIME | STATX_VERSION);
> > +
> > +	return nfserrno(vfs_getattr(&p, stat, request_mask,
> >  				    AT_STATX_SYNC_AS_STAT));
> >  }
> >  
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index fe848901fcc3..9f4d4bcbf251 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -213,7 +213,6 @@ struct export_operations {
> >  			  bool write, u32 *device_generation);
> >  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> >  			     int nr_iomaps, struct iattr *iattr);
> > -	u64 (*fetch_iversion)(struct inode *);
> >  #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
> >  #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
> >  #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
> > -- 
> > 2.37.3
> > 
> > 
> 
> Definitely more to like than to dislike here, so
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> 
> Thanks,
> NeilBrown

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 7/9] vfs: expose STATX_VERSION to userland
  2022-10-03 23:42   ` NeilBrown
@ 2022-10-05 10:08     ` Jeff Layton
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-10-05 10:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-04 at 10:42 +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@redhat.com>
> > 
> > Claim one of the spare fields in struct statx to hold a 64-bit inode
> > version attribute. When userland requests STATX_VERSION, copy the
> > value from the kstat struct there, and stop masking off
> > STATX_ATTR_VERSION_MONOTONIC.
> > 
> > Update the test-statx sample program to output the change attr and
> > MountId.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/stat.c                 | 12 +++---------
> >  include/linux/stat.h      |  9 ---------
> >  include/uapi/linux/stat.h |  6 ++++--
> >  samples/vfs/test-statx.c  |  8 ++++++--
> >  4 files changed, 13 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/stat.c b/fs/stat.c
> > index e7f8cd4b24e1..8396c372022f 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> >  
> >  	memset(&tmp, 0, sizeof(tmp));
> >  
> > -	/* STATX_VERSION is kernel-only for now */
> > -	tmp.stx_mask = stat->result_mask & ~STATX_VERSION;
> > +	tmp.stx_mask = stat->result_mask;
> >  	tmp.stx_blksize = stat->blksize;
> > -	/* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */
> > -	tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC;
> > +	tmp.stx_attributes = stat->attributes;
> >  	tmp.stx_nlink = stat->nlink;
> >  	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
> >  	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
> > @@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
> >  	tmp.stx_mnt_id = stat->mnt_id;
> >  	tmp.stx_dio_mem_align = stat->dio_mem_align;
> >  	tmp.stx_dio_offset_align = stat->dio_offset_align;
> > +	tmp.stx_version = stat->version;
> >  
> >  	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
> >  }
> > @@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags,
> >  	if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE)
> >  		return -EINVAL;
> >  
> > -	/* STATX_VERSION is kernel-only for now. Ignore requests
> > -	 * from userland.
> > -	 */
> > -	mask &= ~STATX_VERSION;
> > -
> >  	error = vfs_statx(dfd, filename, flags, &stat, mask);
> >  	if (error)
> >  		return error;
> > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > index 4e9428d86a3a..69c79e4fd1b1 100644
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> > @@ -54,13 +54,4 @@ struct kstat {
> >  	u32		dio_offset_align;
> >  	u64		version;
> >  };
> > -
> > -/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> > -
> > -/* mask values */
> > -#define STATX_VERSION		0x40000000U	/* Want/got stx_change_attr */
> > -
> > -/* file attribute values */
> > -#define STATX_ATTR_VERSION_MONOTONIC	0x8000000000000000ULL /* version monotonically increases */
> > -
> >  #endif
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index 7cab2c65d3d7..4a0a1f27c059 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -127,7 +127,8 @@ struct statx {
> >  	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
> >  	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
> >  	/* 0xa0 */
> > -	__u64	__spare3[12];	/* Spare space for future expansion */
> > +	__u64	stx_version; /* Inode change attribute */
> > +	__u64	__spare3[11];	/* Spare space for future expansion */
> >  	/* 0x100 */
> >  };
> >  
> > @@ -154,6 +155,7 @@ struct statx {
> >  #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
> >  #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
> >  #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
> > +#define STATX_VERSION		0x00004000U	/* Want/got stx_version */
> >  
> >  #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
> >  
> > @@ -189,6 +191,6 @@ struct statx {
> >  #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
> >  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
> >  #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
> > -
> > +#define STATX_ATTR_VERSION_MONOTONIC	0x00400000 /* stx_version increases w/ every change */
> >  
> >  #endif /* _UAPI_LINUX_STAT_H */
> > diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
> > index 49c7a46cee07..bdbc371c9774 100644
> > --- a/samples/vfs/test-statx.c
> > +++ b/samples/vfs/test-statx.c
> > @@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx)
> >  	printf("Device: %-15s", buffer);
> >  	if (stx->stx_mask & STATX_INO)
> >  		printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino);
> > +	if (stx->stx_mask & STATX_MNT_ID)
> > +		printf(" MountId: %llx", stx->stx_mnt_id);
> >  	if (stx->stx_mask & STATX_NLINK)
> >  		printf(" Links: %-5u", stx->stx_nlink);
> >  	if (stx->stx_mask & STATX_TYPE) {
> > @@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx)
> >  	if (stx->stx_mask & STATX_CTIME)
> >  		print_time("Change: ", &stx->stx_ctime);
> >  	if (stx->stx_mask & STATX_BTIME)
> > -		print_time(" Birth: ", &stx->stx_btime);
> > +		print_time("Birth: ", &stx->stx_btime);
> > +	if (stx->stx_mask & STATX_VERSION)
> > +		printf("Inode Version: 0x%llx\n", stx->stx_version);
> 
> Why hex? not decimal?  I don't really care but it seems like an odd choice.
> 

Habit. You're probably right that this is better viewed in decimal. I'll
change it in the next iteration. We should probably also have this
display the new DIOALIGN fields as well. I'll roll that in too.

> >  
> >  	if (stx->stx_attributes_mask) {
> >  		unsigned char bits, mbits;
> > @@ -218,7 +222,7 @@ int main(int argc, char **argv)
> >  	struct statx stx;
> >  	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> >  
> > -	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> > +	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION;
> >  
> >  	for (argv++; *argv; argv++) {
> >  		if (strcmp(*argv, "-F") == 0) {
> > -- 
> > 2.37.3
> > 
> > 
> 
> Reviewed-by: NeilBrown <neilb@suse.de>
> 
> Thanks,
> NeilBrown

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-05 10:06     ` Jeff Layton
@ 2022-10-05 13:33       ` Chuck Lever III
  2022-10-05 13:34       ` Trond Myklebust
  2022-10-05 21:14       ` NeilBrown
  2 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever III @ 2022-10-05 13:33 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Neil Brown, Theodore Ts'o, adilger.kernel, Darrick J. Wong,
	Dave Chinner, Trond Myklebust, Al Viro, zohar, xiubli,
	Lukas Czerner, Jan Kara, Bruce Fields, Christian Brauner,
	fweimer, linux-btrfs, linux-fsdevel, Linux Kernel Mailing List,
	ceph-devel, linux-ext4, Linux NFS Mailing List, linux-xfs



> On Oct 5, 2022, at 6:06 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
>> On Fri, 30 Sep 2022, Jeff Layton wrote:
>>> 
>>> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
>>> index a5b71526cee0..9168bc657378 100644
>>> --- a/fs/nfsd/nfsfh.c
>>> +++ b/fs/nfsd/nfsfh.c
>>> @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
>>> 		stat.mtime = inode->i_mtime;
>>> 		stat.ctime = inode->i_ctime;
>>> 		stat.size  = inode->i_size;
>>> +		if (v4 && IS_I_VERSION(inode)) {
>>> +			stat.version = inode_query_iversion(inode);
>>> +			stat.result_mask |= STATX_VERSION;
>>> +		}
>> 
>> This is increasingly ugly.  I wonder if it is justified at all...
>> 
> 
> I'm fine with dropping that. So if the getattrs fail, we should just not
> offer up pre/post attrs?

That sounds good to me.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-05 10:06     ` Jeff Layton
  2022-10-05 13:33       ` Chuck Lever III
@ 2022-10-05 13:34       ` Trond Myklebust
  2022-10-05 13:57         ` Jeff Layton
  2022-10-05 21:14       ` NeilBrown
  2 siblings, 1 reply; 35+ messages in thread
From: Trond Myklebust @ 2022-10-05 13:34 UTC (permalink / raw)
  To: jlayton, neilb
  Cc: zohar, djwong, xiubli, brauner, bfields, linux-btrfs, linux-xfs,
	david, fweimer, linux-kernel, chuck.lever, linux-nfs, tytso,
	viro, jack, linux-ext4, linux-fsdevel, lczerner, adilger.kernel,
	ceph-devel

On Wed, 2022-10-05 at 06:06 -0400, Jeff Layton wrote:
> On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
> > On Fri, 30 Sep 2022, Jeff Layton wrote:
> > > Now that we can call into vfs_getattr to get the i_version field,
> > > use
> > > that facility to fetch it instead of doing it in
> > > nfsd4_change_attribute.
> > > 
> > > Neil also pointed out recently that IS_I_VERSION directory
> > > operations
> > > are always logged, and so we only need to mitigate the rollback
> > > problem
> > > on regular files. Also, we don't need to factor in the ctime when
> > > reexporting NFS or Ceph.
> > > 
> > > Set the STATX_VERSION (and BTIME) bits in the request when we're
> > > dealing
> > > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > > generating the change attr, look at the result mask and only use
> > > it if
> > > STATX_VERSION is set. With this change, we can drop the
> > > fetch_iversion
> > > export operation as well.
> > > 
> > > Move nfsd4_change_attribute into nfsfh.c, and change it to only
> > > factor
> > > in the ctime if it's a regular file and the fs doesn't advertise
> > > STATX_ATTR_VERSION_MONOTONIC.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/nfs/export.c          |  7 -------
> > >  fs/nfsd/nfs4xdr.c        |  4 +++-
> > >  fs/nfsd/nfsfh.c          | 40
> > > ++++++++++++++++++++++++++++++++++++++++
> > >  fs/nfsd/nfsfh.h          | 29 +----------------------------
> > >  fs/nfsd/vfs.h            |  7 ++++++-
> > >  include/linux/exportfs.h |  1 -
> > >  6 files changed, 50 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > > index 01596f2d0a1e..1a9d5aa51dfb 100644
> > > --- a/fs/nfs/export.c
> > > +++ b/fs/nfs/export.c
> > > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> > >         return parent;
> > >  }
> > >  
> > > -static u64 nfs_fetch_iversion(struct inode *inode)
> > > -{
> > > -       nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> > > -       return inode_peek_iversion_raw(inode);
> > > -}
> > > -
> > >  const struct export_operations nfs_export_ops = {
> > >         .encode_fh = nfs_encode_fh,
> > >         .fh_to_dentry = nfs_fh_to_dentry,
> > >         .get_parent = nfs_get_parent,
> > > -       .fetch_iversion = nfs_fetch_iversion,
> > >         .flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> > >                 EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS
> > > |
> > >                 EXPORT_OP_NOATOMIC_ATTR,
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 1e9690a061ec..779c009314c6 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr,
> > > struct svc_fh *fhp,
> > >                         goto out;
> > >         }
> > >  
> > > -       err = vfs_getattr(&path, &stat, STATX_BASIC_STATS,
> > > AT_STATX_SYNC_AS_STAT);
> > > +       err = vfs_getattr(&path, &stat,
> > > +                         STATX_BASIC_STATS | STATX_BTIME |
> > > STATX_VERSION,
> > > +                         AT_STATX_SYNC_AS_STAT);
> > >         if (err)
> > >                 goto out_nfserr;
> > >         if (!(stat.result_mask & STATX_BTIME))
> > > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > > index a5b71526cee0..9168bc657378 100644
> > > --- a/fs/nfsd/nfsfh.c
> > > +++ b/fs/nfsd/nfsfh.c
> > > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> > >                 stat.mtime = inode->i_mtime;
> > >                 stat.ctime = inode->i_ctime;
> > >                 stat.size  = inode->i_size;
> > > +               if (v4 && IS_I_VERSION(inode)) {
> > > +                       stat.version =
> > > inode_query_iversion(inode);
> > > +                       stat.result_mask |= STATX_VERSION;
> > > +               }
> > 
> > This is increasingly ugly.  I wonder if it is justified at all...
> > 
> 
> I'm fine with dropping that. So if the getattrs fail, we should just
> not
> offer up pre/post attrs?
> 
> > >         }
> > >         if (v4)
> > >                 fhp->fh_pre_change =
> > > nfsd4_change_attribute(&stat, inode);
> > > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> > >         if (err) {
> > >                 fhp->fh_post_saved = false;
> > >                 fhp->fh_post_attr.ctime = inode->i_ctime;
> > > +               if (v4 && IS_I_VERSION(inode))
> > > +                       fhp->fh_post_attr.version =
> > > inode_query_iversion(inode);
> > 
> > ... ditto ...
> > 
> > >         } else
> > >                 fhp->fh_post_saved = true;
> > >         if (v4)
> > > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct
> > > svc_fh *fhp)
> > >                 return FSIDSOURCE_UUID;
> > >         return FSIDSOURCE_DEV;
> > >  }
> > > +
> > > +/*
> > > + * We could use i_version alone as the change attribute. 
> > > However, i_version
> > > + * can go backwards on a regular file after an unclean
> > > shutdown.  On its own
> > > + * that doesn't necessarily cause a problem, but if i_version
> > > goes backwards
> > > + * and then is incremented again it could reuse a value that was
> > > previously
> > > + * used before boot, and a client who queried the two values
> > > might incorrectly
> > > + * assume nothing changed.
> > > + *
> > > + * By using both ctime and the i_version counter we guarantee
> > > that as long as
> > > + * time doesn't go backwards we never reuse an old value. If the
> > > filesystem
> > > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation
> > > is not needed.
> > > + *
> > > + * We only need to do this for regular files as well. For
> > > directories, we
> > > + * assume that the new change attr is always logged to stable
> > > storage in some
> > > + * fashion before the results can be seen.
> > > + */
> > > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode
> > > *inode)
> > > +{
> > > +       u64 chattr;
> > > +
> > > +       if (stat->result_mask & STATX_VERSION) {
> > > +               chattr = stat->version;
> > > +
> > > +               if (S_ISREG(inode->i_mode) &&
> > > +                   !(stat->attributes &
> > > STATX_ATTR_VERSION_MONOTONIC)) {
> > 
> > I would really rather that the fs got to make this decision.
> > If it can guarantee that the i_version is monotonic even over a
> > crash
> > (which is probably can for directory, and might need changes to do
> > for
> > files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> > completely.
> > If it cannot, then it doesn't set the flag.
> > i.e. the S_ISREG() test should be in the filesystem, not in nfsd.
> > 
> 
> This sounds reasonable, but for one thing.
> 
> From RFC 7862:
> 
>    While Section 5.4 of [RFC5661] discusses
>    per-file system attributes, it is expected that the value of
>    change_attr_type will not depend on the value of "homogeneous" and
>    will only change in the event of a migration.
> 
> The change_attr_type4 must be the same for all filehandles under a
> particular filesystem.
> 
> If we do what you suggest though, then it's easily possible for the
> fs
> to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If
> we
> later want to allow nfsd to advertise a change_attr_type4, we won't
> be
> able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to
> fill
> that out.

That will break clients. So no, that's not acceptable.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-05 13:34       ` Trond Myklebust
@ 2022-10-05 13:57         ` Jeff Layton
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-10-05 13:57 UTC (permalink / raw)
  To: Trond Myklebust, neilb
  Cc: zohar, djwong, xiubli, brauner, bfields, linux-btrfs, linux-xfs,
	david, fweimer, linux-kernel, chuck.lever, linux-nfs, tytso,
	viro, jack, linux-ext4, linux-fsdevel, lczerner, adilger.kernel,
	ceph-devel

On Wed, 2022-10-05 at 13:34 +0000, Trond Myklebust wrote:
> On Wed, 2022-10-05 at 06:06 -0400, Jeff Layton wrote:
> > On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
> > > On Fri, 30 Sep 2022, Jeff Layton wrote:
> > > > Now that we can call into vfs_getattr to get the i_version field,
> > > > use
> > > > that facility to fetch it instead of doing it in
> > > > nfsd4_change_attribute.
> > > > 
> > > > Neil also pointed out recently that IS_I_VERSION directory
> > > > operations
> > > > are always logged, and so we only need to mitigate the rollback
> > > > problem
> > > > on regular files. Also, we don't need to factor in the ctime when
> > > > reexporting NFS or Ceph.
> > > > 
> > > > Set the STATX_VERSION (and BTIME) bits in the request when we're
> > > > dealing
> > > > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > > > generating the change attr, look at the result mask and only use
> > > > it if
> > > > STATX_VERSION is set. With this change, we can drop the
> > > > fetch_iversion
> > > > export operation as well.
> > > > 
> > > > Move nfsd4_change_attribute into nfsfh.c, and change it to only
> > > > factor
> > > > in the ctime if it's a regular file and the fs doesn't advertise
> > > > STATX_ATTR_VERSION_MONOTONIC.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/nfs/export.c          |  7 -------
> > > >  fs/nfsd/nfs4xdr.c        |  4 +++-
> > > >  fs/nfsd/nfsfh.c          | 40
> > > > ++++++++++++++++++++++++++++++++++++++++
> > > >  fs/nfsd/nfsfh.h          | 29 +----------------------------
> > > >  fs/nfsd/vfs.h            |  7 ++++++-
> > > >  include/linux/exportfs.h |  1 -
> > > >  6 files changed, 50 insertions(+), 38 deletions(-)
> > > > 
> > > > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > > > index 01596f2d0a1e..1a9d5aa51dfb 100644
> > > > --- a/fs/nfs/export.c
> > > > +++ b/fs/nfs/export.c
> > > > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> > > >         return parent;
> > > >  }
> > > >  
> > > > -static u64 nfs_fetch_iversion(struct inode *inode)
> > > > -{
> > > > -       nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> > > > -       return inode_peek_iversion_raw(inode);
> > > > -}
> > > > -
> > > >  const struct export_operations nfs_export_ops = {
> > > >         .encode_fh = nfs_encode_fh,
> > > >         .fh_to_dentry = nfs_fh_to_dentry,
> > > >         .get_parent = nfs_get_parent,
> > > > -       .fetch_iversion = nfs_fetch_iversion,
> > > >         .flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> > > >                 EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS
> > > > > 
> > > >                 EXPORT_OP_NOATOMIC_ATTR,
> > > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > > index 1e9690a061ec..779c009314c6 100644
> > > > --- a/fs/nfsd/nfs4xdr.c
> > > > +++ b/fs/nfsd/nfs4xdr.c
> > > > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr,
> > > > struct svc_fh *fhp,
> > > >                         goto out;
> > > >         }
> > > >  
> > > > -       err = vfs_getattr(&path, &stat, STATX_BASIC_STATS,
> > > > AT_STATX_SYNC_AS_STAT);
> > > > +       err = vfs_getattr(&path, &stat,
> > > > +                         STATX_BASIC_STATS | STATX_BTIME |
> > > > STATX_VERSION,
> > > > +                         AT_STATX_SYNC_AS_STAT);
> > > >         if (err)
> > > >                 goto out_nfserr;
> > > >         if (!(stat.result_mask & STATX_BTIME))
> > > > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > > > index a5b71526cee0..9168bc657378 100644
> > > > --- a/fs/nfsd/nfsfh.c
> > > > +++ b/fs/nfsd/nfsfh.c
> > > > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> > > >                 stat.mtime = inode->i_mtime;
> > > >                 stat.ctime = inode->i_ctime;
> > > >                 stat.size  = inode->i_size;
> > > > +               if (v4 && IS_I_VERSION(inode)) {
> > > > +                       stat.version =
> > > > inode_query_iversion(inode);
> > > > +                       stat.result_mask |= STATX_VERSION;
> > > > +               }
> > > 
> > > This is increasingly ugly.  I wonder if it is justified at all...
> > > 
> > 
> > I'm fine with dropping that. So if the getattrs fail, we should just
> > not
> > offer up pre/post attrs?
> > 
> > > >         }
> > > >         if (v4)
> > > >                 fhp->fh_pre_change =
> > > > nfsd4_change_attribute(&stat, inode);
> > > > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> > > >         if (err) {
> > > >                 fhp->fh_post_saved = false;
> > > >                 fhp->fh_post_attr.ctime = inode->i_ctime;
> > > > +               if (v4 && IS_I_VERSION(inode))
> > > > +                       fhp->fh_post_attr.version =
> > > > inode_query_iversion(inode);
> > > 
> > > ... ditto ...
> > > 
> > > >         } else
> > > >                 fhp->fh_post_saved = true;
> > > >         if (v4)
> > > > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct
> > > > svc_fh *fhp)
> > > >                 return FSIDSOURCE_UUID;
> > > >         return FSIDSOURCE_DEV;
> > > >  }
> > > > +
> > > > +/*
> > > > + * We could use i_version alone as the change attribute. 
> > > > However, i_version
> > > > + * can go backwards on a regular file after an unclean
> > > > shutdown.  On its own
> > > > + * that doesn't necessarily cause a problem, but if i_version
> > > > goes backwards
> > > > + * and then is incremented again it could reuse a value that was
> > > > previously
> > > > + * used before boot, and a client who queried the two values
> > > > might incorrectly
> > > > + * assume nothing changed.
> > > > + *
> > > > + * By using both ctime and the i_version counter we guarantee
> > > > that as long as
> > > > + * time doesn't go backwards we never reuse an old value. If the
> > > > filesystem
> > > > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation
> > > > is not needed.
> > > > + *
> > > > + * We only need to do this for regular files as well. For
> > > > directories, we
> > > > + * assume that the new change attr is always logged to stable
> > > > storage in some
> > > > + * fashion before the results can be seen.
> > > > + */
> > > > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode
> > > > *inode)
> > > > +{
> > > > +       u64 chattr;
> > > > +
> > > > +       if (stat->result_mask & STATX_VERSION) {
> > > > +               chattr = stat->version;
> > > > +
> > > > +               if (S_ISREG(inode->i_mode) &&
> > > > +                   !(stat->attributes &
> > > > STATX_ATTR_VERSION_MONOTONIC)) {
> > > 
> > > I would really rather that the fs got to make this decision.
> > > If it can guarantee that the i_version is monotonic even over a
> > > crash
> > > (which is probably can for directory, and might need changes to do
> > > for
> > > files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> > > completely.
> > > If it cannot, then it doesn't set the flag.
> > > i.e. the S_ISREG() test should be in the filesystem, not in nfsd.
> > > 
> > 
> > This sounds reasonable, but for one thing.
> > 
> > From RFC 7862:
> > 
> >    While Section 5.4 of [RFC5661] discusses
> >    per-file system attributes, it is expected that the value of
> >    change_attr_type will not depend on the value of "homogeneous" and
> >    will only change in the event of a migration.
> > 
> > The change_attr_type4 must be the same for all filehandles under a
> > particular filesystem.
> > 
> > If we do what you suggest though, then it's easily possible for the
> > fs
> > to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If
> > we
> > later want to allow nfsd to advertise a change_attr_type4, we won't
> > be
> > able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to
> > fill
> > that out.
> 
> That will break clients. So no, that's not acceptable.
> 

Yeah. This is why I mentioned that this flag would have been better
advertised via fsinfo(), had that been a thing.

One option is to just document that an fs must advertise the same flag
value for all inodes.

Alternately, we could allow the fs to set the STATX_ATTR_* flag with
per-inode granularity, and for nfsd, just add a new change_attr_type()
op to export_operations. Most filesystems would just have that return a
hardcoded value, but an nfs reexport could just pass through whatever
value it got from the server.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-10-03 22:56         ` NeilBrown
@ 2022-10-05 16:40           ` Jeff Layton
  2022-10-05 21:40             ` NeilBrown
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff Layton @ 2022-10-05 16:40 UTC (permalink / raw)
  To: NeilBrown, Amir Goldstein
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, 2022-10-04 at 09:56 +1100, NeilBrown wrote:
> On Tue, 04 Oct 2022, Amir Goldstein wrote:
> > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > 
> > > > > The c/mtime and i_version currently get updated before the data is
> > > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > > > 
> > > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > > to make the client associate the state of the file with the wrong change
> > > > > attribute. That association can persist indefinitely if the file sees no
> > > > > further changes.
> > > > > 
> > > > > Move the setting of times to the bottom of the function in
> > > > > __generic_file_write_iter and only update it if something was
> > > > > successfully written.
> > > > > 
> > > > 
> > > > This solution is wrong for several reasons:
> > > > 
> > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > >     solved the problem completely
> > > 
> > > Right. I don't think there is a way to solve the problem vs. mmap.
> > > Userland can write to a writeable mmap'ed page at any time and we'd
> > > never know. We have to specifically carve out mmap as an exception here.
> > > I'll plan to add something to the manpage patch for this.
> > > 
> > > > 2. The other side of the coin is that post crash state is more likely to end
> > > >     up data changes without mtime/ctime change
> > > > 
> > > 
> > > Is this really something filesystems rely on? I suppose the danger is
> > > that some cached data gets written to disk before the write returns and
> > > the inode on disk never gets updated.
> > > 
> > > But...isn't that a danger now? Some of the cached data could get written
> > > out and the updated inode just never makes it to disk before a crash
> > > (AFAIU). I'm not sure that this increases our exposure to that problem.
> > > 
> > > 
> > 
> > You are correct that that danger exists, but it only exists for overwriting
> > to allocated blocks.
> > 
> > For writing to new blocks, mtime change is recorded in transaction
> > before the block mapping is recorded in transaction so there is no
> > danger in this case (before your patch).
> > 
> > Also, observing size change without observing mtime change
> > after crash seems like a very bad outcome that may be possible
> > after your change.
> > 
> > These are just a few cases that I could think of, they may be filesystem
> > dependent, but my gut feeling is that if you remove the time update before
> > the operation, that has been like that forever, a lot of s#!t is going to float
> > for various filesystems and applications.
> > 
> > And it is not one of those things that are discovered  during rc or even
> > stable kernel testing - they are discovered much later when users start to
> > realize their applications got bogged up after crash, so it feels like to me
> > like playing with fire.
> > 
> > > > If I read the problem description correctly, then a solution that invalidates
> > > > the NFS cache before AND after the write would be acceptable. Right?
> > > > Would an extra i_version bump after the write solve the race?
> > > > 
> > > 
> > > I based this patch on Neil's assertion that updating the time before an
> > > operation was pointless if we were going to do it afterward. The NFS
> > > client only really cares about seeing it change after a write.
> > > 
> > 
> > Pointless to NFS client maybe.
> > Whether or not this is not changing user behavior for other applications
> > is up to you to prove and I doubt that you can prove it because I doubt
> > that it is true.
> > 
> > > Doing both would be fine from a correctness standpoint, and in most
> > > cases, the second would be a no-op anyway since a query would have to
> > > race in between the two for that to happen.
> > > 
> > > FWIW, I think we should update the m/ctime and version at the same time.
> > > If the version changes, then there is always the potential that a timer
> > > tick has occurred. So, that would translate to a second call to
> > > file_update_time in here.
> > > 
> > > The downside of bumping the times/version both before and after is that
> > > these are hot codepaths, and we'd be adding extra operations there. Even
> > > in the case where nothing has changed, we'd have to call
> > > inode_needs_update_time a second time for every write. Is that worth the
> > > cost?
> > 
> > Is there a practical cost for iversion bump AFTER write as I suggested?
> > If you NEED m/ctime update AFTER write and iversion update is not enough
> > then I did not understand from your commit message why that is.
> > 
> > Thanks,
> > Amir.
> > 
> 
> Maybe we should split i_version updates from ctime updates.
> 
> While it isn't true that ctime updates have happened before the write
> "forever" it has been true since 2.3.43[1] which is close to forever.
> 
> For ctime there doesn't appear to be a strong specification of when the
> change happens, so history provides a good case for leaving it before.
> For i_version we want to provide clear and unambiguous semantics.
> Performing 2 updates makes the specification muddy.
> 
> So I would prefer a single update for i_version, performed after the
> change becomes visible.  If that means it has to be separate from ctime,
> then so be it.
> 
> NeilBrown
> 
> 
> [1]:  https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29


Not necessarily. We can document it in such a way that bumping it twice
is allowed, but not required.

My main concern with splitting them up is that we'd have to dirty the
inode twice if both the times and the i_version need updating. If the
inode gets written out in between, then we end up doing twice the I/O.
The interim on-disk metadata would be in sort of a weird state too --
the ctime would have changed but the version would still be old.

It might be worthwhile to just go ahead and continue bumping it in
file_update_time, and then we'd just attempt to bump the i_version again
afterward. The second bump will almost always be a no-op anyway.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-05 10:06     ` Jeff Layton
  2022-10-05 13:33       ` Chuck Lever III
  2022-10-05 13:34       ` Trond Myklebust
@ 2022-10-05 21:14       ` NeilBrown
  2022-10-06 11:15         ` Jeff Layton
  2 siblings, 1 reply; 35+ messages in thread
From: NeilBrown @ 2022-10-05 21:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Wed, 05 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
> > On Fri, 30 Sep 2022, Jeff Layton wrote:
> > > Now that we can call into vfs_getattr to get the i_version field, use
> > > that facility to fetch it instead of doing it in nfsd4_change_attribute.
> > > 
> > > Neil also pointed out recently that IS_I_VERSION directory operations
> > > are always logged, and so we only need to mitigate the rollback problem
> > > on regular files. Also, we don't need to factor in the ctime when
> > > reexporting NFS or Ceph.
> > > 
> > > Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> > > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > > generating the change attr, look at the result mask and only use it if
> > > STATX_VERSION is set. With this change, we can drop the fetch_iversion
> > > export operation as well.
> > > 
> > > Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> > > in the ctime if it's a regular file and the fs doesn't advertise
> > > STATX_ATTR_VERSION_MONOTONIC.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/nfs/export.c          |  7 -------
> > >  fs/nfsd/nfs4xdr.c        |  4 +++-
> > >  fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
> > >  fs/nfsd/nfsfh.h          | 29 +----------------------------
> > >  fs/nfsd/vfs.h            |  7 ++++++-
> > >  include/linux/exportfs.h |  1 -
> > >  6 files changed, 50 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > > index 01596f2d0a1e..1a9d5aa51dfb 100644
> > > --- a/fs/nfs/export.c
> > > +++ b/fs/nfs/export.c
> > > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> > >  	return parent;
> > >  }
> > >  
> > > -static u64 nfs_fetch_iversion(struct inode *inode)
> > > -{
> > > -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> > > -	return inode_peek_iversion_raw(inode);
> > > -}
> > > -
> > >  const struct export_operations nfs_export_ops = {
> > >  	.encode_fh = nfs_encode_fh,
> > >  	.fh_to_dentry = nfs_fh_to_dentry,
> > >  	.get_parent = nfs_get_parent,
> > > -	.fetch_iversion = nfs_fetch_iversion,
> > >  	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> > >  		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
> > >  		EXPORT_OP_NOATOMIC_ATTR,
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 1e9690a061ec..779c009314c6 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> > >  			goto out;
> > >  	}
> > >  
> > > -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> > > +	err = vfs_getattr(&path, &stat,
> > > +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> > > +			  AT_STATX_SYNC_AS_STAT);
> > >  	if (err)
> > >  		goto out_nfserr;
> > >  	if (!(stat.result_mask & STATX_BTIME))
> > > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > > index a5b71526cee0..9168bc657378 100644
> > > --- a/fs/nfsd/nfsfh.c
> > > +++ b/fs/nfsd/nfsfh.c
> > > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> > >  		stat.mtime = inode->i_mtime;
> > >  		stat.ctime = inode->i_ctime;
> > >  		stat.size  = inode->i_size;
> > > +		if (v4 && IS_I_VERSION(inode)) {
> > > +			stat.version = inode_query_iversion(inode);
> > > +			stat.result_mask |= STATX_VERSION;
> > > +		}
> > 
> > This is increasingly ugly.  I wonder if it is justified at all...
> > 
> 
> I'm fine with dropping that. So if the getattrs fail, we should just not
> offer up pre/post attrs?
> 
> > >  	}
> > >  	if (v4)
> > >  		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> > > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> > >  	if (err) {
> > >  		fhp->fh_post_saved = false;
> > >  		fhp->fh_post_attr.ctime = inode->i_ctime;
> > > +		if (v4 && IS_I_VERSION(inode))
> > > +			fhp->fh_post_attr.version = inode_query_iversion(inode);
> > 
> > ... ditto ...
> > 
> > >  	} else
> > >  		fhp->fh_post_saved = true;
> > >  	if (v4)
> > > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> > >  		return FSIDSOURCE_UUID;
> > >  	return FSIDSOURCE_DEV;
> > >  }
> > > +
> > > +/*
> > > + * We could use i_version alone as the change attribute.  However, i_version
> > > + * can go backwards on a regular file after an unclean shutdown.  On its own
> > > + * that doesn't necessarily cause a problem, but if i_version goes backwards
> > > + * and then is incremented again it could reuse a value that was previously
> > > + * used before boot, and a client who queried the two values might incorrectly
> > > + * assume nothing changed.
> > > + *
> > > + * By using both ctime and the i_version counter we guarantee that as long as
> > > + * time doesn't go backwards we never reuse an old value. If the filesystem
> > > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> > > + *
> > > + * We only need to do this for regular files as well. For directories, we
> > > + * assume that the new change attr is always logged to stable storage in some
> > > + * fashion before the results can be seen.
> > > + */
> > > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> > > +{
> > > +	u64 chattr;
> > > +
> > > +	if (stat->result_mask & STATX_VERSION) {
> > > +		chattr = stat->version;
> > > +
> > > +		if (S_ISREG(inode->i_mode) &&
> > > +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> > 
> > I would really rather that the fs got to make this decision.
> > If it can guarantee that the i_version is monotonic even over a crash
> > (which is probably can for directory, and might need changes to do for
> > files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> > completely.
> > If it cannot, then it doesn't set the flag.
> > i.e. the S_ISREG() test should be in the filesystem, not in nfsd.
> > 
> 
> This sounds reasonable, but for one thing.
> 
> From RFC 7862:
> 
>    While Section 5.4 of [RFC5661] discusses
>    per-file system attributes, it is expected that the value of
>    change_attr_type will not depend on the value of "homogeneous" and
>    will only change in the event of a migration.
> 
> The change_attr_type4 must be the same for all filehandles under a
> particular filesystem.
> 
> If we do what you suggest though, then it's easily possible for the fs
> to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If we
> later want to allow nfsd to advertise a change_attr_type4, we won't be
> able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to fill
> that out.
> 
> Maybe that's ok. I suppose we could add a new field to the export
> options that filesystems can set to advertise what sort of change attr
> they offer?
> 

There are 3 cases:
1/ a file/dir which advertises MONOTONIC is easy to handle.
2/ an IS_I_VERSION file/dir that does not advertise MONOTONIC will only fail
   to be MONOTONIC across unclean restart (correct?).  nfsd can
   compensate using an xattr on the root to count crashes, or just adding ctime.
3/ a non-IS_I_VERSION fs that does not advertise MONOTONIC cannot
   be compensated for by nfsd.

If we ever want nfsd to advertise MONOTONIC, then we must be able to
reject non-IS_I_VERSION filesystems that don't advertise MONOTONIC on
all files.

Maybe we need a global nfsd option which defaults to "monotoric" and
causes those files to be rejected, but can be set to "non-monotonic" and
then allows all files to be exported.

It would be nice to make it easy to run multiple nfsd instances each on a
different IP address.  Each can then have different options.  This could
also be used to reexport an NFS mount using unmodified filehandles.

Currently you need a network namespace to create a new nfsd.  I wonder
if that is a little too much of a barrier.  But maybe we could automate
the creation of working network namespaces for nfsd....

NeilBrown

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter
  2022-10-05 16:40           ` Jeff Layton
@ 2022-10-05 21:40             ` NeilBrown
  0 siblings, 0 replies; 35+ messages in thread
From: NeilBrown @ 2022-10-05 21:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, tytso, adilger.kernel, djwong, david, trondmy,
	viro, zohar, xiubli, chuck.lever, lczerner, jack, bfields,
	brauner, fweimer, linux-btrfs, linux-fsdevel, linux-kernel,
	ceph-devel, linux-ext4, linux-nfs, linux-xfs

On Thu, 06 Oct 2022, Jeff Layton wrote:
> On Tue, 2022-10-04 at 09:56 +1100, NeilBrown wrote:
> > On Tue, 04 Oct 2022, Amir Goldstein wrote:
> > > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote:
> > > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > 
> > > > > > The c/mtime and i_version currently get updated before the data is
> > > > > > copied (or a DIO write is issued), which is problematic for NFS.
> > > > > > 
> > > > > > READ+GETATTR can race with a write (even a local one) in such a way as
> > > > > > to make the client associate the state of the file with the wrong change
> > > > > > attribute. That association can persist indefinitely if the file sees no
> > > > > > further changes.
> > > > > > 
> > > > > > Move the setting of times to the bottom of the function in
> > > > > > __generic_file_write_iter and only update it if something was
> > > > > > successfully written.
> > > > > > 
> > > > > 
> > > > > This solution is wrong for several reasons:
> > > > > 
> > > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't
> > > > >     solved the problem completely
> > > > 
> > > > Right. I don't think there is a way to solve the problem vs. mmap.
> > > > Userland can write to a writeable mmap'ed page at any time and we'd
> > > > never know. We have to specifically carve out mmap as an exception here.
> > > > I'll plan to add something to the manpage patch for this.
> > > > 
> > > > > 2. The other side of the coin is that post crash state is more likely to end
> > > > >     up data changes without mtime/ctime change
> > > > > 
> > > > 
> > > > Is this really something filesystems rely on? I suppose the danger is
> > > > that some cached data gets written to disk before the write returns and
> > > > the inode on disk never gets updated.
> > > > 
> > > > But...isn't that a danger now? Some of the cached data could get written
> > > > out and the updated inode just never makes it to disk before a crash
> > > > (AFAIU). I'm not sure that this increases our exposure to that problem.
> > > > 
> > > > 
> > > 
> > > You are correct that that danger exists, but it only exists for overwriting
> > > to allocated blocks.
> > > 
> > > For writing to new blocks, mtime change is recorded in transaction
> > > before the block mapping is recorded in transaction so there is no
> > > danger in this case (before your patch).
> > > 
> > > Also, observing size change without observing mtime change
> > > after crash seems like a very bad outcome that may be possible
> > > after your change.
> > > 
> > > These are just a few cases that I could think of, they may be filesystem
> > > dependent, but my gut feeling is that if you remove the time update before
> > > the operation, that has been like that forever, a lot of s#!t is going to float
> > > for various filesystems and applications.
> > > 
> > > And it is not one of those things that are discovered  during rc or even
> > > stable kernel testing - they are discovered much later when users start to
> > > realize their applications got bogged up after crash, so it feels like to me
> > > like playing with fire.
> > > 
> > > > > If I read the problem description correctly, then a solution that invalidates
> > > > > the NFS cache before AND after the write would be acceptable. Right?
> > > > > Would an extra i_version bump after the write solve the race?
> > > > > 
> > > > 
> > > > I based this patch on Neil's assertion that updating the time before an
> > > > operation was pointless if we were going to do it afterward. The NFS
> > > > client only really cares about seeing it change after a write.
> > > > 
> > > 
> > > Pointless to NFS client maybe.
> > > Whether or not this is not changing user behavior for other applications
> > > is up to you to prove and I doubt that you can prove it because I doubt
> > > that it is true.
> > > 
> > > > Doing both would be fine from a correctness standpoint, and in most
> > > > cases, the second would be a no-op anyway since a query would have to
> > > > race in between the two for that to happen.
> > > > 
> > > > FWIW, I think we should update the m/ctime and version at the same time.
> > > > If the version changes, then there is always the potential that a timer
> > > > tick has occurred. So, that would translate to a second call to
> > > > file_update_time in here.
> > > > 
> > > > The downside of bumping the times/version both before and after is that
> > > > these are hot codepaths, and we'd be adding extra operations there. Even
> > > > in the case where nothing has changed, we'd have to call
> > > > inode_needs_update_time a second time for every write. Is that worth the
> > > > cost?
> > > 
> > > Is there a practical cost for iversion bump AFTER write as I suggested?
> > > If you NEED m/ctime update AFTER write and iversion update is not enough
> > > then I did not understand from your commit message why that is.
> > > 
> > > Thanks,
> > > Amir.
> > > 
> > 
> > Maybe we should split i_version updates from ctime updates.
> > 
> > While it isn't true that ctime updates have happened before the write
> > "forever" it has been true since 2.3.43[1] which is close to forever.
> > 
> > For ctime there doesn't appear to be a strong specification of when the
> > change happens, so history provides a good case for leaving it before.
> > For i_version we want to provide clear and unambiguous semantics.
> > Performing 2 updates makes the specification muddy.
> > 
> > So I would prefer a single update for i_version, performed after the
> > change becomes visible.  If that means it has to be separate from ctime,
> > then so be it.
> > 
> > NeilBrown
> > 
> > 
> > [1]:  https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29
> 
> 
> Not necessarily. We can document it in such a way that bumping it twice
> is allowed, but not required.
> 
> My main concern with splitting them up is that we'd have to dirty the
> inode twice if both the times and the i_version need updating. If the
> inode gets written out in between, then we end up doing twice the I/O.
> The interim on-disk metadata would be in sort of a weird state too --
> the ctime would have changed but the version would still be old.
> 
> It might be worthwhile to just go ahead and continue bumping it in
> file_update_time, and then we'd just attempt to bump the i_version again
> afterward. The second bump will almost always be a no-op anyway.

I"m probably starting to sound like a scratched record here, but this is
why I think it should be up to the filesystem to bump i_version when it
determines that it should.  It should be in a position to include the
i_version update any time that it writes the inode and so avoid a double
write.

Having that vfs/mm do so much of the work makes it hard for the
filesystem to do the right amount of work.  The common code should
provide libraries of useful code, the filesystems should call that as
appropriate. Some of our code is structured that way, some of it isn't.

Most callers of file_update_time() are inside filesystems and that is
good - they are in control.
There are 3 in mm/*.c.  Those are all in callbacks from the filesystem,
so the fs could avoid them, but only by duplicating lots of code to
avoid using the callback.  Instead these file_update_time() calls should
become more explicit calls into the filesystem telling the filesystem
what has just happened, or is about to happen.  Then the filesystem can
do the right thing, rather than having something done to it.

See also https://lwn.net/Articles/336262/ and the "midlayer mistake".

But yes, doing the bump afterwards as well is likely to be a no-op most
of the time and is probably the easy solution.  Ugly, but easy.

NeilBrown

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-05 21:14       ` NeilBrown
@ 2022-10-06 11:15         ` Jeff Layton
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2022-10-06 11:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: tytso, adilger.kernel, djwong, david, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Thu, 2022-10-06 at 08:14 +1100, NeilBrown wrote:
> On Wed, 05 Oct 2022, Jeff Layton wrote:
> > On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote:
> > > On Fri, 30 Sep 2022, Jeff Layton wrote:
> > > > Now that we can call into vfs_getattr to get the i_version field, use
> > > > that facility to fetch it instead of doing it in nfsd4_change_attribute.
> > > > 
> > > > Neil also pointed out recently that IS_I_VERSION directory operations
> > > > are always logged, and so we only need to mitigate the rollback problem
> > > > on regular files. Also, we don't need to factor in the ctime when
> > > > reexporting NFS or Ceph.
> > > > 
> > > > Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> > > > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > > > generating the change attr, look at the result mask and only use it if
> > > > STATX_VERSION is set. With this change, we can drop the fetch_iversion
> > > > export operation as well.
> > > > 
> > > > Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> > > > in the ctime if it's a regular file and the fs doesn't advertise
> > > > STATX_ATTR_VERSION_MONOTONIC.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/nfs/export.c          |  7 -------
> > > >  fs/nfsd/nfs4xdr.c        |  4 +++-
> > > >  fs/nfsd/nfsfh.c          | 40 ++++++++++++++++++++++++++++++++++++++++
> > > >  fs/nfsd/nfsfh.h          | 29 +----------------------------
> > > >  fs/nfsd/vfs.h            |  7 ++++++-
> > > >  include/linux/exportfs.h |  1 -
> > > >  6 files changed, 50 insertions(+), 38 deletions(-)
> > > > 
> > > > diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> > > > index 01596f2d0a1e..1a9d5aa51dfb 100644
> > > > --- a/fs/nfs/export.c
> > > > +++ b/fs/nfs/export.c
> > > > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry)
> > > >  	return parent;
> > > >  }
> > > >  
> > > > -static u64 nfs_fetch_iversion(struct inode *inode)
> > > > -{
> > > > -	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> > > > -	return inode_peek_iversion_raw(inode);
> > > > -}
> > > > -
> > > >  const struct export_operations nfs_export_ops = {
> > > >  	.encode_fh = nfs_encode_fh,
> > > >  	.fh_to_dentry = nfs_fh_to_dentry,
> > > >  	.get_parent = nfs_get_parent,
> > > > -	.fetch_iversion = nfs_fetch_iversion,
> > > >  	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
> > > >  		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS|
> > > >  		EXPORT_OP_NOATOMIC_ATTR,
> > > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > > index 1e9690a061ec..779c009314c6 100644
> > > > --- a/fs/nfsd/nfs4xdr.c
> > > > +++ b/fs/nfsd/nfs4xdr.c
> > > > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> > > >  			goto out;
> > > >  	}
> > > >  
> > > > -	err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
> > > > +	err = vfs_getattr(&path, &stat,
> > > > +			  STATX_BASIC_STATS | STATX_BTIME | STATX_VERSION,
> > > > +			  AT_STATX_SYNC_AS_STAT);
> > > >  	if (err)
> > > >  		goto out_nfserr;
> > > >  	if (!(stat.result_mask & STATX_BTIME))
> > > > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > > > index a5b71526cee0..9168bc657378 100644
> > > > --- a/fs/nfsd/nfsfh.c
> > > > +++ b/fs/nfsd/nfsfh.c
> > > > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
> > > >  		stat.mtime = inode->i_mtime;
> > > >  		stat.ctime = inode->i_ctime;
> > > >  		stat.size  = inode->i_size;
> > > > +		if (v4 && IS_I_VERSION(inode)) {
> > > > +			stat.version = inode_query_iversion(inode);
> > > > +			stat.result_mask |= STATX_VERSION;
> > > > +		}
> > > 
> > > This is increasingly ugly.  I wonder if it is justified at all...
> > > 
> > 
> > I'm fine with dropping that. So if the getattrs fail, we should just not
> > offer up pre/post attrs?
> > 
> > > >  	}
> > > >  	if (v4)
> > > >  		fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
> > > > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
> > > >  	if (err) {
> > > >  		fhp->fh_post_saved = false;
> > > >  		fhp->fh_post_attr.ctime = inode->i_ctime;
> > > > +		if (v4 && IS_I_VERSION(inode))
> > > > +			fhp->fh_post_attr.version = inode_query_iversion(inode);
> > > 
> > > ... ditto ...
> > > 
> > > >  	} else
> > > >  		fhp->fh_post_saved = true;
> > > >  	if (v4)
> > > > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct svc_fh *fhp)
> > > >  		return FSIDSOURCE_UUID;
> > > >  	return FSIDSOURCE_DEV;
> > > >  }
> > > > +
> > > > +/*
> > > > + * We could use i_version alone as the change attribute.  However, i_version
> > > > + * can go backwards on a regular file after an unclean shutdown.  On its own
> > > > + * that doesn't necessarily cause a problem, but if i_version goes backwards
> > > > + * and then is incremented again it could reuse a value that was previously
> > > > + * used before boot, and a client who queried the two values might incorrectly
> > > > + * assume nothing changed.
> > > > + *
> > > > + * By using both ctime and the i_version counter we guarantee that as long as
> > > > + * time doesn't go backwards we never reuse an old value. If the filesystem
> > > > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> > > > + *
> > > > + * We only need to do this for regular files as well. For directories, we
> > > > + * assume that the new change attr is always logged to stable storage in some
> > > > + * fashion before the results can be seen.
> > > > + */
> > > > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> > > > +{
> > > > +	u64 chattr;
> > > > +
> > > > +	if (stat->result_mask & STATX_VERSION) {
> > > > +		chattr = stat->version;
> > > > +
> > > > +		if (S_ISREG(inode->i_mode) &&
> > > > +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> > > 
> > > I would really rather that the fs got to make this decision.
> > > If it can guarantee that the i_version is monotonic even over a crash
> > > (which is probably can for directory, and might need changes to do for
> > > files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> > > completely.
> > > If it cannot, then it doesn't set the flag.
> > > i.e. the S_ISREG() test should be in the filesystem, not in nfsd.
> > > 
> > 
> > This sounds reasonable, but for one thing.
> > 
> > From RFC 7862:
> > 
> >    While Section 5.4 of [RFC5661] discusses
> >    per-file system attributes, it is expected that the value of
> >    change_attr_type will not depend on the value of "homogeneous" and
> >    will only change in the event of a migration.
> > 
> > The change_attr_type4 must be the same for all filehandles under a
> > particular filesystem.
> > 
> > If we do what you suggest though, then it's easily possible for the fs
> > to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If we
> > later want to allow nfsd to advertise a change_attr_type4, we won't be
> > able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to fill
> > that out.
> > 
> > Maybe that's ok. I suppose we could add a new field to the export
> > options that filesystems can set to advertise what sort of change attr
> > they offer?
> > 
> 
> There are 3 cases:
> 1/ a file/dir which advertises MONOTONIC is easy to handle.
> 2/ an IS_I_VERSION file/dir that does not advertise MONOTONIC will only fail
>    to be MONOTONIC across unclean restart (correct?).  nfsd can
>    compensate using an xattr on the root to count crashes, or just adding ctime.
> 3/ a non-IS_I_VERSION fs that does not advertise MONOTONIC cannot
>    be compensated for by nfsd.
> 
> If we ever want nfsd to advertise MONOTONIC, then we must be able to
> reject non-IS_I_VERSION filesystems that don't advertise MONOTONIC on
> all files.
> 
> Maybe we need a global nfsd option which defaults to "monotoric" and
> causes those files to be rejected, but can be set to "non-monotonic" and
> then allows all files to be exported.
> 
> It would be nice to make it easy to run multiple nfsd instances each on a
> different IP address.  Each can then have different options.  This could
> also be used to reexport an NFS mount using unmodified filehandles.
> 
> Currently you need a network namespace to create a new nfsd.  I wonder
> if that is a little too much of a barrier.  But maybe we could automate
> the creation of working network namespaces for nfsd....
> 

My current thinking is to just allow the filesystem to set
STATX_ATTR_VERSION_MONOTONIC flag on a per-inode basis, and create a new
change_attr_type() export operation and leave it up to the filesystem to
fill that out appropriately.

I think that'll give us maximum flexibility, and would also allow NFS to
pass the change attr type from the server directly through to the client
when reexporting.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version
  2022-10-03 23:39   ` NeilBrown
  2022-10-05 10:06     ` Jeff Layton
@ 2022-10-06 21:17     ` Dave Chinner
  1 sibling, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2022-10-06 21:17 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, tytso, adilger.kernel, djwong, trondmy, viro, zohar,
	xiubli, chuck.lever, lczerner, jack, bfields, brauner, fweimer,
	linux-btrfs, linux-fsdevel, linux-kernel, ceph-devel, linux-ext4,
	linux-nfs, linux-xfs

On Tue, Oct 04, 2022 at 10:39:09AM +1100, NeilBrown wrote:
> On Fri, 30 Sep 2022, Jeff Layton wrote:
> > Now that we can call into vfs_getattr to get the i_version field, use
> > that facility to fetch it instead of doing it in nfsd4_change_attribute.
> > 
> > Neil also pointed out recently that IS_I_VERSION directory operations
> > are always logged, and so we only need to mitigate the rollback problem
> > on regular files. Also, we don't need to factor in the ctime when
> > reexporting NFS or Ceph.
> > 
> > Set the STATX_VERSION (and BTIME) bits in the request when we're dealing
> > with a v4 request. Then, instead of looking at IS_I_VERSION when
> > generating the change attr, look at the result mask and only use it if
> > STATX_VERSION is set. With this change, we can drop the fetch_iversion
> > export operation as well.
> > 
> > Move nfsd4_change_attribute into nfsfh.c, and change it to only factor
> > in the ctime if it's a regular file and the fs doesn't advertise
> > STATX_ATTR_VERSION_MONOTONIC.
....
> > +
> > +/*
> > + * We could use i_version alone as the change attribute.  However, i_version
> > + * can go backwards on a regular file after an unclean shutdown.  On its own
> > + * that doesn't necessarily cause a problem, but if i_version goes backwards
> > + * and then is incremented again it could reuse a value that was previously
> > + * used before boot, and a client who queried the two values might incorrectly
> > + * assume nothing changed.
> > + *
> > + * By using both ctime and the i_version counter we guarantee that as long as
> > + * time doesn't go backwards we never reuse an old value. If the filesystem
> > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation is not needed.
> > + *
> > + * We only need to do this for regular files as well. For directories, we
> > + * assume that the new change attr is always logged to stable storage in some
> > + * fashion before the results can be seen.
> > + */
> > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode *inode)
> > +{
> > +	u64 chattr;
> > +
> > +	if (stat->result_mask & STATX_VERSION) {
> > +		chattr = stat->version;
> > +
> > +		if (S_ISREG(inode->i_mode) &&
> > +		    !(stat->attributes & STATX_ATTR_VERSION_MONOTONIC)) {
> 
> I would really rather that the fs got to make this decision.
> If it can guarantee that the i_version is monotonic even over a crash
> (which is probably can for directory, and might need changes to do for
> files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it
> completely.

If you want a internal-to-nfsd solution to this monotonic iversion
problem for file data writes, then all we need to do is take the NFS
server back to NFSv2 days where all server side data writes writes
were performed as O_DSYNC writes. The NFSv3 protocol modifications
for (unstable writes + COMMIT) to allow the NFS server to use
volatile caching is the source of all the i_version persistent
guarantees we are talking about here....

So if the NFS server uses O_DSYNC writes again, by the time the
WRITE response is sent to the client (which can now use
NFS_DATA_SYNC instead of NFS_UNSTABLE so the client can mark pages
clean immediately) both the file data and the inode metadata have
been persisted.

Further, if the nfsd runs ->commit_metadata after the write has
completed, the nfsd now has a guarantee that i_version it places in
the post-ops is both persistent and covers the data change that was
just made. Problem solved, yes?

Ideally we'd want writethrough IO operation with ->commit_metadata
then guaranteeing both completed data and metadata changes are
persistent (XFS already provides this guarantee) for efficiency,
but the point remains that STATX_ATTR_VERSION_MONOTONIC behaviour
can already be implemented by the NFS server application regardless
of when the underlying filesystem bumps i_version for data writes...

Yeah, I know writethrough can result in some workloads being a bit
slower, but NFS clients are already implemented to hide write
latency from the applications. Also, NFS servers are being built
from fast SSDs these days, so this mitigates the worst of the
latency issues that writethrough IO creates....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-10-06 21:17 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-30 11:18 [PATCH v6 0/9] vfs/nfsd: clean up handling of i_version counter Jeff Layton
2022-09-30 11:18 ` [PATCH v6 1/9] iversion: move inode_query_iversion to libfs.c Jeff Layton
2022-10-03 23:05   ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 2/9] iversion: clarify when the i_version counter must be updated Jeff Layton
2022-10-03 23:10   ` NeilBrown
2022-10-04  9:53     ` Jeff Layton
2022-09-30 11:18 ` [PATCH v6 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
2022-10-03 23:14   ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 4/9] nfs: report the inode version in getattr if requested Jeff Layton
2022-10-03 23:29   ` NeilBrown
2022-10-04  9:43     ` Jeff Layton
2022-10-04 22:27       ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 5/9] ceph: " Jeff Layton
2022-09-30 11:18 ` [PATCH v6 6/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
2022-09-30 14:34   ` Chuck Lever III
2022-09-30 22:32     ` Dave Chinner
2022-10-03 23:39   ` NeilBrown
2022-10-05 10:06     ` Jeff Layton
2022-10-05 13:33       ` Chuck Lever III
2022-10-05 13:34       ` Trond Myklebust
2022-10-05 13:57         ` Jeff Layton
2022-10-05 21:14       ` NeilBrown
2022-10-06 11:15         ` Jeff Layton
2022-10-06 21:17     ` Dave Chinner
2022-09-30 11:18 ` [PATCH v6 7/9] vfs: expose STATX_VERSION to userland Jeff Layton
2022-10-03 23:42   ` NeilBrown
2022-10-05 10:08     ` Jeff Layton
2022-09-30 11:18 ` [PATCH v6 8/9] vfs: update times after copying data in __generic_file_write_iter Jeff Layton
2022-10-02  7:08   ` Amir Goldstein
2022-10-03 13:01     ` Jeff Layton
2022-10-03 13:52       ` Amir Goldstein
2022-10-03 22:56         ` NeilBrown
2022-10-05 16:40           ` Jeff Layton
2022-10-05 21:40             ` NeilBrown
2022-09-30 11:18 ` [PATCH v6 9/9] ext4: update times after I/O in write codepaths Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).