All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] Patches to support NFS re-exporting
@ 2020-11-30 21:24 trondmy
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
  2020-11-30 21:40 ` [PATCH 0/6] Patches to support NFS re-exporting Chuck Lever
  0 siblings, 2 replies; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

These patches fix a number of issues that Hammerspace has hit when doing
re-exporting of NFS.

Jeff Layton (3):
  nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  nfsd: allow filesystems to opt out of subtree checking
  nfsd: close cached files prior to a REMOVE or RENAME that would
    replace target

Trond Myklebust (3):
  exportfs: Add a function to return the raw output from fh_to_dentry()
  nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  nfsd: Set PF_LOCAL_THROTTLE on local filesystems only

 Documentation/filesystems/nfs/exporting.rst | 52 +++++++++++++++++++++
 fs/exportfs/expfs.c                         | 32 +++++++++----
 fs/nfs/export.c                             |  2 +
 fs/nfsd/export.c                            |  6 +++
 fs/nfsd/nfs3xdr.c                           |  7 ++-
 fs/nfsd/nfsfh.c                             | 30 ++++++++++--
 fs/nfsd/nfsfh.h                             |  2 +-
 fs/nfsd/vfs.c                               | 29 ++++++++----
 include/linux/exportfs.h                    | 10 ++++
 9 files changed, 146 insertions(+), 24 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 21:24 [PATCH 0/6] Patches to support NFS re-exporting trondmy
@ 2020-11-30 21:24 ` trondmy
  2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
                     ` (3 more replies)
  2020-11-30 21:40 ` [PATCH 0/6] Patches to support NFS re-exporting Chuck Lever
  1 sibling, 4 replies; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Jeff Layton <jeff.layton@primarydata.com>

With NFSv3 nfsd will always attempt to send along WCC data to the
client. This generally involves saving off the in-core inode information
prior to doing the operation on the given filehandle, and then issuing a
vfs_getattr to it after the op.

Some filesystems (particularly clustered or networked ones) have an
expensive ->getattr inode operation. Atomicitiy is also often difficult
or impossible to guarantee on such filesystems. For those, we're best
off not trying to provide WCC information to the client at all, and to
simply allow it to poll for that information as needed with a GETATTR
RPC.

This patch adds a new flags field to struct export_operations, and
defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
that nfsd should not attempt to provide WCC info in NFSv3 replies. It
also adds a blurb about the new flags field and flag to the exporting
documentation.

The server will also now skip collecting this information for NFSv2 as
well, since that info is never used there anyway.

Note that this patch does not add this flag to any filesystem
export_operations structures. This was originally developed to allow
reexporting nfs via nfsd. That code is not (and may never be) suitable
for merging into mainline.

Other filesystems may want to consider enabling this flag too. It's hard
to tell however which ones have export operations to enable export via
knfsd and which ones mostly rely on them for open-by-filehandle support,
so I'm leaving that up to the individual maintainers to decide. I am
cc'ing the relevant lists for those filesystems that I think may want to
consider adding this though.

Cc: HPDD-discuss@lists.01.org
Cc: ceph-devel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: fuse-devel@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 Documentation/filesystems/nfs/exporting.rst | 27 +++++++++++++++++++++
 fs/nfs/export.c                             |  1 +
 fs/nfsd/nfs3xdr.c                           |  7 ++++--
 fs/nfsd/nfsfh.c                             | 14 +++++++++++
 fs/nfsd/nfsfh.h                             |  2 +-
 include/linux/exportfs.h                    |  2 ++
 6 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 33d588a01ace..a3e3805833d1 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -154,6 +154,11 @@ struct which has the following members:
     to find potential names, and matches inode numbers to find the correct
     match.
 
+  flags
+    Some filesystems may need to be handled differently than others. The
+    export_operations struct also includes a flags field that allows the
+    filesystem to communicate such information to nfsd. See the Export
+    Operations Flags section below for more explanation.
 
 A filehandle fragment consists of an array of 1 or more 4byte words,
 together with a one byte "type".
@@ -163,3 +168,25 @@ generated by encode_fh, in which case it will have been padded with
 nuls.  Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how
 it should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+  EXPORT_OP_NOWCC
+    RFC 1813 recommends that servers always send weak cache consistency
+    (WCC) data to the client after each operation. The server should
+    atomically collect attributes about the inode, do an operation on it,
+    and then collect the attributes afterward. This allows the client to
+    skip issuing GETATTRs in some situations but means that the server
+    is calling vfs_getattr for almost all RPCs. On some filesystems
+    (particularly those that are clustered or networked) this is expensive
+    and atomicity is difficult to guarantee. This flag indicates to nfsd
+    that it should skip providing WCC attributes to the client in NFSv3
+    replies when doing operations on this filesystem. Consider enabling
+    this on filesystems that have an expensive ->getattr inode operation,
+    or when atomicity between pre and post operation attribute collection
+    is impossible to guarantee.
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 3430d6891e89..8f4c528865c5 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
+	.flags = EXPORT_OP_NOWCC,
 };
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2277f83da250..480342675292 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -206,7 +206,7 @@ static __be32 *
 encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
 {
 	struct dentry *dentry = fhp->fh_dentry;
-	if (dentry && d_really_is_positive(dentry)) {
+	if (!fhp->fh_no_wcc && dentry && d_really_is_positive(dentry)) {
 	        __be32 err;
 		struct kstat stat;
 
@@ -261,7 +261,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
 	struct kstat	stat;
 	__be32 err;
 
-	if (fhp->fh_pre_saved)
+	if (fhp->fh_no_wcc || fhp->fh_pre_saved)
 		return;
 
 	inode = d_inode(fhp->fh_dentry);
@@ -287,6 +287,9 @@ void fill_post_wcc(struct svc_fh *fhp)
 {
 	__be32 err;
 
+	if (fhp->fh_no_wcc)
+		return;
+
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbbad8792..0c2ee65e46f3 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -291,6 +291,16 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
 
 	fhp->fh_dentry = dentry;
 	fhp->fh_export = exp;
+
+	switch (rqstp->rq_vers) {
+	case 3:
+		if (!(dentry->d_sb->s_export_op->flags & EXPORT_OP_NOWCC))
+			break;
+		/* Fallthrough */
+	case 2:
+		fhp->fh_no_wcc = true;
+	}
+
 	return 0;
 out:
 	exp_put(exp);
@@ -559,6 +569,9 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	 */
 	set_version_and_fsid_type(fhp, exp, ref_fh);
 
+	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
+	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
+
 	if (ref_fh == fhp)
 		fh_put(ref_fh);
 
@@ -662,6 +675,7 @@ fh_put(struct svc_fh *fhp)
 		exp_put(exp);
 		fhp->fh_export = NULL;
 	}
+	fhp->fh_no_wcc = false;
 	return;
 }
 
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 56cfbc361561..fb2b60a76b32 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -35,6 +35,7 @@ typedef struct svc_fh {
 
 	bool			fh_locked;	/* inode locked by us */
 	bool			fh_want_write;	/* remount protection taken */
+	bool			fh_no_wcc;	/* no wcc data needed */
 	int			fh_flags;	/* FH flags */
 #ifdef CONFIG_NFSD_V3
 	bool			fh_post_saved;	/* post-op attrs saved */
@@ -54,7 +55,6 @@ typedef struct svc_fh {
 	struct kstat		fh_post_attr;	/* full attrs after operation */
 	u64			fh_post_change; /* nfsv4 change; see above */
 #endif /* CONFIG_NFSD_V3 */
-
 } svc_fh;
 #define NFSD4_FH_FOREIGN (1<<0)
 #define SET_FH_FLAG(c, f) ((c)->fh_flags |= (f))
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3ceb72b67a7a..e7de0103a32e 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,6 +213,8 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
+#define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
+	unsigned long	flags;
 };
 
 extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
@ 2020-11-30 21:24   ` trondmy
  2020-11-30 21:24     ` [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target trondmy
  2020-11-30 22:59     ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking J. Bruce Fields
  2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Jeff Layton <jeff.layton@primarydata.com>

When we start allowing NFS to be reexported, then we have some problems
when it comes to subtree checking. In principle, we could allow it, but
it would mean encoding parent info in the filehandles and there may not
be enough space for that in a NFSv3 filehandle.

To enforce this at export upcall time, we add a new export_ops flag
that declares the filesystem ineligible for subtree checking.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 Documentation/filesystems/nfs/exporting.rst | 14 +++++++++++++-
 fs/nfs/export.c                             |  2 +-
 fs/nfsd/export.c                            |  6 ++++++
 include/linux/exportfs.h                    |  1 +
 4 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index a3e3805833d1..960be64446cb 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -176,7 +176,7 @@ contains a "flags" field that allows the filesystem to communicate to nfsd
 that it may want to do things differently when dealing with it. The
 following flags are defined:
 
-  EXPORT_OP_NOWCC
+  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
     RFC 1813 recommends that servers always send weak cache consistency
     (WCC) data to the client after each operation. The server should
     atomically collect attributes about the inode, do an operation on it,
@@ -190,3 +190,15 @@ following flags are defined:
     this on filesystems that have an expensive ->getattr inode operation,
     or when atomicity between pre and post operation attribute collection
     is impossible to guarantee.
+
+  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+    Many NFS operations deal with filehandles, which the server must then
+    vet to ensure that they live inside of an exported tree. When the
+    export consists of an entire filesystem, this is trivial. nfsd can just
+    ensure that the filehandle live on the filesystem. When only part of a
+    filesystem is exported however, then nfsd must walk the ancestors of the
+    inode to ensure that it's within an exported subtree. This is an
+    expensive operation and not all filesystems can support it properly.
+    This flag exempts the filesystem from subtree checking and causes
+    exportfs to get back an error if it tries to enable subtree checking
+    on it.
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 8f4c528865c5..b9ba306bf912 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,5 +171,5 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.flags = EXPORT_OP_NOWCC,
+	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK,
 };
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 21e404e7cb68..81e7bb12aca6 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -408,6 +408,12 @@ static int check_export(struct inode *inode, int *flags, unsigned char *uuid)
 		return -EINVAL;
 	}
 
+	if (inode->i_sb->s_export_op->flags & EXPORT_OP_NOSUBTREECHK &&
+	    !(*flags & NFSEXP_NOSUBTREECHECK)) {
+		dprintk("%s: %s does not support subtree checking!\n",
+			__func__, inode->i_sb->s_type->name);
+		return -EINVAL;
+	}
 	return 0;
 
 }
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index e7de0103a32e..2fcbab0f6b61 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -214,6 +214,7 @@ struct export_operations {
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
 #define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
+#define	EXPORT_OP_NOSUBTREECHK	(0x2)	/* Subtree checking is not supported! */
 	unsigned long	flags;
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target
  2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
@ 2020-11-30 21:24     ` trondmy
  2020-11-30 21:24       ` [PATCH 4/6] exportfs: Add a function to return the raw output from fh_to_dentry() trondmy
  2020-11-30 22:59     ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking J. Bruce Fields
  1 sibling, 1 reply; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Jeff Layton <jeff.layton@primarydata.com>

It's not uncommon for some workloads to do a bunch of I/O to a file and
delete it just afterward. If knfsd has a cached open file however, then
the file may still be open when the dentry is unlinked. If the
underlying filesystem is nfs, then that could trigger it to do a
sillyrename.

On a REMOVE or RENAME scan the nfsd_file cache for open files that
correspond to the inode, and proactively unhash and put their
references. This should prevent any delete-on-last-close activity from
occurring, solely due to knfsd's open file cache.

This must be done synchronously though so we use the variants that call
flush_delayed_fput. There are deadlock possibilities if you call
flush_delayed_fput while holding locks, however. In the case of
nfsd_rename, we don't even do the lookups of the dentries to be renamed
until we've locked for rename.

Once we've figured out what the target dentry is for a rename, check to
see whether there are cached open files associated with it. If there
are, then unwind all of the locking, close them all, and then reattempt
the rename.

None of this is really necessary for "typical" filesystems though. It's
mostly of use for NFS, so declare a new export op flag and use that to
determine whether to close the files beforehand.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 Documentation/filesystems/nfs/exporting.rst | 13 +++++++++++++
 fs/nfs/export.c                             |  2 +-
 fs/nfsd/vfs.c                               | 16 +++++++++-------
 include/linux/exportfs.h                    |  5 +++--
 4 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 960be64446cb..0e98edd353b5 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -202,3 +202,16 @@ following flags are defined:
     This flag exempts the filesystem from subtree checking and causes
     exportfs to get back an error if it tries to enable subtree checking
     on it.
+
+  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+    On some exportable filesystems (such as NFS) unlinking a file that
+    is still open can cause a fair bit of extra work. For instance,
+    the NFS client will do a "sillyrename" to ensure that the file
+    sticks around while it's still open. When reexporting, that open
+    file is held by nfsd so we usually end up doing a sillyrename, and
+    then immediately deleting the sillyrenamed file just afterward when
+    the link count actually goes to zero. Sometimes this delete can race
+    with other operations (for instance an rmdir of the parent directory).
+    This flag causes nfsd to close any open files for this inode _before_
+    calling into the vfs to do an unlink or a rename that would replace
+    an existing file.
diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index b9ba306bf912..5428713af5fe 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,5 +171,5 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK,
+	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|EXPORT_OP_CLOSE_BEFORE_UNLINK,
 };
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecaceebee13..79cba942087e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1724,7 +1724,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	struct inode	*fdir, *tdir;
 	__be32		err;
 	int		host_err;
-	bool		has_cached = false;
+	bool		close_cached = false;
 
 	err = fh_verify(rqstp, ffhp, S_IFDIR, NFSD_MAY_REMOVE);
 	if (err)
@@ -1783,8 +1783,9 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	if (ffhp->fh_export->ex_path.dentry != tfhp->fh_export->ex_path.dentry)
 		goto out_dput_new;
 
-	if (nfsd_has_cached_files(ndentry)) {
-		has_cached = true;
+	if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
+	    nfsd_has_cached_files(ndentry)) {
+		close_cached = true;
 		goto out_dput_old;
 	} else {
 		host_err = vfs_rename(fdir, odentry, tdir, ndentry, NULL, 0);
@@ -1805,7 +1806,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	 * as that would do the wrong thing if the two directories
 	 * were the same, so again we do it by hand.
 	 */
-	if (!has_cached) {
+	if (!close_cached) {
 		fill_post_wcc(ffhp);
 		fill_post_wcc(tfhp);
 	}
@@ -1819,8 +1820,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	 * shouldn't be done with locks held however, so we delay it until this
 	 * point and then reattempt the whole shebang.
 	 */
-	if (has_cached) {
-		has_cached = false;
+	if (close_cached) {
+		close_cached = false;
 		nfsd_close_cached_files(ndentry);
 		dput(ndentry);
 		goto retry;
@@ -1872,7 +1873,8 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 		type = d_inode(rdentry)->i_mode & S_IFMT;
 
 	if (type != S_IFDIR) {
-		nfsd_close_cached_files(rdentry);
+		if (rdentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK)
+			nfsd_close_cached_files(rdentry);
 		host_err = vfs_unlink(dirp, rdentry, NULL);
 	} else {
 		host_err = vfs_rmdir(dirp, rdentry);
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 2fcbab0f6b61..d829403ffd3b 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -213,8 +213,9 @@ struct export_operations {
 			  bool write, u32 *device_generation);
 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
 			     int nr_iomaps, struct iattr *iattr);
-#define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
-#define	EXPORT_OP_NOSUBTREECHK	(0x2)	/* Subtree checking is not supported! */
+#define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
+#define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
+#define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
 	unsigned long	flags;
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/6] exportfs: Add a function to return the raw output from fh_to_dentry()
  2020-11-30 21:24     ` [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target trondmy
@ 2020-11-30 21:24       ` trondmy
  2020-11-30 21:24         ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE trondmy
  0 siblings, 1 reply; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order to allow nfsd to accept return values that are not
acceptable to overlayfs and others, add a new function.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/exportfs/expfs.c      | 32 ++++++++++++++++++++++++--------
 include/linux/exportfs.h |  5 +++++
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 2dd55b172d57..0106eba46d5a 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -417,9 +417,11 @@ int exportfs_encode_fh(struct dentry *dentry, struct fid *fid, int *max_len,
 }
 EXPORT_SYMBOL_GPL(exportfs_encode_fh);
 
-struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
-		int fh_len, int fileid_type,
-		int (*acceptable)(void *, struct dentry *), void *context)
+struct dentry *
+exportfs_decode_fh_raw(struct vfsmount *mnt, struct fid *fid, int fh_len,
+		       int fileid_type,
+		       int (*acceptable)(void *, struct dentry *),
+		       void *context)
 {
 	const struct export_operations *nop = mnt->mnt_sb->s_export_op;
 	struct dentry *result, *alias;
@@ -432,10 +434,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 	if (!nop || !nop->fh_to_dentry)
 		return ERR_PTR(-ESTALE);
 	result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
-	if (PTR_ERR(result) == -ENOMEM)
-		return ERR_CAST(result);
 	if (IS_ERR_OR_NULL(result))
-		return ERR_PTR(-ESTALE);
+		return result;
 
 	/*
 	 * If no acceptance criteria was specified by caller, a disconnected
@@ -561,10 +561,26 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 
  err_result:
 	dput(result);
-	if (err != -ENOMEM)
-		err = -ESTALE;
 	return ERR_PTR(err);
 }
+EXPORT_SYMBOL_GPL(exportfs_decode_fh_raw);
+
+struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
+				  int fh_len, int fileid_type,
+				  int (*acceptable)(void *, struct dentry *),
+				  void *context)
+{
+	struct dentry *ret;
+
+	ret = exportfs_decode_fh_raw(mnt, fid, fh_len, fileid_type,
+				     acceptable, context);
+	if (IS_ERR_OR_NULL(ret)) {
+		if (ret == ERR_PTR(-ENOMEM))
+			return ret;
+		return ERR_PTR(-ESTALE);
+	}
+	return ret;
+}
 EXPORT_SYMBOL_GPL(exportfs_decode_fh);
 
 MODULE_LICENSE("GPL");
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index d829403ffd3b..846df3c96730 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -223,6 +223,11 @@ extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
 				    int *max_len, struct inode *parent);
 extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
 	int *max_len, int connectable);
+extern struct dentry *exportfs_decode_fh_raw(struct vfsmount *mnt,
+					     struct fid *fid, int fh_len,
+					     int fileid_type,
+					     int (*acceptable)(void *, struct dentry *),
+					     void *context);
 extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 	int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
 	void *context);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  2020-11-30 21:24       ` [PATCH 4/6] exportfs: Add a function to return the raw output from fh_to_dentry() trondmy
@ 2020-11-30 21:24         ` trondmy
  2020-11-30 21:24           ` [PATCH 6/6] nfsd: Set PF_LOCAL_THROTTLE on local filesystems only trondmy
  2020-11-30 23:05           ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE J. Bruce Fields
  0 siblings, 2 replies; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the underlying filesystem times out, then we want knfsd to return
NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfsd/nfsfh.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 0c2ee65e46f3..46c86f7bc429 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
 	if (fileid_type == FILEID_ROOT)
 		dentry = dget(exp->ex_path.dentry);
 	else {
-		dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
-				data_left, fileid_type,
-				nfsd_acceptable, exp);
-		if (IS_ERR_OR_NULL(dentry))
+		dentry = exportfs_decode_fh_raw(exp->ex_path.mnt, fid,
+						data_left, fileid_type,
+						nfsd_acceptable, exp);
+		if (IS_ERR_OR_NULL(dentry)) {
 			trace_nfsd_set_fh_dentry_badhandle(rqstp, fhp,
 					dentry ?  PTR_ERR(dentry) : -ESTALE);
+			switch (PTR_ERR(dentry)) {
+			case -ENOMEM:
+			case -ETIMEDOUT:
+				break;
+			default:
+				dentry = ERR_PTR(-ESTALE);
+			}
+		}
 	}
 	if (dentry == NULL)
 		goto out;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/6] nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
  2020-11-30 21:24         ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE trondmy
@ 2020-11-30 21:24           ` trondmy
  2020-11-30 23:05           ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE J. Bruce Fields
  1 sibling, 0 replies; 28+ messages in thread
From: trondmy @ 2020-11-30 21:24 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they
aren't expected to ever be subject to double buffering.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/export.c          |  3 ++-
 fs/nfsd/vfs.c            | 13 +++++++++++--
 include/linux/exportfs.h |  1 +
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/export.c b/fs/nfs/export.c
index 5428713af5fe..48b879cfe6e3 100644
--- a/fs/nfs/export.c
+++ b/fs/nfs/export.c
@@ -171,5 +171,6 @@ const struct export_operations nfs_export_ops = {
 	.encode_fh = nfs_encode_fh,
 	.fh_to_dentry = nfs_fh_to_dentry,
 	.get_parent = nfs_get_parent,
-	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|EXPORT_OP_CLOSE_BEFORE_UNLINK,
+	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK|
+		EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS,
 };
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 79cba942087e..04937e51de56 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -978,18 +978,25 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 				__be32 *verf)
 {
 	struct file		*file = nf->nf_file;
+	struct super_block	*sb = file_inode(file)->i_sb;
 	struct svc_export	*exp;
 	struct iov_iter		iter;
 	__be32			nfserr;
 	int			host_err;
 	int			use_wgather;
 	loff_t			pos = offset;
+	unsigned long		exp_op_flags = 0;
 	unsigned int		pflags = current->flags;
 	rwf_t			flags = 0;
+	bool			restore_flags = false;
 
 	trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
 
-	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
+	if (sb->s_export_op)
+		exp_op_flags = sb->s_export_op->flags;
+
+	if (test_bit(RQ_LOCAL, &rqstp->rq_flags) &&
+	    !(exp_op_flags & EXPORT_OP_REMOTE_FS)) {
 		/*
 		 * We want throttling in balance_dirty_pages()
 		 * and shrink_inactive_list() to only consider
@@ -998,6 +1005,8 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 		 * the client's dirty pages or its congested queue.
 		 */
 		current->flags |= PF_LOCAL_THROTTLE;
+		restore_flags = true;
+	}
 
 	exp = fhp->fh_export;
 	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1049,7 +1058,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 		trace_nfsd_write_err(rqstp, fhp, offset, host_err);
 		nfserr = nfserrno(host_err);
 	}
-	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
+	if (restore_flags)
 		current_restore_flags(pflags, PF_LOCAL_THROTTLE);
 	return nfserr;
 }
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 846df3c96730..d93e8a6737bb 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -216,6 +216,7 @@ struct export_operations {
 #define	EXPORT_OP_NOWCC			(0x1) /* don't collect v3 wcc data */
 #define	EXPORT_OP_NOSUBTREECHK		(0x2) /* no subtree checking */
 #define	EXPORT_OP_CLOSE_BEFORE_UNLINK	(0x4) /* close files before unlink */
+#define EXPORT_OP_REMOTE_FS		(0x8) /* Filesystem is remote */
 	unsigned long	flags;
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/6] Patches to support NFS re-exporting
  2020-11-30 21:24 [PATCH 0/6] Patches to support NFS re-exporting trondmy
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
@ 2020-11-30 21:40 ` Chuck Lever
  2020-11-30 21:51   ` Trond Myklebust
  1 sibling, 1 reply; 28+ messages in thread
From: Chuck Lever @ 2020-11-30 21:40 UTC (permalink / raw)
  To: trondmy; +Cc: Bruce Fields, Linux NFS Mailing List

Hi Trond-

> On Nov 30, 2020, at 4:24 PM, trondmy@kernel.org wrote:
> 
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> These patches fix a number of issues that Hammerspace has hit when doing
> re-exporting of NFS.

These do not apply on top of Bruce's changes in the same area.
I've prepared a tree that you can apply onto.

See the cel-next topic branch in this repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git


> Jeff Layton (3):
>  nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
>  nfsd: allow filesystems to opt out of subtree checking
>  nfsd: close cached files prior to a REMOVE or RENAME that would
>    replace target
> 
> Trond Myklebust (3):
>  exportfs: Add a function to return the raw output from fh_to_dentry()
>  nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
>  nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
> 
> Documentation/filesystems/nfs/exporting.rst | 52 +++++++++++++++++++++
> fs/exportfs/expfs.c                         | 32 +++++++++----
> fs/nfs/export.c                             |  2 +
> fs/nfsd/export.c                            |  6 +++
> fs/nfsd/nfs3xdr.c                           |  7 ++-
> fs/nfsd/nfsfh.c                             | 30 ++++++++++--
> fs/nfsd/nfsfh.h                             |  2 +-
> fs/nfsd/vfs.c                               | 29 ++++++++----
> include/linux/exportfs.h                    | 10 ++++
> 9 files changed, 146 insertions(+), 24 deletions(-)
> 
> -- 
> 2.28.0
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/6] Patches to support NFS re-exporting
  2020-11-30 21:40 ` [PATCH 0/6] Patches to support NFS re-exporting Chuck Lever
@ 2020-11-30 21:51   ` Trond Myklebust
  0 siblings, 0 replies; 28+ messages in thread
From: Trond Myklebust @ 2020-11-30 21:51 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs, bfields

On Mon, 2020-11-30 at 16:40 -0500, Chuck Lever wrote:
> Hi Trond-
> 
> > On Nov 30, 2020, at 4:24 PM, trondmy@kernel.org wrote:
> > 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > These patches fix a number of issues that Hammerspace has hit when
> > doing
> > re-exporting of NFS.
> 
> These do not apply on top of Bruce's changes in the same area.
> I've prepared a tree that you can apply onto.
> 

Hmm... They rebased cleanly on top of your branch, but I'll resend
those rebased patches.

> See the cel-next topic branch in this repo:
> 
> git://git.linux-nfs.org/projects/cel/cel-2.6.git
> 
> 
> > Jeff Layton (3):
> >  nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
> >  nfsd: allow filesystems to opt out of subtree checking
> >  nfsd: close cached files prior to a REMOVE or RENAME that would
> >    replace target
> > 
> > Trond Myklebust (3):
> >  exportfs: Add a function to return the raw output from
> > fh_to_dentry()
> >  nfsd: Fix up nfsd to ensure that timeout errors don't result in
> > ESTALE
> >  nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
> > 
> > Documentation/filesystems/nfs/exporting.rst | 52
> > +++++++++++++++++++++
> > fs/exportfs/expfs.c                         | 32 +++++++++----
> > fs/nfs/export.c                             |  2 +
> > fs/nfsd/export.c                            |  6 +++
> > fs/nfsd/nfs3xdr.c                           |  7 ++-
> > fs/nfsd/nfsfh.c                             | 30 ++++++++++--
> > fs/nfsd/nfsfh.h                             |  2 +-
> > fs/nfsd/vfs.c                               | 29 ++++++++----
> > include/linux/exportfs.h                    | 10 ++++
> > 9 files changed, 146 insertions(+), 24 deletions(-)
> > 
> > -- 
> > 2.28.0
> > 
> 
> --
> Chuck Lever
> 
> 
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
  2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
@ 2020-11-30 22:58   ` J. Bruce Fields
  2020-12-01  0:33     ` Trond Myklebust
                       ` (2 more replies)
  2020-11-30 23:11   ` Chuck Lever
  2020-12-01  0:14   ` Jeff Layton
  3 siblings, 3 replies; 28+ messages in thread
From: J. Bruce Fields @ 2020-11-30 22:58 UTC (permalink / raw)
  To: trondmy; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

This is great, thanks:

On Mon, Nov 30, 2020 at 04:24:50PM -0500, trondmy@kernel.org wrote:
> From: Jeff Layton <jeff.layton@primarydata.com>
> 
> With NFSv3 nfsd will always attempt to send along WCC data to the
> client. This generally involves saving off the in-core inode information
> prior to doing the operation on the given filehandle, and then issuing a
> vfs_getattr to it after the op.
> 
> Some filesystems (particularly clustered or networked ones) have an
> expensive ->getattr inode operation. Atomicitiy is also often difficult
> or impossible to guarantee on such filesystems. For those, we're best
> off not trying to provide WCC information to the client at all, and to
> simply allow it to poll for that information as needed with a GETATTR
> RPC.
> 
> This patch adds a new flags field to struct export_operations, and
> defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> also adds a blurb about the new flags field and flag to the exporting
> documentation.

In the v4 case I think it should also turn off the "atomic" flag in the
change_info4 structure that's returned by some operations.

(Out of curiosity: have you seen this cause actual bugs?)

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking
  2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
  2020-11-30 21:24     ` [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target trondmy
@ 2020-11-30 22:59     ` J. Bruce Fields
  1 sibling, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2020-11-30 22:59 UTC (permalink / raw)
  To: trondmy; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

Makes sense to me, thanks.--b.

On Mon, Nov 30, 2020 at 04:24:51PM -0500, trondmy@kernel.org wrote:
> From: Jeff Layton <jeff.layton@primarydata.com>
> 
> When we start allowing NFS to be reexported, then we have some problems
> when it comes to subtree checking. In principle, we could allow it, but
> it would mean encoding parent info in the filehandles and there may not
> be enough space for that in a NFSv3 filehandle.
> 
> To enforce this at export upcall time, we add a new export_ops flag
> that declares the filesystem ineligible for subtree checking.
> 
> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  Documentation/filesystems/nfs/exporting.rst | 14 +++++++++++++-
>  fs/nfs/export.c                             |  2 +-
>  fs/nfsd/export.c                            |  6 ++++++
>  include/linux/exportfs.h                    |  1 +
>  4 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
> index a3e3805833d1..960be64446cb 100644
> --- a/Documentation/filesystems/nfs/exporting.rst
> +++ b/Documentation/filesystems/nfs/exporting.rst
> @@ -176,7 +176,7 @@ contains a "flags" field that allows the filesystem to communicate to nfsd
>  that it may want to do things differently when dealing with it. The
>  following flags are defined:
>  
> -  EXPORT_OP_NOWCC
> +  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
>      RFC 1813 recommends that servers always send weak cache consistency
>      (WCC) data to the client after each operation. The server should
>      atomically collect attributes about the inode, do an operation on it,
> @@ -190,3 +190,15 @@ following flags are defined:
>      this on filesystems that have an expensive ->getattr inode operation,
>      or when atomicity between pre and post operation attribute collection
>      is impossible to guarantee.
> +
> +  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
> +    Many NFS operations deal with filehandles, which the server must then
> +    vet to ensure that they live inside of an exported tree. When the
> +    export consists of an entire filesystem, this is trivial. nfsd can just
> +    ensure that the filehandle live on the filesystem. When only part of a
> +    filesystem is exported however, then nfsd must walk the ancestors of the
> +    inode to ensure that it's within an exported subtree. This is an
> +    expensive operation and not all filesystems can support it properly.
> +    This flag exempts the filesystem from subtree checking and causes
> +    exportfs to get back an error if it tries to enable subtree checking
> +    on it.
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 8f4c528865c5..b9ba306bf912 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,5 +171,5 @@ const struct export_operations nfs_export_ops = {
>  	.encode_fh = nfs_encode_fh,
>  	.fh_to_dentry = nfs_fh_to_dentry,
>  	.get_parent = nfs_get_parent,
> -	.flags = EXPORT_OP_NOWCC,
> +	.flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK,
>  };
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 21e404e7cb68..81e7bb12aca6 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -408,6 +408,12 @@ static int check_export(struct inode *inode, int *flags, unsigned char *uuid)
>  		return -EINVAL;
>  	}
>  
> +	if (inode->i_sb->s_export_op->flags & EXPORT_OP_NOSUBTREECHK &&
> +	    !(*flags & NFSEXP_NOSUBTREECHECK)) {
> +		dprintk("%s: %s does not support subtree checking!\n",
> +			__func__, inode->i_sb->s_type->name);
> +		return -EINVAL;
> +	}
>  	return 0;
>  
>  }
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index e7de0103a32e..2fcbab0f6b61 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -214,6 +214,7 @@ struct export_operations {
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
>  #define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
> +#define	EXPORT_OP_NOSUBTREECHK	(0x2)	/* Subtree checking is not supported! */
>  	unsigned long	flags;
>  };
>  
> -- 
> 2.28.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  2020-11-30 21:24         ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE trondmy
  2020-11-30 21:24           ` [PATCH 6/6] nfsd: Set PF_LOCAL_THROTTLE on local filesystems only trondmy
@ 2020-11-30 23:05           ` J. Bruce Fields
  2020-12-01  0:39             ` Trond Myklebust
  1 sibling, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2020-11-30 23:05 UTC (permalink / raw)
  To: trondmy; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

On Mon, Nov 30, 2020 at 04:24:54PM -0500, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> If the underlying filesystem times out, then we want knfsd to return
> NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.

Out of curiosity, what was causing ETIMEDOUT in practice?

--b.

> 
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfsd/nfsfh.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 0c2ee65e46f3..46c86f7bc429 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
>  	if (fileid_type == FILEID_ROOT)
>  		dentry = dget(exp->ex_path.dentry);
>  	else {
> -		dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
> -				data_left, fileid_type,
> -				nfsd_acceptable, exp);
> -		if (IS_ERR_OR_NULL(dentry))
> +		dentry = exportfs_decode_fh_raw(exp->ex_path.mnt, fid,
> +						data_left, fileid_type,
> +						nfsd_acceptable, exp);
> +		if (IS_ERR_OR_NULL(dentry)) {
>  			trace_nfsd_set_fh_dentry_badhandle(rqstp, fhp,
>  					dentry ?  PTR_ERR(dentry) : -ESTALE);
> +			switch (PTR_ERR(dentry)) {
> +			case -ENOMEM:
> +			case -ETIMEDOUT:
> +				break;
> +			default:
> +				dentry = ERR_PTR(-ESTALE);
> +			}
> +		}
>  	}
>  	if (dentry == NULL)
>  		goto out;
> -- 
> 2.28.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
  2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
  2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
@ 2020-11-30 23:11   ` Chuck Lever
  2020-12-01  0:49     ` Trond Myklebust
  2020-12-01  0:14   ` Jeff Layton
  3 siblings, 1 reply; 28+ messages in thread
From: Chuck Lever @ 2020-11-30 23:11 UTC (permalink / raw)
  To: trondmy; +Cc: Bruce Fields, Linux NFS Mailing List



> On Nov 30, 2020, at 4:24 PM, trondmy@kernel.org wrote:
> 
> From: Jeff Layton <jeff.layton@primarydata.com>
> 
> With NFSv3 nfsd will always attempt to send along WCC data to the
> client. This generally involves saving off the in-core inode information
> prior to doing the operation on the given filehandle, and then issuing a
> vfs_getattr to it after the op.
> 
> Some filesystems (particularly clustered or networked ones) have an
> expensive ->getattr inode operation. Atomicitiy is also often difficult
> or impossible to guarantee on such filesystems. For those, we're best
> off not trying to provide WCC information to the client at all, and to
> simply allow it to poll for that information as needed with a GETATTR
> RPC.
> 
> This patch adds a new flags field to struct export_operations, and
> defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> also adds a blurb about the new flags field and flag to the exporting
> documentation.
> 
> The server will also now skip collecting this information for NFSv2 as
> well, since that info is never used there anyway.
> 
> Note that this patch does not add this flag to any filesystem
> export_operations structures. This was originally developed to allow
> reexporting nfs via nfsd. That code is not (and may never be) suitable
> for merging into mainline.
> 
> Other filesystems may want to consider enabling this flag too. It's hard
> to tell however which ones have export operations to enable export via
> knfsd and which ones mostly rely on them for open-by-filehandle support,
> so I'm leaving that up to the individual maintainers to decide. I am
> cc'ing the relevant lists for those filesystems that I think may want to
> consider adding this though.
> 
> Cc: HPDD-discuss@lists.01.org
> Cc: ceph-devel@vger.kernel.org
> Cc: cluster-devel@redhat.com
> Cc: fuse-devel@lists.sourceforge.net
> Cc: ocfs2-devel@oss.oracle.com
> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

These seem to apply fine, thanks for resending.

If you post a v3 to address Bruce's comment, can you also
address this checkpatch nit?


WARNING: Prefer 'fallthrough;' over fallthrough comment
#154: FILE: fs/nfsd/nfsfh.c:299:
+		/* Fallthrough */

total: 0 errors, 1 warnings, 120 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.


> ---
> Documentation/filesystems/nfs/exporting.rst | 27 +++++++++++++++++++++
> fs/nfs/export.c                             |  1 +
> fs/nfsd/nfs3xdr.c                           |  7 ++++--
> fs/nfsd/nfsfh.c                             | 14 +++++++++++
> fs/nfsd/nfsfh.h                             |  2 +-
> include/linux/exportfs.h                    |  2 ++
> 6 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
> index 33d588a01ace..a3e3805833d1 100644
> --- a/Documentation/filesystems/nfs/exporting.rst
> +++ b/Documentation/filesystems/nfs/exporting.rst
> @@ -154,6 +154,11 @@ struct which has the following members:
>     to find potential names, and matches inode numbers to find the correct
>     match.
> 
> +  flags
> +    Some filesystems may need to be handled differently than others. The
> +    export_operations struct also includes a flags field that allows the
> +    filesystem to communicate such information to nfsd. See the Export
> +    Operations Flags section below for more explanation.
> 
> A filehandle fragment consists of an array of 1 or more 4byte words,
> together with a one byte "type".
> @@ -163,3 +168,25 @@ generated by encode_fh, in which case it will have been padded with
> nuls.  Rather, the encode_fh routine should choose a "type" which
> indicates the decode_fh how much of the filehandle is valid, and how
> it should be interpreted.
> +
> +Export Operations Flags
> +-----------------------
> +In addition to the operation vector pointers, struct export_operations also
> +contains a "flags" field that allows the filesystem to communicate to nfsd
> +that it may want to do things differently when dealing with it. The
> +following flags are defined:
> +
> +  EXPORT_OP_NOWCC
> +    RFC 1813 recommends that servers always send weak cache consistency
> +    (WCC) data to the client after each operation. The server should
> +    atomically collect attributes about the inode, do an operation on it,
> +    and then collect the attributes afterward. This allows the client to
> +    skip issuing GETATTRs in some situations but means that the server
> +    is calling vfs_getattr for almost all RPCs. On some filesystems
> +    (particularly those that are clustered or networked) this is expensive
> +    and atomicity is difficult to guarantee. This flag indicates to nfsd
> +    that it should skip providing WCC attributes to the client in NFSv3
> +    replies when doing operations on this filesystem. Consider enabling
> +    this on filesystems that have an expensive ->getattr inode operation,
> +    or when atomicity between pre and post operation attribute collection
> +    is impossible to guarantee.
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 3430d6891e89..8f4c528865c5 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
> 	.encode_fh = nfs_encode_fh,
> 	.fh_to_dentry = nfs_fh_to_dentry,
> 	.get_parent = nfs_get_parent,
> +	.flags = EXPORT_OP_NOWCC,
> };
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 2277f83da250..480342675292 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -206,7 +206,7 @@ static __be32 *
> encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
> {
> 	struct dentry *dentry = fhp->fh_dentry;
> -	if (dentry && d_really_is_positive(dentry)) {
> +	if (!fhp->fh_no_wcc && dentry && d_really_is_positive(dentry)) {
> 	        __be32 err;
> 		struct kstat stat;
> 
> @@ -261,7 +261,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
> 	struct kstat	stat;
> 	__be32 err;
> 
> -	if (fhp->fh_pre_saved)
> +	if (fhp->fh_no_wcc || fhp->fh_pre_saved)
> 		return;
> 
> 	inode = d_inode(fhp->fh_dentry);
> @@ -287,6 +287,9 @@ void fill_post_wcc(struct svc_fh *fhp)
> {
> 	__be32 err;
> 
> +	if (fhp->fh_no_wcc)
> +		return;
> +
> 	if (fhp->fh_post_saved)
> 		printk("nfsd: inode locked twice during operation.\n");
> 
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index c81dbbad8792..0c2ee65e46f3 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -291,6 +291,16 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
> 
> 	fhp->fh_dentry = dentry;
> 	fhp->fh_export = exp;
> +
> +	switch (rqstp->rq_vers) {
> +	case 3:
> +		if (!(dentry->d_sb->s_export_op->flags & EXPORT_OP_NOWCC))
> +			break;
> +		/* Fallthrough */
> +	case 2:
> +		fhp->fh_no_wcc = true;
> +	}
> +
> 	return 0;
> out:
> 	exp_put(exp);
> @@ -559,6 +569,9 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
> 	 */
> 	set_version_and_fsid_type(fhp, exp, ref_fh);
> 
> +	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
> +	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
> +
> 	if (ref_fh == fhp)
> 		fh_put(ref_fh);
> 
> @@ -662,6 +675,7 @@ fh_put(struct svc_fh *fhp)
> 		exp_put(exp);
> 		fhp->fh_export = NULL;
> 	}
> +	fhp->fh_no_wcc = false;
> 	return;
> }
> 
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..fb2b60a76b32 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -35,6 +35,7 @@ typedef struct svc_fh {
> 
> 	bool			fh_locked;	/* inode locked by us */
> 	bool			fh_want_write;	/* remount protection taken */
> +	bool			fh_no_wcc;	/* no wcc data needed */
> 	int			fh_flags;	/* FH flags */
> #ifdef CONFIG_NFSD_V3
> 	bool			fh_post_saved;	/* post-op attrs saved */
> @@ -54,7 +55,6 @@ typedef struct svc_fh {
> 	struct kstat		fh_post_attr;	/* full attrs after operation */
> 	u64			fh_post_change; /* nfsv4 change; see above */
> #endif /* CONFIG_NFSD_V3 */
> -
> } svc_fh;
> #define NFSD4_FH_FOREIGN (1<<0)
> #define SET_FH_FLAG(c, f) ((c)->fh_flags |= (f))
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 3ceb72b67a7a..e7de0103a32e 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,6 +213,8 @@ struct export_operations {
> 			  bool write, u32 *device_generation);
> 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> 			     int nr_iomaps, struct iattr *iattr);
> +#define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
> +	unsigned long	flags;
> };
> 
> extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
> -- 
> 2.28.0
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
                     ` (2 preceding siblings ...)
  2020-11-30 23:11   ` Chuck Lever
@ 2020-12-01  0:14   ` Jeff Layton
  3 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-12-01  0:14 UTC (permalink / raw)
  To: trondmy, J. Bruce Fields, Chuck Lever; +Cc: linux-nfs

On Mon, 2020-11-30 at 16:24 -0500, trondmy@kernel.org wrote:
> From: Jeff Layton <jeff.layton@primarydata.com>
> 
> With NFSv3 nfsd will always attempt to send along WCC data to the
> client. This generally involves saving off the in-core inode information
> prior to doing the operation on the given filehandle, and then issuing a
> vfs_getattr to it after the op.
> 
> Some filesystems (particularly clustered or networked ones) have an
> expensive ->getattr inode operation. Atomicitiy is also often difficult
> or impossible to guarantee on such filesystems. For those, we're best
> off not trying to provide WCC information to the client at all, and to
> simply allow it to poll for that information as needed with a GETATTR
> RPC.
> 
> This patch adds a new flags field to struct export_operations, and
> defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> also adds a blurb about the new flags field and flag to the exporting
> documentation.
> 
> The server will also now skip collecting this information for NFSv2 as
> well, since that info is never used there anyway.
> 
> Note that this patch does not add this flag to any filesystem
> export_operations structures. This was originally developed to allow
> reexporting nfs via nfsd. That code is not (and may never be) suitable
> for merging into mainline.
> 

Probably ought to fix up the above paragraph since we are now merging
this into mainline.

> Other filesystems may want to consider enabling this flag too. It's hard
> to tell however which ones have export operations to enable export via
> knfsd and which ones mostly rely on them for open-by-filehandle support,
> so I'm leaving that up to the individual maintainers to decide. I am
> cc'ing the relevant lists for those filesystems that I think may want to
> consider adding this though.
> 
> Cc: HPDD-discuss@lists.01.org
> Cc: ceph-devel@vger.kernel.org
> Cc: cluster-devel@redhat.com
> Cc: fuse-devel@lists.sourceforge.net
> Cc: ocfs2-devel@oss.oracle.com
> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  Documentation/filesystems/nfs/exporting.rst | 27 +++++++++++++++++++++
>  fs/nfs/export.c                             |  1 +
>  fs/nfsd/nfs3xdr.c                           |  7 ++++--
>  fs/nfsd/nfsfh.c                             | 14 +++++++++++
>  fs/nfsd/nfsfh.h                             |  2 +-
>  include/linux/exportfs.h                    |  2 ++
>  6 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
> index 33d588a01ace..a3e3805833d1 100644
> --- a/Documentation/filesystems/nfs/exporting.rst
> +++ b/Documentation/filesystems/nfs/exporting.rst
> @@ -154,6 +154,11 @@ struct which has the following members:
>      to find potential names, and matches inode numbers to find the correct
>      match.
>  
> 
> 
> 
> +  flags
> +    Some filesystems may need to be handled differently than others. The
> +    export_operations struct also includes a flags field that allows the
> +    filesystem to communicate such information to nfsd. See the Export
> +    Operations Flags section below for more explanation.
>  
> 
> 
> 
>  A filehandle fragment consists of an array of 1 or more 4byte words,
>  together with a one byte "type".
> @@ -163,3 +168,25 @@ generated by encode_fh, in which case it will have been padded with
>  nuls.  Rather, the encode_fh routine should choose a "type" which
>  indicates the decode_fh how much of the filehandle is valid, and how
>  it should be interpreted.
> +
> +Export Operations Flags
> +-----------------------
> +In addition to the operation vector pointers, struct export_operations also
> +contains a "flags" field that allows the filesystem to communicate to nfsd
> +that it may want to do things differently when dealing with it. The
> +following flags are defined:
> +
> +  EXPORT_OP_NOWCC
> +    RFC 1813 recommends that servers always send weak cache consistency
> +    (WCC) data to the client after each operation. The server should
> +    atomically collect attributes about the inode, do an operation on it,
> +    and then collect the attributes afterward. This allows the client to
> +    skip issuing GETATTRs in some situations but means that the server
> +    is calling vfs_getattr for almost all RPCs. On some filesystems
> +    (particularly those that are clustered or networked) this is expensive
> +    and atomicity is difficult to guarantee. This flag indicates to nfsd
> +    that it should skip providing WCC attributes to the client in NFSv3
> +    replies when doing operations on this filesystem. Consider enabling
> +    this on filesystems that have an expensive ->getattr inode operation,
> +    or when atomicity between pre and post operation attribute collection
> +    is impossible to guarantee.
> diff --git a/fs/nfs/export.c b/fs/nfs/export.c
> index 3430d6891e89..8f4c528865c5 100644
> --- a/fs/nfs/export.c
> +++ b/fs/nfs/export.c
> @@ -171,4 +171,5 @@ const struct export_operations nfs_export_ops = {
>  	.encode_fh = nfs_encode_fh,
>  	.fh_to_dentry = nfs_fh_to_dentry,
>  	.get_parent = nfs_get_parent,
> +	.flags = EXPORT_OP_NOWCC,
>  };
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 2277f83da250..480342675292 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -206,7 +206,7 @@ static __be32 *
>  encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
>  {
>  	struct dentry *dentry = fhp->fh_dentry;
> -	if (dentry && d_really_is_positive(dentry)) {
> +	if (!fhp->fh_no_wcc && dentry && d_really_is_positive(dentry)) {
>  	        __be32 err;
>  		struct kstat stat;
>  
> 
> 
> 
> @@ -261,7 +261,7 @@ void fill_pre_wcc(struct svc_fh *fhp)
>  	struct kstat	stat;
>  	__be32 err;
>  
> 
> 
> 
> -	if (fhp->fh_pre_saved)
> +	if (fhp->fh_no_wcc || fhp->fh_pre_saved)
>  		return;
>  
> 
> 
> 
>  	inode = d_inode(fhp->fh_dentry);
> @@ -287,6 +287,9 @@ void fill_post_wcc(struct svc_fh *fhp)
>  {
>  	__be32 err;
>  
> 
> 
> 
> +	if (fhp->fh_no_wcc)
> +		return;
> +
>  	if (fhp->fh_post_saved)
>  		printk("nfsd: inode locked twice during operation.\n");
>  
> 
> 
> 
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index c81dbbad8792..0c2ee65e46f3 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -291,6 +291,16 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
>  
> 
> 
> 
>  	fhp->fh_dentry = dentry;
>  	fhp->fh_export = exp;
> +
> +	switch (rqstp->rq_vers) {
> +	case 3:
> +		if (!(dentry->d_sb->s_export_op->flags & EXPORT_OP_NOWCC))
> +			break;
> +		/* Fallthrough */
> +	case 2:
> +		fhp->fh_no_wcc = true;
> +	}
> +
>  	return 0;
>  out:
>  	exp_put(exp);
> @@ -559,6 +569,9 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>  	 */
>  	set_version_and_fsid_type(fhp, exp, ref_fh);
>  
> 
> 
> 
> +	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
> +	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
> +
>  	if (ref_fh == fhp)
>  		fh_put(ref_fh);
>  
> 
> 
> 
> @@ -662,6 +675,7 @@ fh_put(struct svc_fh *fhp)
>  		exp_put(exp);
>  		fhp->fh_export = NULL;
>  	}
> +	fhp->fh_no_wcc = false;
>  	return;
>  }
>  
> 
> 
> 
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 56cfbc361561..fb2b60a76b32 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -35,6 +35,7 @@ typedef struct svc_fh {
>  
> 
> 
> 
>  	bool			fh_locked;	/* inode locked by us */
>  	bool			fh_want_write;	/* remount protection taken */
> +	bool			fh_no_wcc;	/* no wcc data needed */
>  	int			fh_flags;	/* FH flags */
>  #ifdef CONFIG_NFSD_V3
>  	bool			fh_post_saved;	/* post-op attrs saved */
> @@ -54,7 +55,6 @@ typedef struct svc_fh {
>  	struct kstat		fh_post_attr;	/* full attrs after operation */
>  	u64			fh_post_change; /* nfsv4 change; see above */
>  #endif /* CONFIG_NFSD_V3 */
> -
>  } svc_fh;
>  #define NFSD4_FH_FOREIGN (1<<0)
>  #define SET_FH_FLAG(c, f) ((c)->fh_flags |= (f))
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 3ceb72b67a7a..e7de0103a32e 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -213,6 +213,8 @@ struct export_operations {
>  			  bool write, u32 *device_generation);
>  	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>  			     int nr_iomaps, struct iattr *iattr);
> +#define	EXPORT_OP_NOWCC		(0x1)	/* Don't collect wcc data for NFSv3 replies */
> +	unsigned long	flags;
>  };
>  
> 
> 
> 
>  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,

-- 
Jeff Layton <jlayton@poochiereds.net>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
@ 2020-12-01  0:33     ` Trond Myklebust
  2020-12-01  0:45     ` Trond Myklebust
  2020-12-01 19:46     ` J. Bruce Fields
  2 siblings, 0 replies; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  0:33 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bfields, chuck.lever

On Mon, 2020-11-30 at 17:58 -0500, J. Bruce Fields wrote:
> This is great, thanks:
> 
> On Mon, Nov 30, 2020 at 04:24:50PM -0500, trondmy@kernel.org wrote:
> > From: Jeff Layton <jeff.layton@primarydata.com>
> > 
> > With NFSv3 nfsd will always attempt to send along WCC data to the
> > client. This generally involves saving off the in-core inode
> > information
> > prior to doing the operation on the given filehandle, and then
> > issuing a
> > vfs_getattr to it after the op.
> > 
> > Some filesystems (particularly clustered or networked ones) have an
> > expensive ->getattr inode operation. Atomicitiy is also often
> > difficult
> > or impossible to guarantee on such filesystems. For those, we're
> > best
> > off not trying to provide WCC information to the client at all, and
> > to
> > simply allow it to poll for that information as needed with a
> > GETATTR
> > RPC.
> > 
> > This patch adds a new flags field to struct export_operations, and
> > defines a new EXPORT_OP_NOWCC flag that filesystems can use to
> > indicate
> > that nfsd should not attempt to provide WCC info in NFSv3 replies.
> > It
> > also adds a blurb about the new flags field and flag to the
> > exporting
> > documentation.
> 
> In the v4 case I think it should also turn off the "atomic" flag in
> the
> change_info4 structure that's returned by some operations.
> 
> (Out of curiosity: have you seen this cause actual bugs?)
> 

Not so much bugs, but it definitely causes inefficiencies. The client
has to go to the server for every one of the WCC GETATTR calls, and
needs to serialise that with the operations. It's just a latency hog
for very little gain when you are doing bulk writing.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  2020-11-30 23:05           ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE J. Bruce Fields
@ 2020-12-01  0:39             ` Trond Myklebust
  2020-12-01  2:30               ` J. Bruce Fields
  0 siblings, 1 reply; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  0:39 UTC (permalink / raw)
  To: bfields, trondmy; +Cc: linux-nfs, bfields, chuck.lever

On Mon, 2020-11-30 at 18:05 -0500, J. Bruce Fields wrote:
> On Mon, Nov 30, 2020 at 04:24:54PM -0500, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > If the underlying filesystem times out, then we want knfsd to
> > return
> > NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
> 
> Out of curiosity, what was causing ETIMEDOUT in practice?
> 

If you're only re-exporting NFS from a single server, then it is OK to
use hard mounts. However if you are exporting from multiple servers, or
you have local filesystems that are also being exported by the same
knfsd server, then you usually want to use softerr mounts for NFS so
that operations that take an inordinate amount of time due to temporary
server outages get converted into JUKEBOX/DELAY errors. Otherwise, it
is really simple to cause all the nfsd threads to hang on that one
delinquent server.

> --b.
> 
> > 
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > ---
> >  fs/nfsd/nfsfh.c | 16 ++++++++++++----
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> > index 0c2ee65e46f3..46c86f7bc429 100644
> > --- a/fs/nfsd/nfsfh.c
> > +++ b/fs/nfsd/nfsfh.c
> > @@ -268,12 +268,20 @@ static __be32 nfsd_set_fh_dentry(struct
> > svc_rqst *rqstp, struct svc_fh *fhp)
> >         if (fileid_type == FILEID_ROOT)
> >                 dentry = dget(exp->ex_path.dentry);
> >         else {
> > -               dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
> > -                               data_left, fileid_type,
> > -                               nfsd_acceptable, exp);
> > -               if (IS_ERR_OR_NULL(dentry))
> > +               dentry = exportfs_decode_fh_raw(exp->ex_path.mnt,
> > fid,
> > +                                               data_left,
> > fileid_type,
> > +                                               nfsd_acceptable,
> > exp);
> > +               if (IS_ERR_OR_NULL(dentry)) {
> >                         trace_nfsd_set_fh_dentry_badhandle(rqstp,
> > fhp,
> >                                         dentry ?  PTR_ERR(dentry) :
> > -ESTALE);
> > +                       switch (PTR_ERR(dentry)) {
> > +                       case -ENOMEM:
> > +                       case -ETIMEDOUT:
> > +                               break;
> > +                       default:
> > +                               dentry = ERR_PTR(-ESTALE);
> > +                       }
> > +               }
> >         }
> >         if (dentry == NULL)
> >                 goto out;
> > -- 
> > 2.28.0

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
  2020-12-01  0:33     ` Trond Myklebust
@ 2020-12-01  0:45     ` Trond Myklebust
  2020-12-01  2:28       ` J. Bruce Fields
  2020-12-01 19:46     ` J. Bruce Fields
  2 siblings, 1 reply; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  0:45 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bfields, chuck.lever

On Mon, 2020-11-30 at 17:58 -0500, J. Bruce Fields wrote:
> This is great, thanks:
> 
> On Mon, Nov 30, 2020 at 04:24:50PM -0500, trondmy@kernel.org wrote:
> > From: Jeff Layton <jeff.layton@primarydata.com>
> > 
> > With NFSv3 nfsd will always attempt to send along WCC data to the
> > client. This generally involves saving off the in-core inode
> > information
> > prior to doing the operation on the given filehandle, and then
> > issuing a
> > vfs_getattr to it after the op.
> > 
> > Some filesystems (particularly clustered or networked ones) have an
> > expensive ->getattr inode operation. Atomicitiy is also often
> > difficult
> > or impossible to guarantee on such filesystems. For those, we're
> > best
> > off not trying to provide WCC information to the client at all, and
> > to
> > simply allow it to poll for that information as needed with a
> > GETATTR
> > RPC.
> > 
> > This patch adds a new flags field to struct export_operations, and
> > defines a new EXPORT_OP_NOWCC flag that filesystems can use to
> > indicate
> > that nfsd should not attempt to provide WCC info in NFSv3 replies.
> > It
> > also adds a blurb about the new flags field and flag to the
> > exporting
> > documentation.
> 
> In the v4 case I think it should also turn off the "atomic" flag in
> the
> change_info4 structure that's returned by some operations.
> 

To answer this comment (which I missed earlier): I don't know that we
can turn off WCC for NFSv4. The GETATTR is a completely separate
operation, so the server would have to second-guess what the client
needs it for in order to optimise it away.

That is why this patch is labelled as being an optimisation for NFSv3
only in the comments above.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 23:11   ` Chuck Lever
@ 2020-12-01  0:49     ` Trond Myklebust
  0 siblings, 0 replies; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  0:49 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs, bfields

On Mon, 2020-11-30 at 18:11 -0500, Chuck Lever wrote:
> 
> 
> > On Nov 30, 2020, at 4:24 PM, trondmy@kernel.org wrote:
> > 
> > From: Jeff Layton <jeff.layton@primarydata.com>
> > 
> > With NFSv3 nfsd will always attempt to send along WCC data to the
> > client. This generally involves saving off the in-core inode
> > information
> > prior to doing the operation on the given filehandle, and then
> > issuing a
> > vfs_getattr to it after the op.
> > 
> > Some filesystems (particularly clustered or networked ones) have an
> > expensive ->getattr inode operation. Atomicitiy is also often
> > difficult
> > or impossible to guarantee on such filesystems. For those, we're
> > best
> > off not trying to provide WCC information to the client at all, and
> > to
> > simply allow it to poll for that information as needed with a
> > GETATTR
> > RPC.
> > 
> > This patch adds a new flags field to struct export_operations, and
> > defines a new EXPORT_OP_NOWCC flag that filesystems can use to
> > indicate
> > that nfsd should not attempt to provide WCC info in NFSv3 replies.
> > It
> > also adds a blurb about the new flags field and flag to the
> > exporting
> > documentation.
> > 
> > The server will also now skip collecting this information for NFSv2
> > as
> > well, since that info is never used there anyway.
> > 
> > Note that this patch does not add this flag to any filesystem
> > export_operations structures. This was originally developed to
> > allow
> > reexporting nfs via nfsd. That code is not (and may never be)
> > suitable
> > for merging into mainline.
> > 
> > Other filesystems may want to consider enabling this flag too. It's
> > hard
> > to tell however which ones have export operations to enable export
> > via
> > knfsd and which ones mostly rely on them for open-by-filehandle
> > support,
> > so I'm leaving that up to the individual maintainers to decide. I
> > am
> > cc'ing the relevant lists for those filesystems that I think may
> > want to
> > consider adding this though.
> > 
> > Cc: HPDD-discuss@lists.01.org
> > Cc: ceph-devel@vger.kernel.org
> > Cc: cluster-devel@redhat.com
> > Cc: fuse-devel@lists.sourceforge.net
> > Cc: ocfs2-devel@oss.oracle.com
> > Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> > Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> These seem to apply fine, thanks for resending.
> 
> If you post a v3 to address Bruce's comment, can you also
> address this checkpatch nit?

I'm not seeing how I can address Bruce's comment at this time. I can
send you a v3 that changes the comment to the "fallthrough" obscenity.

> 
> 
> WARNING: Prefer 'fallthrough;' over fallthrough comment
> #154: FILE: fs/nfsd/nfsfh.c:299:
> +               /* Fallthrough */
> 
> total: 0 errors, 1 warnings, 120 lines checked
> 
> NOTE: For some of the reported defects, checkpatch may be able to
>       mechanically convert to the typical style using --fix or --fix-
> inplace.
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  0:45     ` Trond Myklebust
@ 2020-12-01  2:28       ` J. Bruce Fields
  2020-12-01  3:06         ` Trond Myklebust
  0 siblings, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2020-12-01  2:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-nfs, chuck.lever

On Tue, Dec 01, 2020 at 12:45:16AM +0000, Trond Myklebust wrote:
> On Mon, 2020-11-30 at 17:58 -0500, J. Bruce Fields wrote:
> > This is great, thanks:
> > 
> > On Mon, Nov 30, 2020 at 04:24:50PM -0500, trondmy@kernel.org wrote:
> > > From: Jeff Layton <jeff.layton@primarydata.com>
> > > 
> > > With NFSv3 nfsd will always attempt to send along WCC data to the
> > > client. This generally involves saving off the in-core inode
> > > information
> > > prior to doing the operation on the given filehandle, and then
> > > issuing a
> > > vfs_getattr to it after the op.
> > > 
> > > Some filesystems (particularly clustered or networked ones) have an
> > > expensive ->getattr inode operation. Atomicitiy is also often
> > > difficult
> > > or impossible to guarantee on such filesystems. For those, we're
> > > best
> > > off not trying to provide WCC information to the client at all, and
> > > to
> > > simply allow it to poll for that information as needed with a
> > > GETATTR
> > > RPC.
> > > 
> > > This patch adds a new flags field to struct export_operations, and
> > > defines a new EXPORT_OP_NOWCC flag that filesystems can use to
> > > indicate
> > > that nfsd should not attempt to provide WCC info in NFSv3 replies.
> > > It
> > > also adds a blurb about the new flags field and flag to the
> > > exporting
> > > documentation.
> > 
> > In the v4 case I think it should also turn off the "atomic" flag in
> > the
> > change_info4 structure that's returned by some operations.
> > 
> 
> To answer this comment (which I missed earlier): I don't know that we
> can turn off WCC for NFSv4. The GETATTR is a completely separate
> operation, so the server would have to second-guess what the client
> needs it for in order to optimise it away.

In the v4 case, we're setting the "atomic" field in the change_info4
struct to true even though the returned changeattrs clearly aren't
atomic with the operation in the re-export case.

That atomic field is initialized from fh_post_saved, so we just need to
set it to false in the v4 case as we are in the v3 case already.

Yes, it's true, that doesn't allow any optimizations because we still
have to get the post-op change attributes.

But it's a bug we may as well fix while we're here, and it probably
simplifies this patch if anything....

--b.

> 
> That is why this patch is labelled as being an optimisation for NFSv3
> only in the comments above.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  2020-12-01  0:39             ` Trond Myklebust
@ 2020-12-01  2:30               ` J. Bruce Fields
  0 siblings, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2020-12-01  2:30 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, trondmy, linux-nfs, chuck.lever

On Tue, Dec 01, 2020 at 12:39:11AM +0000, Trond Myklebust wrote:
> On Mon, 2020-11-30 at 18:05 -0500, J. Bruce Fields wrote:
> > On Mon, Nov 30, 2020 at 04:24:54PM -0500, trondmy@kernel.org wrote:
> > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > 
> > > If the underlying filesystem times out, then we want knfsd to
> > > return
> > > NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
> > 
> > Out of curiosity, what was causing ETIMEDOUT in practice?
> > 
> 
> If you're only re-exporting NFS from a single server, then it is OK to
> use hard mounts. However if you are exporting from multiple servers, or
> you have local filesystems that are also being exported by the same
> knfsd server, then you usually want to use softerr mounts for NFS so
> that operations that take an inordinate amount of time due to temporary
> server outages get converted into JUKEBOX/DELAY errors. Otherwise, it
> is really simple to cause all the nfsd threads to hang on that one
> delinquent server.

Makes sense, thanks.

In theory the same thing could happen with block devices; longer term I
wonder if it'd make sense to limit how many threads are waiting on a
single backend.

(ACK to the patch, though, that'd be a project for another day.)

--b.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  2:28       ` J. Bruce Fields
@ 2020-12-01  3:06         ` Trond Myklebust
  2020-12-01  3:11           ` bfields
  0 siblings, 1 reply; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  3:06 UTC (permalink / raw)
  To: bfields; +Cc: bfields, linux-nfs, chuck.lever

On Mon, 2020-11-30 at 21:28 -0500, J. Bruce Fields wrote:
> On Tue, Dec 01, 2020 at 12:45:16AM +0000, Trond Myklebust wrote:
> > On Mon, 2020-11-30 at 17:58 -0500, J. Bruce Fields wrote:
> > > This is great, thanks:
> > > 
> > > On Mon, Nov 30, 2020 at 04:24:50PM -0500,
> > > trondmy@kernel.org wrote:
> > > > From: Jeff Layton <jeff.layton@primarydata.com>
> > > > 
> > > > With NFSv3 nfsd will always attempt to send along WCC data to
> > > > the
> > > > client. This generally involves saving off the in-core inode
> > > > information
> > > > prior to doing the operation on the given filehandle, and then
> > > > issuing a
> > > > vfs_getattr to it after the op.
> > > > 
> > > > Some filesystems (particularly clustered or networked ones)
> > > > have an
> > > > expensive ->getattr inode operation. Atomicitiy is also often
> > > > difficult
> > > > or impossible to guarantee on such filesystems. For those,
> > > > we're
> > > > best
> > > > off not trying to provide WCC information to the client at all,
> > > > and
> > > > to
> > > > simply allow it to poll for that information as needed with a
> > > > GETATTR
> > > > RPC.
> > > > 
> > > > This patch adds a new flags field to struct export_operations,
> > > > and
> > > > defines a new EXPORT_OP_NOWCC flag that filesystems can use to
> > > > indicate
> > > > that nfsd should not attempt to provide WCC info in NFSv3
> > > > replies.
> > > > It
> > > > also adds a blurb about the new flags field and flag to the
> > > > exporting
> > > > documentation.
> > > 
> > > In the v4 case I think it should also turn off the "atomic" flag
> > > in
> > > the
> > > change_info4 structure that's returned by some operations.
> > > 
> > 
> > To answer this comment (which I missed earlier): I don't know that
> > we
> > can turn off WCC for NFSv4. The GETATTR is a completely separate
> > operation, so the server would have to second-guess what the client
> > needs it for in order to optimise it away.
> 
> In the v4 case, we're setting the "atomic" field in the change_info4
> struct to true even though the returned changeattrs clearly aren't
> atomic with the operation in the re-export case.
> 
> That atomic field is initialized from fh_post_saved, so we just need
> to
> set it to false in the v4 case as we are in the v3 case already.
> 
> Yes, it's true, that doesn't allow any optimizations because we still
> have to get the post-op change attributes.
> 
> But it's a bug we may as well fix while we're here, and it probably
> simplifies this patch if anything....

I'd argue that is a completely separate issue. This is an optimisation
for NFSv3, whereas what you're talking about is atomicity (whether in
general or for NFSv4 only). I'd therefore prefer to make that a
completely separate export flag so that it can be treated as separate
functionality.

A local filesystem might choose to set the 'non-atomic' flag without
wanting to turn off NFSv3 WCC attributes. Yes, the latter are assumed
to be atomic, but a number of commercial servers do abuse that
assumption in practice.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  3:06         ` Trond Myklebust
@ 2020-12-01  3:11           ` bfields
  2020-12-01  3:16             ` Trond Myklebust
  0 siblings, 1 reply; 28+ messages in thread
From: bfields @ 2020-12-01  3:11 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-nfs, chuck.lever

On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust wrote:
> A local filesystem might choose to set the 'non-atomic' flag without
> wanting to turn off NFSv3 WCC attributes. Yes, the latter are assumed
> to be atomic, but a number of commercial servers do abuse that
> assumption in practice.

What do you mean by abusing that assumption?

I thought that leaving off the post-op attrs was the v3 protocol's way
of saying that it couldn't give you atomic wcc information.

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  3:11           ` bfields
@ 2020-12-01  3:16             ` Trond Myklebust
  2020-12-01  3:23               ` Trond Myklebust
  2020-12-01 15:06               ` J. Bruce Fields
  0 siblings, 2 replies; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  3:16 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bfields, chuck.lever

On Mon, 2020-11-30 at 22:11 -0500, bfields@fieldses.org wrote:
> On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust wrote:
> > A local filesystem might choose to set the 'non-atomic' flag
> > without
> > wanting to turn off NFSv3 WCC attributes. Yes, the latter are
> > assumed
> > to be atomic, but a number of commercial servers do abuse that
> > assumption in practice.
> 
> What do you mean by abusing that assumption?
> 
> I thought that leaving off the post-op attrs was the v3 protocol's
> way
> of saying that it couldn't give you atomic wcc information.
> 

I mean that a number of commercial servers will happily return NFSv3
pre/post-operation WCC information that is not atomic with the
operation that is supposed to be 'protected'. This is, after all, why
the NFSv4 "struct change_info4" added the 'atomic' field in the first
place.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  3:16             ` Trond Myklebust
@ 2020-12-01  3:23               ` Trond Myklebust
  2020-12-01 15:19                 ` bfields
  2020-12-01 15:06               ` J. Bruce Fields
  1 sibling, 1 reply; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01  3:23 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bfields, chuck.lever

On Tue, 2020-12-01 at 03:16 +0000, Trond Myklebust wrote:
> On Mon, 2020-11-30 at 22:11 -0500, bfields@fieldses.org wrote:
> > On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust wrote:
> > > A local filesystem might choose to set the 'non-atomic' flag
> > > without
> > > wanting to turn off NFSv3 WCC attributes. Yes, the latter are
> > > assumed
> > > to be atomic, but a number of commercial servers do abuse that
> > > assumption in practice.
> > 
> > What do you mean by abusing that assumption?
> > 
> > I thought that leaving off the post-op attrs was the v3 protocol's
> > way
> > of saying that it couldn't give you atomic wcc information.
> > 
> 
> I mean that a number of commercial servers will happily return NFSv3
> pre/post-operation WCC information that is not atomic with the
> operation that is supposed to be 'protected'. This is, after all, why
> the NFSv4 "struct change_info4" added the 'atomic' field in the first
> place.

BTW: To be fair, so does knfsd...

At Hammerspace, we had some real problems recently due to XFS exports
returning non-atomic values for the "space used" field. Speculative
preallocation is a real bitch:
https://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  3:16             ` Trond Myklebust
  2020-12-01  3:23               ` Trond Myklebust
@ 2020-12-01 15:06               ` J. Bruce Fields
  1 sibling, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2020-12-01 15:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, linux-nfs, chuck.lever

On Tue, Dec 01, 2020 at 03:16:41AM +0000, Trond Myklebust wrote:
> On Mon, 2020-11-30 at 22:11 -0500, bfields@fieldses.org wrote:
> > On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust wrote:
> > > A local filesystem might choose to set the 'non-atomic' flag
> > > without
> > > wanting to turn off NFSv3 WCC attributes. Yes, the latter are
> > > assumed
> > > to be atomic, but a number of commercial servers do abuse that
> > > assumption in practice.
> > 
> > What do you mean by abusing that assumption?
> > 
> > I thought that leaving off the post-op attrs was the v3 protocol's
> > way
> > of saying that it couldn't give you atomic wcc information.
> > 
> 
> I mean that a number of commercial servers will happily return NFSv3
> pre/post-operation WCC information that is not atomic with the
> operation that is supposed to be 'protected'.

Oh, OK.

But why do *we* want to do that?

If there's some reason a filesystem really needs NFSv3 post-operation
WCC information without providing an atomic guarantee, they can make
that argument when the filesystem's merged.

Separating these two flags on the off chance a future filesystem may
want to violate the protocol in this way seems unnecessary.

--b.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01  3:23               ` Trond Myklebust
@ 2020-12-01 15:19                 ` bfields
  2020-12-01 15:50                   ` Trond Myklebust
  0 siblings, 1 reply; 28+ messages in thread
From: bfields @ 2020-12-01 15:19 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, bfields, chuck.lever

On Tue, Dec 01, 2020 at 03:23:00AM +0000, Trond Myklebust wrote:
> On Tue, 2020-12-01 at 03:16 +0000, Trond Myklebust wrote:
> > On Mon, 2020-11-30 at 22:11 -0500, bfields@fieldses.org wrote:
> > > On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust wrote:
> > > > A local filesystem might choose to set the 'non-atomic' flag
> > > > without
> > > > wanting to turn off NFSv3 WCC attributes. Yes, the latter are
> > > > assumed
> > > > to be atomic, but a number of commercial servers do abuse that
> > > > assumption in practice.
> > > 
> > > What do you mean by abusing that assumption?
> > > 
> > > I thought that leaving off the post-op attrs was the v3 protocol's
> > > way
> > > of saying that it couldn't give you atomic wcc information.
> > > 
> > 
> > I mean that a number of commercial servers will happily return NFSv3
> > pre/post-operation WCC information that is not atomic with the
> > operation that is supposed to be 'protected'. This is, after all, why
> > the NFSv4 "struct change_info4" added the 'atomic' field in the first
> > place.
> 
> BTW: To be fair, so does knfsd...
> 
> At Hammerspace, we had some real problems recently due to XFS exports
> returning non-atomic values for the "space used" field. Speculative
> preallocation is a real bitch:
> https://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F

So you think xfs should omit v3 post-operation attributes and still set
the atomic bit in v4 replies?

Would that have helped in the cases you saw?  It seems like speculative
preallocation isn't a problem with atomicity exactly--it couldn't be
avoided by applications cooperating with some locking scheme, for
example, if I'm understanding right.

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-12-01 15:19                 ` bfields
@ 2020-12-01 15:50                   ` Trond Myklebust
  0 siblings, 0 replies; 28+ messages in thread
From: Trond Myklebust @ 2020-12-01 15:50 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs, bfields, chuck.lever

On Tue, 2020-12-01 at 10:19 -0500, bfields@fieldses.org wrote:
> On Tue, Dec 01, 2020 at 03:23:00AM +0000, Trond Myklebust wrote:
> > On Tue, 2020-12-01 at 03:16 +0000, Trond Myklebust wrote:
> > > On Mon, 2020-11-30 at 22:11 -0500, bfields@fieldses.org wrote:
> > > > On Tue, Dec 01, 2020 at 03:06:46AM +0000, Trond Myklebust
> > > > wrote:
> > > > > A local filesystem might choose to set the 'non-atomic' flag
> > > > > without
> > > > > wanting to turn off NFSv3 WCC attributes. Yes, the latter are
> > > > > assumed
> > > > > to be atomic, but a number of commercial servers do abuse
> > > > > that
> > > > > assumption in practice.
> > > > 
> > > > What do you mean by abusing that assumption?
> > > > 
> > > > I thought that leaving off the post-op attrs was the v3
> > > > protocol's
> > > > way
> > > > of saying that it couldn't give you atomic wcc information.
> > > > 
> > > 
> > > I mean that a number of commercial servers will happily return
> > > NFSv3
> > > pre/post-operation WCC information that is not atomic with the
> > > operation that is supposed to be 'protected'. This is, after all,
> > > why
> > > the NFSv4 "struct change_info4" added the 'atomic' field in the
> > > first
> > > place.
> > 
> > BTW: To be fair, so does knfsd...
> > 
> > At Hammerspace, we had some real problems recently due to XFS
> > exports
> > returning non-atomic values for the "space used" field. Speculative
> > preallocation is a real bitch:
> > https://xfs.org/index.php/XFS_FAQ#Q:_What_is_speculative_preallocation.3F
> 
> So you think xfs should omit v3 post-operation attributes and still
> set
> the atomic bit in v4 replies?
> 
> Would that have helped in the cases you saw?  It seems like
> speculative
> preallocation isn't a problem with atomicity exactly--it couldn't be
> avoided by applications cooperating with some locking scheme, for
> example, if I'm understanding right.
> 

Locking doesn't help. This isn't even something that needs multiple
clients. XFS will happily give the client that sends the WRITE one
answer for 'space used' in the WCC attributes, and then a different
answer in a subsequent GETATTR (no change in mtime, ctime or change
attribute) once the speculative allocation has been resolved.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
  2020-12-01  0:33     ` Trond Myklebust
  2020-12-01  0:45     ` Trond Myklebust
@ 2020-12-01 19:46     ` J. Bruce Fields
  2 siblings, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2020-12-01 19:46 UTC (permalink / raw)
  To: trondmy; +Cc: J. Bruce Fields, Chuck Lever, linux-nfs

On Mon, Nov 30, 2020 at 05:58:42PM -0500, J. Bruce Fields wrote:
> This is great, thanks:
> 
> On Mon, Nov 30, 2020 at 04:24:50PM -0500, trondmy@kernel.org wrote:
> > From: Jeff Layton <jeff.layton@primarydata.com>
> > 
> > With NFSv3 nfsd will always attempt to send along WCC data to the
> > client. This generally involves saving off the in-core inode information
> > prior to doing the operation on the given filehandle, and then issuing a
> > vfs_getattr to it after the op.
> > 
> > Some filesystems (particularly clustered or networked ones) have an
> > expensive ->getattr inode operation. Atomicitiy is also often difficult
> > or impossible to guarantee on such filesystems. For those, we're best
> > off not trying to provide WCC information to the client at all, and to
> > simply allow it to poll for that information as needed with a GETATTR
> > RPC.
> > 
> > This patch adds a new flags field to struct export_operations, and
> > defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
> > that nfsd should not attempt to provide WCC info in NFSv3 replies. It
> > also adds a blurb about the new flags field and flag to the exporting
> > documentation.
> 
> In the v4 case I think it should also turn off the "atomic" flag in the
> change_info4 structure that's returned by some operations.

And then it looks to me like all you need is something like the
following, no need for a fh_no_wcc field or anything, just skip the
stuff you don't want in fill_post_wcc when the flag is set:

--b.

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index bd4edf904bba..0b51f9dd0752 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -289,13 +289,16 @@ void fill_post_wcc(struct svc_fh *fhp)
 {
 	bool v4 = (fhp->fh_maxsize == NFS4_FHSIZE);
 	struct inode *inode = d_inode(fhp->fh_dentry);
+	struct export_operations *ops = inode->i_sb->s_export_op;
 
 	if (fhp->fh_post_saved)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	fhp->fh_post_saved = true;
 
-	if (!v4 || !inode->i_sb->s_export_op->fetch_iversion) {
+	if (ops->flags & EXPORT_OP_NOWCC)
+		fhp->fh_post_saved = false;
+	else if (!v4 || !ops->fetch_iversion) {
 		__be32 err = fh_getattr(fhp, &fhp->fh_post_attr);
 		if (err) {
 			fhp->fh_post_saved = false;

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-12-01 19:47 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-30 21:24 [PATCH 0/6] Patches to support NFS re-exporting trondmy
2020-11-30 21:24 ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations trondmy
2020-11-30 21:24   ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking trondmy
2020-11-30 21:24     ` [PATCH 3/6] nfsd: close cached files prior to a REMOVE or RENAME that would replace target trondmy
2020-11-30 21:24       ` [PATCH 4/6] exportfs: Add a function to return the raw output from fh_to_dentry() trondmy
2020-11-30 21:24         ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE trondmy
2020-11-30 21:24           ` [PATCH 6/6] nfsd: Set PF_LOCAL_THROTTLE on local filesystems only trondmy
2020-11-30 23:05           ` [PATCH 5/6] nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE J. Bruce Fields
2020-12-01  0:39             ` Trond Myklebust
2020-12-01  2:30               ` J. Bruce Fields
2020-11-30 22:59     ` [PATCH 2/6] nfsd: allow filesystems to opt out of subtree checking J. Bruce Fields
2020-11-30 22:58   ` [PATCH 1/6] nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations J. Bruce Fields
2020-12-01  0:33     ` Trond Myklebust
2020-12-01  0:45     ` Trond Myklebust
2020-12-01  2:28       ` J. Bruce Fields
2020-12-01  3:06         ` Trond Myklebust
2020-12-01  3:11           ` bfields
2020-12-01  3:16             ` Trond Myklebust
2020-12-01  3:23               ` Trond Myklebust
2020-12-01 15:19                 ` bfields
2020-12-01 15:50                   ` Trond Myklebust
2020-12-01 15:06               ` J. Bruce Fields
2020-12-01 19:46     ` J. Bruce Fields
2020-11-30 23:11   ` Chuck Lever
2020-12-01  0:49     ` Trond Myklebust
2020-12-01  0:14   ` Jeff Layton
2020-11-30 21:40 ` [PATCH 0/6] Patches to support NFS re-exporting Chuck Lever
2020-11-30 21:51   ` Trond Myklebust

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.