All of lore.kernel.org
 help / color / mirror / Atom feed
From: "J.  Bruce Fields" <bfields@fieldses.org>
To: NeilBrown <neilb@suse.de>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	linux-nfs@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Thu, 26 Aug 2021 16:19:16 -0400	[thread overview]
Message-ID: <20210826201916.GB10730@fieldses.org> (raw)
In-Reply-To: <162995778427.7591.11743795294299207756@noble.neil.brown.name>

On Thu, Aug 26, 2021 at 04:03:04PM +1000, NeilBrown wrote:
> 
> [[ Hi Bruce and Chuck,
>    I've rebased this patch on the earlier patch I sent which allows
>    me to use the name "fh_flags".  I've also added a missing #include.
>    I've added the 'Acked-by' which Joesf provided earlier for the
>    btrfs-part.  I don't have an 'ack' for the  stat.h part, but no-one
>    has complained wither.
>    I think it is as ready as it can be, and am keen to know what you
>    think.
>    I'm not *very* keen on testing s_magic in nfsd code (though we
>    already have a couple of such tests in nfs3proc.c), but it does have
>    the advantage of ensuring that no other filesystem can use this
>    functionality without landing a patch in fs/nfsd/.
>  
>    Thanks for any review that you can provide,
>    NeilBrown
> ]]

This seems hairy, but *somebody* has hated every single solution
proposed, so, argh, I don't know, maybe it's best.

There was a ton of "but why can't we just..." in previous threads, could
we include URLs for those and/or the lwn articles?  E.g.:

	https://lore.kernel.org/linux-nfs/162742539595.32498.13687924366155737575.stgit@noble.brown/#b
	https://lwn.net/Articles/866709/

> BTRFS does not provide unique inode numbers across a filesystem.
> It only provides unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
> 
> nfsd cannot use these varying synthetic device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit mount
> on the client (this is partially supported in practice, but violates the
> protocol specification and has problems in some edge cases).
> 
> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
> 
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
> 
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifer), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
> 
> If an upgrade of the NFS server caused all inode numbers in an exportfs
> BTRFS filesystem to appear to the client to change, the client may not
> handle this well.  The Linux client will cause any open files to become
> 'stale'.  If the mount point changed inode number, the whole mount would
> become inaccessible.
> 
> To avoid this, an unused byte in the filehandle (fh_auth) has been
> repurposed as "fh_flags".  The new behaviour of uniquifying inode number
> is only activated when a new flag is set.
> 
> NFSD will only set this flag in filehandles it reports if the filehandle
> of the parent (provided by the client) contains the bit, or if
>  - the filehandle for the parent is not provided or is for a different
>    export and
>  - the filehandle refers to a BTRFS filesystem.
> 
> Thus if you have a BTRFS filesystem originally mounted from a server
> without this patch, the flag will never be set and the current behaviour
> will continue.  Only once you re-mount the filesystem (or the filesystem
> is re-auto-mounted) will the inode numbers change.  When that happens,
> it is likely that the filesystem st_dev number seen on the client will
> change anyway.
> 
> Acked-by: Josef Bacik <josef@toxicpanda.com> (for BTFS change)

s/BTFS/BTRFS/.

> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/btrfs/inode.c     |  4 ++++
>  fs/nfsd/nfs3xdr.c    | 15 ++++++++++++++-
>  fs/nfsd/nfs4xdr.c    |  7 ++++---
>  fs/nfsd/nfsfh.c      | 14 ++++++++++++--
>  fs/nfsd/nfsfh.h      | 34 +++++++++++++++++++++++++++++++---
>  fs/nfsd/xdr3.h       |  2 ++
>  include/linux/stat.h | 18 ++++++++++++++++++
>  7 files changed, 85 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 06f9f167222b..d35e2a30f25f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>  	generic_fillattr(&init_user_ns, inode, stat);
>  	stat->dev = BTRFS_I(inode)->root->anon_dev;
>  
> +	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
> +		stat->ino_uniquifier =
> +			swab64(BTRFS_I(inode)->root->root_key.objectid);
> +
>  	spin_lock(&BTRFS_I(inode)->lock);
>  	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
>  	inode_bytes = inode_get_bytes(inode);
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 3d37923afb06..5e2d5c352ecd 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  {
>  	struct user_namespace *userns = nfsd_user_namespace(rqstp);
>  	__be32 *p;
> +	u64 ino;
>  	u64 fsid;
>  
>  	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
> @@ -377,7 +378,8 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  	p = xdr_encode_hyper(p, fsid);
>  
>  	/* fileid */
> -	p = xdr_encode_hyper(p, stat->ino);
> +	ino = nfsd_uniquify_ino(fhp, stat);
> +	p = xdr_encode_hyper(p, ino);
>  
>  	p = encode_nfstime3(p, &stat->atime);
>  	p = encode_nfstime3(p, &stat->mtime);
> @@ -1151,6 +1153,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
>  	if (xdr_stream_encode_item_present(xdr) < 0)
>  		return false;
>  	/* fileid */
> +	if (!resp->dir_have_uniquifier) {
> +		struct kstat stat;
> +		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
> +			resp->dir_ino_uniquifier =
> +				nfsd_ino_uniquifier(&resp->fh, &stat);
> +		else
> +			resp->dir_ino_uniquifier = 0;
> +		resp->dir_have_uniquifier = true;

This took me a minute.  So we're assuming the uniquifier stays the same
across a directory and its children (because you can't hard link across
subvolumes), and this code is just caching the uniquifier for use across
the directory--is that right?

> +	}
> +	if (resp->dir_ino_uniquifier != ino)
> +		ino ^= resp->dir_ino_uniquifier;

I guess this check (here and in nfsd_uniquify_ino) is just to prevent
returning inode number zero?

--b.

>  	if (xdr_stream_encode_u64(xdr, ino) < 0)
>  		return false;
>  	/* name */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index a54b2845473b..6e31f6286e0b 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3114,10 +3114,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  					fhp->fh_handle.fh_size);
>  	}
>  	if (bmval0 & FATTR4_WORD0_FILEID) {
> +		u64 ino = nfsd_uniquify_ino(fhp, &stat);
>  		p = xdr_reserve_space(xdr, 8);
>  		if (!p)
>  			goto out_resource;
> -		p = xdr_encode_hyper(p, stat.ino);
> +		p = xdr_encode_hyper(p, ino);
>  	}
>  	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
>  		p = xdr_reserve_space(xdr, 8);
> @@ -3274,7 +3275,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  
>  		p = xdr_reserve_space(xdr, 8);
>  		if (!p)
> -                	goto out_resource;
> +			goto out_resource;
>  		/*
>  		 * Get parent's attributes if not ignoring crossmount
>  		 * and this is the root of a cross-mounted filesystem.
> @@ -3284,7 +3285,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  			err = get_parent_attributes(exp, &parent_stat);
>  			if (err)
>  				goto out_nfserr;
> -			ino = parent_stat.ino;
> +			ino = nfsd_uniquify_ino(fhp, &parent_stat);
>  		}
>  		p = xdr_encode_hyper(p, ino);
>  	}
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 7695c0f1eefe..18bb139f8bfe 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -9,6 +9,7 @@
>   */
>  
>  #include <linux/exportfs.h>
> +#include <linux/magic.h>
>  
>  #include <linux/sunrpc/svcauth_gss.h>
>  #include "nfsd.h"
> @@ -173,7 +174,7 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
>  
>  	if (--data_left < 0)
>  		return error;
> -	if (fh->fh_auth_type != 0)
> +	if ((fh->fh_flags & ~NFSD_FH_FLAG_ALL) != 0)
>  		return error;
>  	len = key_len(fh->fh_fsid_type) / 4;
>  	if (len == 0)
> @@ -532,6 +533,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>  
>  	struct inode * inode = d_inode(dentry);
>  	dev_t ex_dev = exp_sb(exp)->s_dev;
> +	u8 flags = 0;
>  
>  	dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
>  		MAJOR(ex_dev), MINOR(ex_dev),
> @@ -548,6 +550,14 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>  	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
>  	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
>  
> +	if (ref_fh && ref_fh->fh_export == exp) {
> +		flags = ref_fh->fh_handle.fh_flags;
> +	} else {
> +		/* Set flags as needed */
> +		if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
> +			flags |= NFSD_FH_FLAG_INO_UNIQUIFY;
> +	}
> +
>  	if (ref_fh == fhp)
>  		fh_put(ref_fh);
>  
> @@ -565,7 +575,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>  
>  	fhp->fh_handle.fh_size =
>  		key_len(fhp->fh_handle.fh_fsid_type) + 4;
> -	fhp->fh_handle.fh_auth_type = 0;
> +	fhp->fh_handle.fh_flags = flags;
>  
>  	mk_fsid(fhp->fh_handle.fh_fsid_type,
>  		fhp->fh_handle.fh_fsid,
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f36234c474dc..bbc7ddd34143 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -18,11 +18,17 @@
>   * The file handle starts with a sequence of four-byte words.
>   * The first word contains a version number (1) and three descriptor bytes
>   * that tell how the remaining 3 variable length fields should be handled.
> - * These three bytes are auth_type, fsid_type and fileid_type.
> + * These three bytes are flags, fsid_type and fileid_type.
>   *
>   * All four-byte values are in host-byte-order.
>   *
> - * The auth_type field is deprecated and must be set to 0.
> + * The flags field (previously auth_type) can be used when nfsd behaviour
> + * needs to change in a non-compatible way, usually for some specific
> + * filesystem.  Flags should only be set in filehandles for filesystems which
> + * need them.
> + * Current values:
> + *   1  -  BTRFS only.  Cause stat->ino_uniquifier to be used to improve inode
> + *         number uniqueness.
>   *
>   * The fsid_type identifies how the filesystem (or export point) is
>   *    encoded.
> @@ -53,7 +59,7 @@ struct knfsd_fh {
>  		__u32			fh_raw[NFS4_FHSIZE/4];
>  		struct {
>  			__u8		fh_version;	/* == 1 */
> -			__u8		fh_auth_type;	/* deprecated */
> +			__u8		fh_flags;
>  			__u8		fh_fsid_type;
>  			__u8		fh_fileid_type;
>  			__u32		fh_fsid[]; /* flexible-array member */
> @@ -131,6 +137,28 @@ enum fsid_source {
>  };
>  extern enum fsid_source fsid_source(const struct svc_fh *fhp);
>  
> +enum nfsd_fh_flags {
> +	NFSD_FH_FLAG_INO_UNIQUIFY = 1,	/* BTRFS only */
> +
> +	NFSD_FH_FLAG_ALL = 1
> +};
> +
> +static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp,
> +				      const struct kstat *stat)
> +{
> +	if (fhp->fh_handle.fh_flags & NFSD_FH_FLAG_INO_UNIQUIFY)
> +		return stat->ino_uniquifier;
> +	return 0;
> +}
> +
> +static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp,
> +				    const struct kstat *stat)
> +{
> +	u64 u = nfsd_ino_uniquifier(fhp, stat);
> +	if (u != stat->ino)
> +		return stat->ino ^ u;
> +	return stat->ino;
> +}
>  
>  /*
>   * This might look a little large to "inline" but in all calls except
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 933008382bbe..d9b6c8314bbb 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -179,6 +179,8 @@ struct nfsd3_readdirres {
>  	struct xdr_buf		dirlist;
>  	struct svc_fh		scratch;
>  	struct readdir_cd	common;
> +	u64			dir_ino_uniquifier;
> +	bool			dir_have_uniquifier;
>  	unsigned int		cookie_offset;
>  	struct svc_rqst *	rqstp;
>  
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index fff27e603814..0f3f74d302f8 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -46,6 +46,24 @@ struct kstat {
>  	struct timespec64 btime;			/* File creation time */
>  	u64		blocks;
>  	u64		mnt_id;
> +	/*
> +	 * BTRFS does not provide unique inode numbers within a filesystem,
> +	 * depending on a synthetic 'dev' to provide uniqueness.
> +	 * NFSd cannot make use of this 'dev' number so clients often see
> +	 * duplicate inode numbers.
> +	 * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem
> +	 * has created a great many inodes.
> +	 * It puts another number in ino_uniquifier which:
> +	 * - has most entropy in the high bits
> +	 * - is different precisely when 'dev' is different
> +	 * - is stable across unmount/remount
> +	 * NFSd can xor this with 'ino' to get a substantially more unique
> +	 * number for reporting to the client.
> +	 * The ino_uniquifier for a directory can reasonably be applied
> +	 * to inode numbers reported by the readdir filldir callback.
> +	 * It is NOT currently exported to user-space.
> +	 */
> +	u64		ino_uniquifier;
>  };
>  
>  #endif
> -- 
> 2.32.0
> 
> 

  reply	other threads:[~2021-08-26 20:19 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-26  4:28 [PATCH] NFSD: drop support for ancient file-handles NeilBrown
2021-08-26  6:03 ` [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-26 20:19   ` J.  Bruce Fields [this message]
2021-08-26 22:10     ` NeilBrown
2021-08-27 14:53       ` Frank Filz
2021-08-27 22:57         ` NeilBrown
2021-08-27 23:46           ` Frank Filz
2021-08-27 23:55             ` NeilBrown
2021-08-28  2:21               ` Frank Filz
2021-08-27 18:32       ` J.  Bruce Fields
2021-08-27 23:01         ` NeilBrown
2021-08-27 16:20   ` Christoph Hellwig
2021-08-27 23:05     ` NeilBrown
2021-08-28  7:09       ` Christoph Hellwig
2021-08-31  4:59         ` NeilBrown
2021-09-01  7:20           ` Christoph Hellwig
2021-09-01 15:22             ` J. Bruce Fields
2021-09-02  4:14               ` NeilBrown
2021-09-05 16:07                 ` J. Bruce Fields
2021-09-06  1:29                   ` NeilBrown
2021-09-11 14:12                     ` Amir Goldstein
2021-09-13  0:43                       ` NeilBrown
2021-09-13 10:04                         ` Amir Goldstein
2021-09-13 22:59                           ` NeilBrown
2021-09-14  5:45                             ` Amir Goldstein
2021-09-20 22:09                               ` NeilBrown
2021-09-02  7:11               ` Christoph Hellwig
2021-09-02  4:06             ` NeilBrown
2021-09-02  7:16               ` Christoph Hellwig
2021-09-02  7:53                 ` Miklos Szeredi
2021-09-02 14:16                   ` Frank Filz
2021-09-02 23:02                     ` NeilBrown
2021-08-26 14:10 ` [PATCH] NFSD: drop support for ancient file-handles Chuck Lever III
2021-08-26 21:38   ` NeilBrown
2021-08-26 14:51 ` J.  Bruce Fields
2021-08-26 21:41   ` NeilBrown
2021-08-27 15:15 ` Christoph Hellwig
2021-08-27 23:24   ` NeilBrown
2021-08-31  4:41   ` [PATCH 1/2 v2] NFSD: drop support for ancient filehandles NeilBrown
2021-08-31  4:42     ` [PATCH 2/2] NFSD: simplify struct nfsfh NeilBrown
2021-09-01  7:44       ` Christoph Hellwig
2021-09-01  7:44     ` [PATCH 1/2 v2] NFSD: drop support for ancient filehandles Christoph Hellwig
2021-09-01 14:21       ` Chuck Lever III
2021-09-02  1:14         ` [PATCH 1/3 v3] NFSD: move filehandle format declarations out of "uapi" NeilBrown
2021-09-02  1:15           ` [PATCH 2/3 v3] NFSD: drop support for ancient filehandles NeilBrown
2021-09-02  1:16             ` [PATCH 3/3 v3] NFSD: simplify struct nfsfh NeilBrown
2021-09-02  7:22             ` [PATCH 2/3 v3] NFSD: drop support for ancient filehandles Christoph Hellwig
2021-09-02  7:21           ` [PATCH 1/3 v3] NFSD: move filehandle format declarations out of "uapi" Christoph Hellwig
2021-09-23 21:21           ` Bruce Fields
2021-09-25  4:21             ` NeilBrown
  -- strict thread matches above, loose matches on Subject: below --
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-08-13  1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-15  7:39   ` Goffredo Baroncelli
2021-08-15 19:35     ` Roman Mamedov
2021-08-15 22:17       ` NeilBrown
2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
2021-08-23  8:17           ` kernel test robot
2021-08-23  8:17             ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210826201916.GB10730@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.