Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export

From: Amir Goldstein <amir73il@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: Roman Mamedov <rm@romanrm.net>,
	Goffredo Baroncelli <kreijack@libero.it>,
	Christoph Hellwig <hch@infradead.org>,
	Josef Bacik <josef@toxicpanda.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Chuck Lever <chuck.lever@oracle.com>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Fri, 20 Aug 2021 09:23:14 +0300	[thread overview]
Message-ID: <CAOQ4uxjTc7amniZtSPWQbysD5YspE5Z=+5T1fC-bFqh56Y3_qQ@mail.gmail.com> (raw)
In-Reply-To: <162942971499.9892.4386273250573040668@noble.neil.brown.name>

On Fri, Aug 20, 2021 at 6:22 AM NeilBrown <neilb@suse.de> wrote:
>
> On Thu, 19 Aug 2021, Amir Goldstein wrote:
> > On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote:
> > >
> > > There are a few ways to handle this more gracefully.
> > >
> > > 1/ We could get btrfs to hand out new filehandles as well as new inode
> > > numbers, but still accept the old filehandles.  Then we could make the
> > > inode number reported be based on the filehandle.  This would be nearly
> > > seamless but rather clumsy to code.  I'm not *very* keen on this idea,
> > > but it is worth keeping in mind.
> > >
> >
> > So objects would change their inode number after nfs inode cache is
> > evicted and while nfs filesystem is mounted. That does not sound ideal.
>
> No.  Almost all filehandle lookups happen in the context of some other
> filehandle.  If the provided context is an old-style filehandle, we
> provide an old-style filehandle for the lookup.  There is already code
> in nfsd to support this (as we have in the past changed how filesystems
> are identified).
>

I see. This is nice!
It almost sounds like "inode mapped mounts" :-)

> It would only be if the mountpoint filehandle (which is fetched without
> that context) went out of cache that inode numbers would change.  That
> would mean that the filesystem (possibly an automount) was unmounted.
> When it was remounted it could have a different device number anyway, so
> having different inode numbers would be of little consequence.
>
> >
> > But I am a bit confused about the problem.
> > If the export is of the btrfs root, then nfs client cannot access any
> > subvolumes (right?) - that was the bug report, so the value of inode
> > numbers in non-root subvolumes is not an issue.
>
> Not correct.  All objects in the filesystem are fully accessible.  The
> only problem is that some pairs of objects have the same inode number.
> This causes some programs like 'find' and 'du' to behave differently to
> expectations.  They will refuse to even look in a subvolume, because it
> looks like doing so could cause an infinite loop.  The values of inode
> numbers in non-root subvolumes is EXACTLY the issue.
>

Thanks for clarifying.

> > If export is of non-root subvolume, then why bother changing anything
> > at all? Is there a need to traverse into sub-sub-volumes?
> >
> > > 2/ We could add a btrfs mount option to control whether the uniquifier
> > > was set or not.  This would allow the sysadmin to choose when to manage
> > > any breakage.  I think this is my preference, but Josef has declared an
> > > aversion to mount options.
> > >
> > > 3/ We could add a module parameter to nfsd to control whether the
> > > uniquifier is merged in.  This again gives the sysadmin control, and it
> > > can be done despite any aversion from btrfs maintainers.  But I'd need
> > > to overcome any aversion from the nfsd maintainers, and I don't know how
> > > strong that would be yet. (A new export option isn't really appropriate.
> > > It is much more work to add an export option than the add a mount option).
> > >
> >
> > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol"
> > export option would have been a nice way to describe and opt-in to this new
> > functionality.
> >
> > But let's consider for a moment the consequences of enabling this functionality
> > automatically whenever exporting a btrfs root volume without "crossmnt":
> >
> > 1. Objects inside a subvol that are inaccessible(?) with current
> > nfs/nfsd without
> >     "crossmnt" will become accessible after enabling the feature -
> > this will match
> >     the user experience of accessing btrfs on the host
>
> Not correct - as above.
>
> > 2. The inode numbers of the newly accessible objects would not match the inode
> >     numbers on the host fs (no big deal?)
>
> Unlikely to be a problem.  Inode numbers have no meaning beyond the facts
> that:
>   - they are stable for the lifetime of the object
>   - they are unique within a filesystem (except btrfs lies about
>     filesystems)
>   - they are not zero
>
> The facts only need to be equally true on the NFS server and client..
>
> > 3. The inode numbers of objects in a snapshot would not match the inode
> >     numbers of the original (pre-snapshot) objects (acceptable tradeoff for
> >     being able to access the snapshot objects without bloating /proc/mounts?)
>
> This also should not be a problem.  Files in different snapshots are
> different things that happen to share storage (like reflinks).
> Comparing inode numbers between places which report different st_dev
> does not fit within the meaning of inode numbers.
>
> > 4. The inode numbers of objects in a subvol observed via this "cross-subvol"
> >     export would not match the inode numbers of the same objects observed
> >     via an individual subvol export
>
> The device number would differ too, so the relative values of the inode
> numbers would be irrelevant.
>
> > 5. st_ino conflicts are possible when multiplexing subvol id and inode number.
> >     overlayfs resolved those conflicts by allocating an inode number from a
> >     reserved non-persistent inode range, which may cause objects to change
> >     their inode number during the lifetime on the filesystem (sensible
> > tradeoff?)
> >
> > I think that #4 is a bit hard to swallow and #3 is borderline acceptable...
> > Both and quite hard to document and to set expectations as a non-opt-in
> > change of behavior when exporting btrfs root.
> >
> > IMO, an nfsd module parameter will give some control and therefore is
> > a must, but it won't make life easier to document and set user expectations
> > when the semantics are not clearly stated in the exports table.
> >
> > You claim that "A new export option isn't really appropriate."
> > but your only argument is that "It is much more work to add
> > an export option than the add a mount option".
> >
> > With all due respect, for this particular challenge with all the
> > constraints involved, this sounds like a pretty weak argument.
> >
> > Surely, adding an export option is easier than slowly changing all
> > userspace tools to understand subvolumes? a solution that you had
> > previously brought up.
> >
> > Can you elaborate some more about your aversion to a new
> > export option.
>
> Export options are bits in a 32bit word - so both user-space and kernel
> need to agree on names for them.  We are currently using 18, so there is
> room to grow.  It is a perfectly reasonable way to implement sensible
> features.  It is, I think, a poor way to implement hacks to work around
> misfeatures in filesystems.
>
> This is the core of my dislike for adding an export option.  Using one
> effectively admits that what btrfs is doing is a valid thing to do.  I
> don't think it is.  I don't think we want any other filesystem developer
> to think that they can emulate the behaviour because support is already
> provided.
>
> If we add any configuration to support btrfs, I would much prefer it to
> be implemented in fs/btrfs, and if not, then with loud warnings that it
> works around a deficiency in btrfs.
>   /sys/modules/nfsd/parameters/btrfs_export_workaround
>

Thanks for clarifying.
I now understand how "hacky" the workaround in nfsd is.

Naive question: could this behavior be implemented in btrfs as you
prefer, but in a way that only serves nfsd and NEW nfs mounts for
that matter (i.e. only NEW filehandles)?

Meaning passing some hint in ->getattr() and ->iterate_shared(),
sort of like idmapped mount does for uid/gid.

Thanks,
Amir.