linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Amir Goldstein <amir73il@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: Roman Mamedov <rm@romanrm.net>,
	Goffredo Baroncelli <kreijack@libero.it>,
	Christoph Hellwig <hch@infradead.org>,
	Josef Bacik <josef@toxicpanda.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Chuck Lever <chuck.lever@oracle.com>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Fri, 20 Aug 2021 09:23:14 +0300	[thread overview]
Message-ID: <CAOQ4uxjTc7amniZtSPWQbysD5YspE5Z=+5T1fC-bFqh56Y3_qQ@mail.gmail.com> (raw)
In-Reply-To: <162942971499.9892.4386273250573040668@noble.neil.brown.name>

On Fri, Aug 20, 2021 at 6:22 AM NeilBrown <neilb@suse.de> wrote:
>
> On Thu, 19 Aug 2021, Amir Goldstein wrote:
> > On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote:
> > >
> > > There are a few ways to handle this more gracefully.
> > >
> > > 1/ We could get btrfs to hand out new filehandles as well as new inode
> > > numbers, but still accept the old filehandles.  Then we could make the
> > > inode number reported be based on the filehandle.  This would be nearly
> > > seamless but rather clumsy to code.  I'm not *very* keen on this idea,
> > > but it is worth keeping in mind.
> > >
> >
> > So objects would change their inode number after nfs inode cache is
> > evicted and while nfs filesystem is mounted. That does not sound ideal.
>
> No.  Almost all filehandle lookups happen in the context of some other
> filehandle.  If the provided context is an old-style filehandle, we
> provide an old-style filehandle for the lookup.  There is already code
> in nfsd to support this (as we have in the past changed how filesystems
> are identified).
>

I see. This is nice!
It almost sounds like "inode mapped mounts" :-)


> It would only be if the mountpoint filehandle (which is fetched without
> that context) went out of cache that inode numbers would change.  That
> would mean that the filesystem (possibly an automount) was unmounted.
> When it was remounted it could have a different device number anyway, so
> having different inode numbers would be of little consequence.
>
> >
> > But I am a bit confused about the problem.
> > If the export is of the btrfs root, then nfs client cannot access any
> > subvolumes (right?) - that was the bug report, so the value of inode
> > numbers in non-root subvolumes is not an issue.
>
> Not correct.  All objects in the filesystem are fully accessible.  The
> only problem is that some pairs of objects have the same inode number.
> This causes some programs like 'find' and 'du' to behave differently to
> expectations.  They will refuse to even look in a subvolume, because it
> looks like doing so could cause an infinite loop.  The values of inode
> numbers in non-root subvolumes is EXACTLY the issue.
>

Thanks for clarifying.

> > If export is of non-root subvolume, then why bother changing anything
> > at all? Is there a need to traverse into sub-sub-volumes?
> >
> > > 2/ We could add a btrfs mount option to control whether the uniquifier
> > > was set or not.  This would allow the sysadmin to choose when to manage
> > > any breakage.  I think this is my preference, but Josef has declared an
> > > aversion to mount options.
> > >
> > > 3/ We could add a module parameter to nfsd to control whether the
> > > uniquifier is merged in.  This again gives the sysadmin control, and it
> > > can be done despite any aversion from btrfs maintainers.  But I'd need
> > > to overcome any aversion from the nfsd maintainers, and I don't know how
> > > strong that would be yet. (A new export option isn't really appropriate.
> > > It is much more work to add an export option than the add a mount option).
> > >
> >
> > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol"
> > export option would have been a nice way to describe and opt-in to this new
> > functionality.
> >
> > But let's consider for a moment the consequences of enabling this functionality
> > automatically whenever exporting a btrfs root volume without "crossmnt":
> >
> > 1. Objects inside a subvol that are inaccessible(?) with current
> > nfs/nfsd without
> >     "crossmnt" will become accessible after enabling the feature -
> > this will match
> >     the user experience of accessing btrfs on the host
>
> Not correct - as above.
>
> > 2. The inode numbers of the newly accessible objects would not match the inode
> >     numbers on the host fs (no big deal?)
>
> Unlikely to be a problem.  Inode numbers have no meaning beyond the facts
> that:
>   - they are stable for the lifetime of the object
>   - they are unique within a filesystem (except btrfs lies about
>     filesystems)
>   - they are not zero
>
> The facts only need to be equally true on the NFS server and client..
>
> > 3. The inode numbers of objects in a snapshot would not match the inode
> >     numbers of the original (pre-snapshot) objects (acceptable tradeoff for
> >     being able to access the snapshot objects without bloating /proc/mounts?)
>
> This also should not be a problem.  Files in different snapshots are
> different things that happen to share storage (like reflinks).
> Comparing inode numbers between places which report different st_dev
> does not fit within the meaning of inode numbers.
>
> > 4. The inode numbers of objects in a subvol observed via this "cross-subvol"
> >     export would not match the inode numbers of the same objects observed
> >     via an individual subvol export
>
> The device number would differ too, so the relative values of the inode
> numbers would be irrelevant.
>
> > 5. st_ino conflicts are possible when multiplexing subvol id and inode number.
> >     overlayfs resolved those conflicts by allocating an inode number from a
> >     reserved non-persistent inode range, which may cause objects to change
> >     their inode number during the lifetime on the filesystem (sensible
> > tradeoff?)
> >
> > I think that #4 is a bit hard to swallow and #3 is borderline acceptable...
> > Both and quite hard to document and to set expectations as a non-opt-in
> > change of behavior when exporting btrfs root.
> >
> > IMO, an nfsd module parameter will give some control and therefore is
> > a must, but it won't make life easier to document and set user expectations
> > when the semantics are not clearly stated in the exports table.
> >
> > You claim that "A new export option isn't really appropriate."
> > but your only argument is that "It is much more work to add
> > an export option than the add a mount option".
> >
> > With all due respect, for this particular challenge with all the
> > constraints involved, this sounds like a pretty weak argument.
> >
> > Surely, adding an export option is easier than slowly changing all
> > userspace tools to understand subvolumes? a solution that you had
> > previously brought up.
> >
> > Can you elaborate some more about your aversion to a new
> > export option.
>
> Export options are bits in a 32bit word - so both user-space and kernel
> need to agree on names for them.  We are currently using 18, so there is
> room to grow.  It is a perfectly reasonable way to implement sensible
> features.  It is, I think, a poor way to implement hacks to work around
> misfeatures in filesystems.
>
> This is the core of my dislike for adding an export option.  Using one
> effectively admits that what btrfs is doing is a valid thing to do.  I
> don't think it is.  I don't think we want any other filesystem developer
> to think that they can emulate the behaviour because support is already
> provided.
>
> If we add any configuration to support btrfs, I would much prefer it to
> be implemented in fs/btrfs, and if not, then with loud warnings that it
> works around a deficiency in btrfs.
>   /sys/modules/nfsd/parameters/btrfs_export_workaround
>

Thanks for clarifying.
I now understand how "hacky" the workaround in nfsd is.

Naive question: could this behavior be implemented in btrfs as you
prefer, but in a way that only serves nfsd and NEW nfs mounts for
that matter (i.e. only NEW filehandles)?

Meaning passing some hint in ->getattr() and ->iterate_shared(),
sort of like idmapped mount does for uid/gid.

Thanks,
Amir.

  reply	other threads:[~2021-08-20  6:23 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-07-27 22:37 ` [PATCH 07/11] exportfs: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 10:13   ` Amir Goldstein
2021-07-29  0:28     ` NeilBrown
2021-07-29  5:27       ` Amir Goldstein
2021-08-06  7:52         ` Miklos Szeredi
2021-08-06  8:08           ` Amir Goldstein
2021-08-06  8:18             ` Miklos Szeredi
2021-07-28 19:17   ` J. Bruce Fields
2021-07-28 22:25     ` NeilBrown
2021-07-27 22:37 ` [PATCH 04/11] VFS: export lookup_mnt() NeilBrown
2021-07-30  0:31   ` Al Viro
2021-07-30  5:33     ` NeilBrown
2021-07-27 22:37 ` [PATCH 01/11] VFS: show correct dev num in mountinfo NeilBrown
2021-07-30  0:25   ` Al Viro
2021-07-30  5:28     ` NeilBrown
2021-07-30  5:54       ` Miklos Szeredi
2021-07-30  6:13         ` NeilBrown
2021-07-30  7:18           ` Miklos Szeredi
2021-07-30  7:33             ` NeilBrown
2021-07-30  7:59               ` Miklos Szeredi
2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
2021-08-02  5:25                   ` Al Viro
2021-08-02  5:40                     ` NeilBrown
2021-08-02  7:54                       ` Amir Goldstein
2021-08-02 13:53                         ` Josef Bacik
2021-08-03 22:29                           ` Qu Wenruo
2021-08-02 14:47                         ` Frank Filz
2021-08-02 21:24                         ` NeilBrown
2021-08-02  7:15                   ` Martin Steigerwald
2021-08-02 21:40                     ` NeilBrown
2021-08-02 12:39                   ` J. Bruce Fields
2021-08-02 20:32                     ` Patrick Goetz
2021-08-02 20:41                       ` J. Bruce Fields
2021-08-02 21:10                     ` NeilBrown
2021-08-02 21:50                       ` J. Bruce Fields
2021-08-02 21:59                         ` NeilBrown
2021-08-02 22:14                           ` J. Bruce Fields
2021-08-02 22:36                             ` NeilBrown
2021-08-03  0:15                               ` J. Bruce Fields
2021-07-27 22:37 ` [PATCH 03/11] VFS: pass lookup_flags into follow_down() NeilBrown
2021-07-27 22:37 ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots NeilBrown
2021-07-28  8:37   ` kernel test robot
2021-07-28  8:37   ` [RFC PATCH] btrfs: btrfs_mountpoint_expiry_timeout can be static kernel test robot
2021-07-28 13:12   ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots Christian Brauner
2021-07-29  0:43     ` NeilBrown
2021-07-29 14:38       ` Christian Brauner
2021-07-31  6:25   ` [btrfs] 5874902268: xfstests.btrfs.202.fail kernel test robot
2021-07-27 22:37 ` [PATCH 06/11] nfsd: include a vfsmount in struct svc_fh NeilBrown
2021-07-27 22:37 ` [PATCH 10/11] btrfs: introduce mapping function from location to inum NeilBrown
2021-07-27 22:37 ` [PATCH 02/11] VFS: allow d_automount to create in-place bind-mount NeilBrown
2021-07-27 22:37 ` [PATCH 09/11] nfsd: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 19:15   ` J. Bruce Fields
2021-07-28 22:29     ` NeilBrown
2021-07-30  0:42   ` Al Viro
2021-07-30  5:43     ` NeilBrown
2021-07-27 22:37 ` [PATCH 08/11] nfsd: change get_parent_attributes() to nfsd_get_mounted_on() NeilBrown
2021-07-27 22:37 ` [PATCH 05/11] VFS: new function: mount_is_internal() NeilBrown
2021-07-28  2:16   ` Al Viro
2021-07-28  3:32     ` NeilBrown
2021-07-30  0:34       ` Al Viro
2021-07-28  2:19 ` [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly Al Viro
2021-07-28  4:58 ` Wang Yugui
2021-07-28  6:04   ` Wang Yugui
2021-07-28  7:01     ` NeilBrown
2021-07-28 12:26       ` Neal Gompa
2021-07-28 19:14         ` J. Bruce Fields
2021-07-29  1:29           ` Zygo Blaxell
2021-07-29  1:43             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-28 22:50         ` NeilBrown
2021-07-29  2:37           ` Zygo Blaxell
2021-07-29  3:36             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-30  2:36                 ` NeilBrown
2021-07-30  5:25                   ` Qu Wenruo
2021-07-30  5:31                     ` Qu Wenruo
2021-07-30  5:53                       ` Amir Goldstein
2021-07-30  6:00                       ` NeilBrown
2021-07-30  6:09                         ` Qu Wenruo
2021-07-30  5:58                     ` NeilBrown
2021-07-30  6:23                       ` Qu Wenruo
2021-07-30  6:53                         ` NeilBrown
2021-07-30  7:09                           ` Qu Wenruo
2021-07-30 18:15                             ` Zygo Blaxell
2021-07-30 15:17                         ` J. Bruce Fields
2021-07-30 15:48                           ` Josef Bacik
2021-07-30 16:25                             ` Forza
2021-07-30 17:43                             ` Zygo Blaxell
2021-07-30  5:28                   ` Amir Goldstein
2021-07-28 13:43       ` g.btrfs
2021-07-29  1:39         ` NeilBrown
2021-07-29  9:28           ` Graham Cobb
2021-07-28  7:06   ` NeilBrown
2021-07-28  9:36     ` Wang Yugui
2021-07-28 19:35 ` J. Bruce Fields
2021-07-28 21:30   ` Josef Bacik
2021-07-30  0:13     ` Al Viro
2021-07-30  6:08       ` NeilBrown
2021-08-13  1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-13 14:55   ` Josef Bacik
2021-08-15  7:39   ` Goffredo Baroncelli
2021-08-15 19:35     ` Roman Mamedov
2021-08-15 21:03       ` Goffredo Baroncelli
2021-08-15 21:53         ` NeilBrown
2021-08-17 19:34           ` Goffredo Baroncelli
2021-08-17 21:39             ` NeilBrown
2021-08-18 17:24               ` Goffredo Baroncelli
2021-08-15 22:17       ` NeilBrown
2021-08-19  8:01         ` Amir Goldstein
2021-08-20  3:21           ` NeilBrown
2021-08-20  6:23             ` Amir Goldstein [this message]
2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
2021-08-18 14:54   ` [PATCH] VFS/BTRFS/NFSD: " Wang Yugui
2021-08-18 21:46     ` NeilBrown
2021-08-19  2:19       ` Zygo Blaxell
2021-08-20  2:54         ` NeilBrown
2021-08-22 19:29           ` Zygo Blaxell
2021-08-23  5:51             ` NeilBrown
2021-08-23 23:22             ` NeilBrown
2021-08-25  2:06               ` Zygo Blaxell
2021-08-23  0:57         ` Wang Yugui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOQ4uxjTc7amniZtSPWQbysD5YspE5Z=+5T1fC-bFqh56Y3_qQ@mail.gmail.com' \
    --to=amir73il@gmail.com \
    --cc=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=hch@infradead.org \
    --cc=josef@toxicpanda.com \
    --cc=kreijack@libero.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=rm@romanrm.net \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).