From: "NeilBrown" <neilb@suse.de>
To: "Miklos Szeredi" <miklos@szeredi.hu>
Cc: "Al Viro" <viro@zeniv.linux.org.uk>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Josef Bacik" <josef@toxicpanda.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Chris Mason" <clm@fb.com>, "David Sterba" <dsterba@suse.com>,
	linux-fsdevel@vger.kernel.org,
	"Linux NFS list" <linux-nfs@vger.kernel.org>,
	"Btrfs BTRFS" <linux-btrfs@vger.kernel.org>
Subject: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
Date: Mon, 02 Aug 2021 14:18:29 +1000
Message-ID: <162787790940.32159.14588617595952736785@noble.neil.brown.name>
In-Reply-To: <CAJfpegub4oBZCBXFQqc8J-zUiSW+KaYZLjZaeVm_cGzNVpxj+A@mail.gmail.com>

On Fri, 30 Jul 2021, Miklos Szeredi wrote:
> On Fri, 30 Jul 2021 at 09:34, NeilBrown <neilb@suse.de> wrote:
> 
> > But I'm curious about your reference to "some sort of subvolume
> > structure that the VFS knows about".  Do you have any references, or can
> > you suggest a search term I could try?
> 
> Found this:
> https://lore.kernel.org/linux-fsdevel/20180508180436.716-1-mfasheh@suse.de/
> 

Excellent, thanks.  Very useful.

OK.  Time for a third perspective.

With its current framing the problem is unsolvable.  So it needs to be
reframed.

By "current framing", I mean that we are trying to get btrfs to behave
in a way that meets current user-space expectations.  Specifically, the
expectation that each object in any filesystem can be uniquely
identified by a 64bit inode number.  btrfs provides functionality which
needs more than 64bits.  So it simply does not fit.  btrfs currently
fudges with device numbers to hide the problem.  This is at best an
incomplete solution, and is already being shown to be insufficient.

Therefore we need to change user-space expectations.  This has been done
before multiple times - often by breaking things and leaving it up to
user-space to fix it.  My favourite example is that NFSv2 broke the
creation of lock files with O_CREAT|O_EXCL.  User-space started using
hard-links to achieve the same result.  When NFSv3 added reliable
O_CREAT|O_EXCL support, it hardly mattered.... but I digress.

I think we need to bite the bullet and decide that 64bits is not
enough, and in fact no number of bits will ever be enough.  overlayfs
makes this clear.  overlayfs merges multiple filesystems, and so needs
strictly more bits to uniquely identify all inodes than any of the
filesystems use.  Currently it over-loads the high bits and hopes the
filesystem doesn't use them.

The "obvious" choice for a replacement is the file handle provided by
name_to_handle_at() (falling back to st_ino if name_to_handle_at isn't
supported by the filesystem).  This returns an extensible opaque
byte-array.  It is *already* more reliable than st_ino.  Comparing
st_ino is only a reliable way to check if two files are the same if you
have both of them open.  If you don't, then one of the files might have
been deleted and the inode number reused for the other.  A filehandle
contains a generation number which protects against this.

So I think we need to strongly encourage user-space to start using
name_to_handle_at() whenever there is a need to test if two things are
the same.
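
A minimal sketch of such a comparison from user-space, assuming Linux
and a libc that exports name_to_handle_at().  The helper names
file_identity() and same_file() are hypothetical, and including st_dev
in the identity is an assumption made here so that handles from
different filesystems are never compared with each other.  When the
call is unavailable it falls back to st_ino, as suggested above.

```python
import ctypes
import os

MAX_HANDLE_SZ = 128   # same value as the kernel's MAX_HANDLE_SZ
AT_FDCWD = -100       # resolve relative paths against the current directory


class FileHandle(ctypes.Structure):
    # Mirrors struct file_handle, with a fixed-size buffer for f_handle.
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


def file_identity(path):
    """Return a comparable identity: a file handle if available, else st_ino."""
    st = os.lstat(path)  # flags=0 below does not follow symlinks either
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        rc = libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                                    ctypes.byref(fh), ctypes.byref(mount_id), 0)
    except (AttributeError, OSError):
        rc = -1  # no usable libc wrapper on this platform
    if rc == 0:
        handle = bytes(fh.f_handle[:fh.handle_bytes])
        return ("fh", st.st_dev, fh.handle_type, handle)
    # The filesystem (or platform) cannot produce a handle: fall back.
    return ("ino", st.st_dev, st.st_ino)


def same_file(a, b):
    """True if both paths currently name the same object."""
    return file_identity(a) == file_identity(b)
```

The handle bytes are treated as opaque and only ever compared for
equality, which is all that an "are these the same?" test needs.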

This frees us to be a little less precise about assuring st_ino is
always unique, but only a little.  We still want to minimize conflicts
and avoid them in common situations.

A filehandle typically has some bytes used to locate the inode -
"location" - and some to validate it - "generation".  In general, st_ino
must now be seen as a hash of the "location".  It could be a generic hash
(xxhash? jhash?) or it could be a careful xor of the bits.

For btrfs, the "location" is root.objectid ++ file.objectid.  I think
the inode should become (file.objectid ^ swab64(root.objectid)).  This
will provide numbers that are unique until you get very large subvols,
and very many subvols.  It also ensures that two inodes in the same
subvol will be guaranteed to have different st_ino.
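
The arithmetic of that suggestion can be sketched as follows.  This is
only an illustration in user-space Python, not kernel code; the name
proposed_ino() is hypothetical and the objectid values used below are
chosen purely as examples.

```python
MASK64 = (1 << 64) - 1


def swab64(x):
    """Reverse the byte order of a 64-bit value, like the kernel's swab64()."""
    return int.from_bytes((x & MASK64).to_bytes(8, "little"), "big")


def proposed_ino(root_objectid, file_objectid):
    """file.objectid ^ swab64(root.objectid), truncated to 64 bits."""
    return (file_objectid ^ swab64(root_objectid)) & MASK64


# Within one subvol the xor term is constant, so distinct file objectids
# always give distinct results.
same_subvol = {proposed_ino(256, f) for f in range(256, 1256)}

# A collision needs a file objectid whose high bytes are in use, i.e. a
# very large subvol, together with a matching root objectid.
huge_file = 257 ^ swab64(5) ^ swab64(256)
collision = proposed_ino(5, huge_file) == proposed_ino(256, 257)
```

Small root objectids land in the high bytes after the swap and small
file objectids stay in the low bytes, which is why the two only meet
once both become large.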

This will quickly cause problems for overlayfs as it means that if btrfs
is used with overlayfs, the top few bits won't be zero.  Possibly btrfs
could be polite and shift the swab64(root.objectid) down 8 bits to make
room.  Possibly overlayfs should handle this case (top N-bits not all
zero), and switch to a generic hash of the inode number (or preferably
the filehandle) to (64-N bits).
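
Both halves of that idea can be sketched together.  Everything here is
an assumption for illustration: polite_btrfs_ino() and overlay_ino()
are hypothetical names, N is taken to be 8, the layer index is assumed
to fit in N bits, and BLAKE2b stands in for whatever generic hash
would really be chosen.

```python
import hashlib

MASK64 = (1 << 64) - 1


def swab64(x):
    """Reverse the byte order of a 64-bit value."""
    return int.from_bytes((x & MASK64).to_bytes(8, "little"), "big")


def polite_btrfs_ino(root_objectid, file_objectid, spare_bits=8):
    """Shift the swapped root id down so the top spare_bits stay clear.

    The shift discards the top byte of root_objectid, and the result only
    keeps its high bits zero while file_objectid is below 2**(64 - spare_bits).
    """
    return (file_objectid ^ (swab64(root_objectid) >> spare_bits)) & MASK64


def overlay_ino(layer, real_ino, n_bits=8):
    """Put the layer index in the top n_bits, hashing real_ino if it is too wide."""
    low_bits = 64 - n_bits
    if real_ino >> low_bits:
        # The lower filesystem uses the top bits: hash down to 64-N bits.
        digest = hashlib.blake2b(real_ino.to_bytes(8, "little"),
                                 digest_size=8).digest()
        real_ino = int.from_bytes(digest, "little") & ((1 << low_bits) - 1)
    return (layer << low_bits) | real_ino
```

Once the hash path is taken uniqueness is no longer guaranteed, only
likely, which matches the "as unique as practical" goal rather than a
strict one.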

If we convince user-space to use filehandles to compare objects, the NFS
problems I initially was trying to address go away.  Even before that,
if btrfs switches to a hashed (i.e. xor) inode number, then the problems
also go away.  But they aren't the only problems here.

Accessing the fhandle isn't always possible.  For example reading
/proc/locks reports major:minor:inode-number for each file (This is the
major:minor from the superblock, so btrfs internal dev numbers aren't
used).  The filehandle is simply not available.  I think the only way
to address this is to create a new file, "/proc/locks2" :-)
Similarly the "lock:" lines in /proc/$PID/fdinfo/$FD need to be duplicated
as "lock2:" lines with filehandle instead of inode number.  Ditto for
'inotify:' lines and possibly others.
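
To make concrete what a tool gets from these files today, here is a
small parsing sketch.  It assumes the usual line layout, in which major
and minor are printed in hex and the inode number in decimal; the
helper name parse_lock_line() and the sample line are made up for
illustration.

```python
import re

# major:minor:inode as it appears in a lock line, e.g. "08:01:131073".
_LOCK_ID = re.compile(r"\b([0-9a-f]{2,}):([0-9a-f]{2,}):(\d+)\b")


def parse_lock_line(line):
    """Return (major, minor, inode) from a /proc/locks style line, or None."""
    m = _LOCK_ID.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))


# Illustrative line only, not captured from a real system.
sample = "1: POSIX  ADVISORY  WRITE 1234 08:01:131073 0 EOF"
ident = parse_lock_line(sample)
```

Those three numbers are all a reader of the file has to go on.  There
is no generation and no subvol, so nothing here can tell apart two
btrfs files that share an inode number.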

Similarly /proc/$PID/maps contains the inode number with no fhandle.
The situation isn't so bad there as there is a path name, and you can
even call name_to_handle_at("/proc/$PID/map_files/$RANGE") to get the
fhandle.  It might be better to provide a new file though.

Next we come to the use of different device numbers in the one btrfs
filesystem.  I'm of the opinion that this was an unfortunate choice
that we need to deprecate.  Tools that use fhandle won't need it to
differentiate inodes, but there is more to the story than just that
need.

As has been mentioned, people depend on "du -x" and "find -mount" (aka
"-xdev") to stay within a "subvol".  We need to provide a clean
alternate before discouraging that usage.

xfs, ext4, fuse, and f2fs each (can) maintain a "project id" for each
inode, which effectively groups inodes into a tree.  This is used for
project quotas.  At one level this is conceptually very similar to the
btrfs subtree.root.objectid.  It is different in that it is only 32 bits
(:-[) and is mapped between user name-spaces like uids and gids.  It is
similar in that it identifies a group of inodes that are accounted
together and are (generally) contiguous in a tree.

If we encouraged "du" to have a "--proj" option (-j) which stays within
a project, and gave a similar option to find, that could be broadly
useful.  Then if btrfs provided the subvol objectid as fsx_projid
(available in FS_IOC_FSGETXATTR ioctl), then "du --proj" on btrfs would
stay in a subvol.  Longer term it might make sense to add a 64bit
project-id to statx.  I don't think it would make sense for btrfs to
have a 'project' concept that is different from the "subvolume".
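
For reference, reading fsx_projid from user-space looks roughly like
this.  It is a sketch under stated assumptions: the ioctl number is
built with the generic ioctl encoding (some architectures encode the
direction bits differently), project_id() is a hypothetical helper,
and it reports only what the filesystem returns today, not the subvol
id proposed above.

```python
import fcntl
import os
import struct

# struct fsxattr from <linux/fs.h>: fsx_xflags, fsx_extsize, fsx_nextents,
# fsx_projid, fsx_cowextsize (all __u32) followed by 8 bytes of padding.
FSXATTR = struct.Struct("=5I8s")


def _ior(type_char, nr, size):
    """Generic _IOR() encoding: read direction, size, type and number."""
    return (2 << 30) | (size << 16) | (ord(type_char) << 8) | nr


FS_IOC_FSGETXATTR = _ior("X", 31, FSXATTR.size)


def project_id(path):
    """Return fsx_projid for path, or None if the ioctl is not supported."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_FSGETXATTR, bytes(FSXATTR.size))
    except OSError:
        return None
    finally:
        os.close(fd)
    return FSXATTR.unpack(buf)[3]
```

A "du --proj" style walk would call something like this on the
starting directory and then skip any entry whose id differs.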

It would be cool if "df" could have a "--proj" (or similar) flag so that
it would report the usage of a "subtree" (given a path).  Unfortunately
there isn't really an interface for this.  Going through the quota
system might be nice, but I don't think it would work.

Another thought about btrfs device numbers is that, providing inode
numbers are (nearly) unique, we don't really need more than 2.  A btrfs
filesystem could allocate 2 anon device numbers.  One would be assigned
to the root, and each subvolume would get whichever device number its
parent doesn't have.  This would stop "du -x" and "find -mount" and
similar from crossing into subvols.  There could be a mount option to
select between "1", "2", and "many" device numbers for a filesystem.
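
The two-number scheme amounts to alternating on nesting depth.  A toy
model, with hypothetical names (assign_devs(), parent_of) and made-up
device numbers:

```python
def assign_devs(parent_of, dev_a, dev_b):
    """Give each subvol whichever of two device numbers its parent lacks.

    parent_of maps a subvol id to its parent's id, or to None for the root.
    """
    devs = {}

    def dev_for(subvol):
        if subvol not in devs:
            parent = parent_of[subvol]
            if parent is None:
                devs[subvol] = dev_a          # the root takes the first number
            elif dev_for(parent) == dev_a:
                devs[subvol] = dev_b
            else:
                devs[subvol] = dev_a
        return devs[subvol]

    for subvol in parent_of:
        dev_for(subvol)
    return devs


# Subvol 5 is the root; 256 and 257 sit under it; 258 is nested in 256.
layout = {5: None, 256: 5, 257: 5, 258: 256}
devs = assign_devs(layout, dev_a=40, dev_b=41)
```

Every parent/child boundary sees st_dev change, which is all that
"du -x" and "find -mount" check, while siblings and grandchildren may
share a number.  That is only acceptable if inode numbers are (nearly)
unique across the whole filesystem.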

- I note that cephfs plays games with st_dev too....  I wonder if we can
  learn anything from that.
- audit uses sb->s_dev without asking the filesystem.  So it won't
  handle btrfs correctly.  I wonder if there is room for it to use
  file handles.

I accept that I'm proposing some BIG changes here, and they might break
things.  But btrfs is already broken in various ways.  I think we need a
goal to work towards which will eventually remove all breakage and still
have room for expansion.  I think that must include:

- providing as-unique-as-practical inode numbers across the whole
  filesystem, and deprecating the internal use of different device
  numbers.  Make it possible to mount without them ASAP, and aim to
  make that the default eventually.
- working with user-space tool/library developers to use
  name_to_handle_at() to identify inodes, only using st_ino
  as a fall-back
- adding filehandles to various /proc etc files as needed, either
  duplicating lines or duplicating files.  And helping applications which
  use these files to migrate (I would *NOT* change the dev numbers in
  the current file to report the internal btrfs dev numbers the way that
  SUSE does.  I would prefer that current breakage could be used to
  motivate developers towards depending instead on fhandles).
- exporting subtree (aka subvol) id to user-space, possibly paralleling
  proj_id in some way, and extending various tools to understand
  subtrees

Who's with me??

NeilBrown
