[PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
@ 2021-08-09  3:55 NeilBrown
  2021-08-09  3:55 ` [PATCH 4/4] Add "tree" number to "inode" number in various /proc files NeilBrown
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: NeilBrown @ 2021-08-09  3:55 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

I continue to search for a way forward for btrfs so that its behaviour
with respect to device numbers and subvols is somewhat coherent.

This series implements some of the ideas in my "A Third perspective"[1],
though with changes is various details.

I introduce two new mount options, which default to
no-change-in-behaviour.

 -o inumbits=  causes inode numbers to be more unique across a whole btrfs
               filesystem, and is many cases completely unique.  Mounting
               with "-i inumbits=56" will resolve the NFS issues that
               started me tilting at this particular windmill.

 -o numdevs=  can reduce the number of distinct devices reported by
              stat(), either to 2 or to 1.
              Both ease problems for sites that exhaust their supply of
              device numbers.
              '2' allows "du -x" to continue to work, but is otherwise
              rather strange.
              '1' breaks the use of "du -x" and similar to examine a
              single subvol which might have subvol descendants, but
              provides generally sane behaviour
              "-o numdevs=1" also forces inumbits to have a useful value.

I introduce a "tree id" which can be discovered using statx().  Two
files with the same dev and ino might still be different if the tree-ids
are different.  Connected files with the same tree-id may be usefully
considered to be related.

I also change various /proc files (only when numdevs=1 is used) to
provide extra information so they are useful with btrfs despite subvols.
/proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
The inode number becomes "XX:YY" where XX is the subvol number (tree id)
and YY is the inode number.

An alternate might be to report a number which might use up to 128 bits.
Which is less likely to seriously break code?

Note that code which ignores badly formatted lines is safe, because it
will never currently find a match for a btrfs file in these files
anyway.  The device number they report is never returned in st_dev for
stat() on any file.

The audit subsystem and one or two other places report dev/ino and so
need enhanced, but I haven't tried to address those.

Various trace points also report dev/ino.  I haven't tried thinking
about those either.

Thanks for your upcoming replies!

NeilBrown

---

NeilBrown (4):
      btrfs: include subvol identifier in inode number if -o inumbits=...
      btrfs: add numdevs= mount option.
      VFS/btrfs: add STATX_TREE_ID
      Add "tree" number to "inode" number in various /proc files.

 fs/btrfs/ctree.h          | 17 +++++++++++++++--
 fs/btrfs/disk-io.c        | 24 +++++++++++++++++++++---
 fs/btrfs/inode.c          | 39 ++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c          |  6 ++++--
 fs/btrfs/super.c          | 31 +++++++++++++++++++++++++++++++
 fs/inode.c                |  1 +
 fs/locks.c                | 12 +++++++++---
 fs/notify/fdinfo.c        | 19 ++++++++++++++-----
 fs/proc/nommu.c           | 11 ++++++++---
 fs/proc/task_mmu.c        | 17 ++++++++++++-----
 fs/proc/task_nommu.c      | 11 ++++++++---
 fs/stat.c                 |  2 ++
 include/linux/fs.h        |  3 ++-
 include/linux/stat.h      | 13 +++++++++++++
 include/uapi/linux/stat.h |  3 ++-
 samples/vfs/test-statx.c  |  4 +++-
 16 files changed, 183 insertions(+), 30 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/4] btrfs: include subvol identifier in inode number if -o inumbits=...
  2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
  2021-08-09  3:55 ` [PATCH 4/4] Add "tree" number to "inode" number in various /proc files NeilBrown
  2021-08-09  3:55 ` [PATCH 3/4] VFS/btrfs: add STATX_TREE_ID NeilBrown
@ 2021-08-09  3:55 ` NeilBrown
  2021-08-09  3:55 ` [PATCH 2/4] btrfs: add numdevs= mount option NeilBrown
  2021-08-10 20:51 ` [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness Josef Bacik
  4 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2021-08-09  3:55 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

A btrfs filesystem uses 112 bits to identify an object: 48 for the
object-id of the subvolume (really a 'subtree') and 64 for the object
within that tree.

It only exposes the 64bits in i_ino (and st_ino for stat()) and attempts
to hide the non-uniqueness by reporting a different st_dev in stat() for
each different subvol.

This is incomplete and doesn't scale.

It is incomplete because there are places other than 'stat()' where the
device number is visible to user-space, including /proc/$PID/mountinfo,
/proc/$PID/maps and /proc/locks.  These report the device-number for the
whole filesystem together with i_ino, and so do not match the
st_dev+st_ino reported by stat().

It is also incomplete because nfsd doesn't notice the st_dev value,
depending only of the existence of mount points to discover filesystem
changes.

It doesn't scale because there are a limited number of anon device
numbers that can be allocated, which is approximately 20 bits, much less
than the 48 bits worth of subvols which btrfs supports.

I believe that we *must* extend the user-space API to properly support
btrfs - trying to fit within it will always cause pain, at least in the
extremes.  I believe that the use of varying device numbers is not a
viable long-term solution and must be deprecated.

This patch is a first step towards deprecating the use of device numbers
to add uniqueness.  It changes the inode numbers reported so they are
unique across all subvols for modest sized filesystems.  It does this
by creating an 'overlay' from the subvol number and xor-ing that into
the file's object id.  This results in reported inode numbers being
completely unique within a subvol, and mostly unique between subvols.

The overlay is *not* xor-ed in when it exactly matches the objectid, as
that would produce zero.

A few placed in the code assume that ->i_ino is the objectid.  Those few
are changed to call btrfs_ino(), and ino now subtracts the overlay.

The "overlay" is created by byte-swapping the subvol identifier, then
optionally shifting down a few bits so there is unused space at the
top-end.  When the maximum objectid in use requires many fewer than 64
bits, and the maximum subvol id in use does not use all of the
remaining bits, complete uniqueness can be provided.  For larger
fileystems, complete uniqueness cannot be guaranteed.

The size of the shift can be set using the "inumbits" mount option.  A
value of 64 suppresses any shift and maximum uniqueness is provided.  A
value of 0 (the default) disables the overlay functionality.

A generally good value on 64bit systems which might use overlayfs with
btrdfs is 56, as this provides broad uniqueness, while leaving some bits
for overlayfs to differentiate between merged filesystems.

The only material improvement from this patch alone will come to
applications and tools which do not pay attention to st_dev (as that is
unchanged).  In particular, nfsd will, when "-o inumbits=56", report
(mostly) unique inode numbers to the NFS client, and some problems
caused by "find" and related tools detecting that the root of a subvol
has the name inode number as the root of its parent - both of which
appear to NFS to be in the same filesystem.

Subsequent patches will build on this base to allow the use of multiple
devices to be controlled, and then to allow complete uniqueness through
interface extensions.

ISSUE: In btrfs, inode numbers below the highest-currently-allocated are
   never reused.  This allows the highest inode number to be arbitrarily
   higher than the number of inodes.  This means that an "old"
   filesystem can trigger a risk of non-uniqueness just as a large
   filesystem can.

ISSUE: I don't understand the role of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID
   so I don't know if I have to do anything when that value is assigned
   to i_ino.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/btrfs_inode.h |   22 ++++++++++++++--------
 fs/btrfs/ctree.h       |    4 ++++
 fs/btrfs/disk-io.c     |   15 +++++++++++++++
 fs/btrfs/inode.c       |   17 ++++++++++++++---
 fs/btrfs/ioctl.c       |    2 +-
 fs/btrfs/super.c       |   24 +++++++++++++++++++++++-
 6 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..18e1b071bb69 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -246,13 +246,6 @@ static inline unsigned long btrfs_inode_hash(u64 objectid,
 	return (unsigned long)h;
 }

-static inline void btrfs_insert_inode_hash(struct inode *inode)
-{
-	unsigned long h = btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root);
-
-	__insert_inode_hash(inode, h);
-}
-
 static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 {
 	u64 ino = inode->location.objectid;
@@ -261,11 +254,24 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 	 * !ino: btree_inode
 	 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 	 */
-	if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY)
+	if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY) {
+		/* vfs_inode.i_ino has inum_overlay merged in, when
+		 * that wouldn't produce zero. We need to remove it here.
+		 */
 		ino = inode->vfs_inode.i_ino;
+		if (ino != inode->root->inum_overlay)
+			ino ^= inode->root->inum_overlay;
+	}
 	return ino;
 }

+static inline void btrfs_insert_inode_hash(struct inode *inode)
+{
+	unsigned long h = btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root);
+
+	__insert_inode_hash(inode, h);
+}
+
 static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
 {
 	i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e5e53e592d4f..0ef557db3a8b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -987,6 +987,8 @@ struct btrfs_fs_info {
 	u32 csums_per_leaf;
 	u32 stripesize;

+	unsigned short inumbits;
+
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
 	struct rb_root swapfile_pins;
@@ -1145,6 +1147,8 @@ struct btrfs_root {

 	u64 last_trans;

+	u64 inum_overlay;
+
 	u32 type;

 	u64 free_objectid;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a59ab7b9aea0..7f3bfa042d66 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1202,6 +1202,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
 	root->root_key.objectid = objectid;
 	root->anon_dev = 0;
+	if (fs_info->inumbits &&
+	    objectid != BTRFS_FS_TREE_OBJECTID &&
+	    is_fstree(objectid))
+		root->inum_overlay = swab64(objectid) >> (64 - fs_info->inumbits);
+	else
+		root->inum_overlay = 0;

 	spin_lock_init(&root->root_item_lock);
 	btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
@@ -3314,12 +3320,21 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	 */
 	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;

+	fs_info->inumbits = BITS_PER_LONG + 1; /* impossible value */
+
 	ret = btrfs_parse_options(fs_info, options, sb->s_flags);
 	if (ret) {
 		err = ret;
 		goto fail_alloc;
 	}

+	/* By default, use inumbits=0 to avoid behaviour change.
+	 * "-o inumbits" can over-ride this default.
+	 * BITS_PER_LONG * 7 / 8 is a good value to use
+	 */
+	if (fs_info->inumbits > BITS_PER_LONG)
+		fs_info->inumbits = 0;
+
 	features = btrfs_super_incompat_flags(disk_super) &
 		~BTRFS_FEATURE_INCOMPAT_SUPP;
 	if (features) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 01099bf602fb..860cb5045123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5782,6 +5782,11 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
 	struct btrfs_iget_args *args = p;

 	inode->i_ino = args->ino;
+	if (args->root && args->ino != args->root->inum_overlay)
+		/* This inode number will still be unique within this
+		 * 'root', and should be nearly unique across the filesystem.
+		 */
+		inode->i_ino ^= args->root->inum_overlay;
 	BTRFS_I(inode)->location.objectid = args->ino;
 	BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
 	BTRFS_I(inode)->location.offset = 0;
@@ -6092,6 +6097,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)

 	while (1) {
 		struct dir_entry *entry;
+		u64 inum;

 		leaf = path->nodes[0];
 		slot = path->slots[0];
@@ -6136,7 +6142,10 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 		put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
 				&entry->type);
 		btrfs_dir_item_key_to_cpu(leaf, di, &location);
-		put_unaligned(location.objectid, &entry->ino);
+		inum = location.objectid;
+		if (inum != root->inum_overlay)
+			inum ^= root->inum_overlay;
+		put_unaligned(inum, &entry->ino);
 		put_unaligned(found_key.offset, &entry->offset);
 		entries++;
 		addr += sizeof(struct dir_entry) + name_len;
@@ -6333,7 +6342,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
 	args.root = BTRFS_I(inode)->root;

 	return insert_inode_locked4(inode,
-		   btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root),
+		   btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root),
 		   btrfs_find_actor, &args);
 }

@@ -6412,6 +6421,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	 * number if we fail afterwards in this function.
 	 */
 	inode->i_ino = objectid;
+	if (objectid != root->inum_overlay)
+		inode->i_ino ^= root->inum_overlay;

 	if (dir && name) {
 		trace_btrfs_inode_request(dir);
@@ -9515,7 +9526,7 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,

 	/* check for collisions, even if the  name isn't there */
-	ret = btrfs_check_dir_item_collision(dest, new_dir->i_ino,
+	ret = btrfs_check_dir_item_collision(dest, btrfs_ino(BTRFS_I(new_dir)),
 			     new_dentry->d_name.name,
 			     new_dentry->d_name.len);

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..e008a9ceb827 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -907,7 +907,7 @@ static noinline int btrfs_mksubvol(const struct path *parent,
 	 * check for them now when we can safely fail
 	 */
 	error = btrfs_check_dir_item_collision(BTRFS_I(dir)->root,
-					       dir->i_ino, name,
+					       btrfs_ino(BTRFS_I(dir)), name,
 					       namelen);
 	if (error)
 		goto out_dput;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..5f3350e2f7ec 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -359,6 +359,7 @@ enum {
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
 	Opt_discard_mode,
+	Opt_inumbits,
 	Opt_norecovery,
 	Opt_ratio,
 	Opt_rescan_uuid_tree,
@@ -427,6 +428,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_discard_mode, "discard=%s"},
+	{Opt_inumbits, "inumbits=%u"},
 	{Opt_nodiscard, "nodiscard"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -830,6 +832,25 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_clear_and_info(info, FLUSHONCOMMIT,
 					     "turning off flush-on-commit");
 			break;
+		case Opt_inumbits:
+			if (info->inumbits <= BITS_PER_LONG)
+				/* silently ignore subsequent change
+				 * e.g. on remount
+				 */
+				break;
+			ret = match_int(&args[0], &intarg);
+			if (ret)
+				goto out;
+			if (intarg > BITS_PER_LONG ||
+			    (intarg && intarg < BITS_PER_LONG / 2)) {
+				btrfs_err(info,
+					  "inumbits must be 0 or in range [%d..%d]",
+					  BITS_PER_LONG/2, BITS_PER_LONG);
+				ret = -EINVAL;
+				goto out;
+			}
+			info->inumbits = intarg;
+			break;
 		case Opt_ratio:
 			ret = match_int(&args[0], &intarg);
 			if (ret)
@@ -1537,6 +1558,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 #endif
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
+	seq_printf(seq, ",inumbits=%u", info->inumbits);
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	subvol_name = btrfs_get_subvol_name_from_objectid(info,
@@ -1570,7 +1592,7 @@ static int btrfs_set_super(struct super_block *s, void *data)
  */
 static inline int is_subvolume_inode(struct inode *inode)
 {
-	if (inode && inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+	if (inode && btrfs_ino(BTRFS_I(inode)) == BTRFS_FIRST_FREE_OBJECTID)
 		return 1;
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/4] btrfs: add numdevs= mount option.
  2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
                   ` (2 preceding siblings ...)
  2021-08-09  3:55 ` [PATCH 1/4] btrfs: include subvol identifier in inode number if -o inumbits= NeilBrown
@ 2021-08-09  3:55 ` NeilBrown
  2021-08-09  7:50   ` kernel test robot
  2021-08-10 20:51 ` [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness Josef Bacik
  4 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2021-08-09  3:55 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

btrfs currently allocates multiple anonymous bdev numbers to hide the
fact that inode numbers are not unique across "subvolumes".
Each subvol gets a different device number.

As described in a previous patch, this is incomplete, doesn't scale, and
should be deprecated.  This patch is another step to deprecation.

With mount option "-o numdevs=many", which is the default, the current
behaviour is preserved.

With mount option "-o numdevs=1", the st_dev reported by stat() is
exactly the number that appears in /proc/$PID/mountinfo (i.e.
sb->s_dev).  This will prevent "du -x", "find -xdev" and similar tools
from keeping within a subvol, but is otherwise quite functional.

If numdevs=1 and inumbits=0, then there will often be inode number
reuse, so that combination is forbidden and the default fo inumbits
changes to BITS_PER_LONG*7/8.  With larger inumbits (close to
BITS_PER_LONG), inode number reuse is still possible, but only with
large or old filesystems.

With mount option "-o numdevs=2", precisely two anon device numbers are
allocated.  Each subvol gets the number that its parent isn't using.
When subvols are moved, the device number reported will change if needed
to differentiate from its parent.
If a subvol with dependent subvols is moved and the device numbers need
to change, the numbers in dependent subvols that are currently in cache
will NOT change.  Fixing this is a stretch goal.

Using numdevs=2 removes any problems with exhausting the number of
available anon devs, and preserves the functionality of "du -x" and
similar.  It may be a useful option for sites that experience exhaustion
problems.

numdevs=1 is, at this stage, most useful for exploring the consequences
of fully deprecating the use of multiple device numbers.  It may also be
useful for site that find they have no dependency on multiple device
numbers.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/ctree.h   |   17 +++++++++++++++--
 fs/btrfs/disk-io.c |   24 +++++++++++++++++++++---
 fs/btrfs/inode.c   |   29 ++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.c   |    6 ++++--
 fs/btrfs/super.c   |   30 ++++++++++++++++++++++++++++++
 5 files changed, 98 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0ef557db3a8b..2caedb8c8c6d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -988,6 +988,14 @@ struct btrfs_fs_info {
 	u32 stripesize;
 
 	unsigned short inumbits;
+	/* num_devs can be:
+	 * 1 - all files in all trees use sb->s_dev
+	 * 2 - file trees alternate between using sb->s_dev and
+	 *     secondary_anon_dev.
+	 * 3 (BTTSF_MANY_DEVS) - Each subtree uses a unique ->anon_dev
+	 */
+	unsigned short num_devs;
+	dev_t secondary_anon_dev;
 
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
@@ -1035,6 +1043,8 @@ struct btrfs_fs_info {
 #endif
 };
 
+#define BTRFS_MANY_DEVS	(3)
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -1176,10 +1186,13 @@ struct btrfs_root {
 	 */
 	struct radix_tree_root delayed_nodes_tree;
 	/*
-	 * right now this just gets used so that a root has its own devid
-	 * for stat.  It may be used for more later
+	 * If fs_info->num_devs == 3 (BTRFS_MANY_DEVS) anon_dev holds a device
+	 * number to be reported by ->getattr().
+	 * If fs_info->num_devs == 2, anon_dev is 0 and use_secondary_dev
+	 * is true when this root uses the secondary, not primary, dev.
 	 */
 	dev_t anon_dev;
+	bool use_secondary_dev;
 
 	spinlock_t root_item_lock;
 	refcount_t refs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f3bfa042d66..5127e2689756 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1516,7 +1516,8 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
 	 * userspace, the id pool is limited to 1M
 	 */
 	if (is_fstree(root->root_key.objectid) &&
-	    btrfs_root_refs(&root->root_item) > 0) {
+	    btrfs_root_refs(&root->root_item) > 0 &&
+	    root->fs_info->num_devs == BTRFS_MANY_DEVS) {
 		if (!anon_dev) {
 			ret = get_anon_bdev(&root->anon_dev);
 			if (ret)
@@ -3332,8 +3333,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	 * "-o inumbits" can over-ride this default.
 	 * BITS_PER_LONG * 7 / 8 is a good value to use
 	 */
-	if (fs_info->inumbits > BITS_PER_LONG)
-		fs_info->inumbits = 0;
+	if (fs_info->inumbits > BITS_PER_LONG) {
+		if (fs_info->num_devs == 1)
+			fs_info->inumbits = BITS_PER_LONG * 7 / 8;
+		else
+			fs_info->inumbits = 0;
+	}
 
 	features = btrfs_super_incompat_flags(disk_super) &
 		~BTRFS_FEATURE_INCOMPAT_SUPP;
@@ -3379,6 +3384,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
 	fs_info->stripesize = stripesize;
 
+	if (fs_info->num_devs == 0)
+		/* set default value */
+		fs_info->num_devs = BTRFS_MANY_DEVS;
+
+	if (fs_info->num_devs == 2) {
+		err = get_anon_bdev(&fs_info->secondary_anon_dev);
+		if (err)
+			goto fail_alloc;
+	}
 	/*
 	 * mixed block groups end up with duplicate but slightly offset
 	 * extent buffers for the same range.  It leads to corruptions
@@ -4446,6 +4460,10 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 	btrfs_close_devices(fs_info->fs_devices);
+
+	if (fs_info->secondary_anon_dev)
+		free_anon_bdev(fs_info->secondary_anon_dev);
+	fs_info->secondary_anon_dev = 0;
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 860cb5045123..30fa64cbe6dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5966,6 +5966,8 @@ struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
 			iput(inode);
 			inode = ERR_PTR(ret);
 		}
+		if (fs_info->num_devs == 2)
+			sub_root->use_secondary_dev = !root->use_secondary_dev;
 	}
 
 	return inode;
@@ -9204,7 +9206,15 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 				  STATX_ATTR_NODUMP);
 
 	generic_fillattr(&init_user_ns, inode, stat);
-	stat->dev = BTRFS_I(inode)->root->anon_dev;
+	/* If we don't set stat->dev here, sb->s_dev will be used */
+	switch (btrfs_sb(inode->i_sb)->num_devs) {
+	case 2:
+		if (BTRFS_I(inode)->root->use_secondary_dev)
+			stat->dev = btrfs_sb(inode->i_sb)->secondary_anon_dev;
+		break;
+	case BTRFS_MANY_DEVS:
+		stat->dev = BTRFS_I(inode)->root->anon_dev;
+	}
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
@@ -9390,6 +9400,15 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	if (new_inode->i_nlink == 1)
 		BTRFS_I(new_inode)->dir_index = new_idx;
 
+	if (fs_info->num_devs == 2 &&
+	    root->use_secondary_dev != dest->use_secondary_dev) {
+		BTRFS_I(old_inode)->root->use_secondary_dev =
+				!dest->use_secondary_dev;
+		BTRFS_I(new_inode)->root->use_secondary_dev =
+				!root->use_secondary_dev;
+		// FIXME any subvols beneeath 'old_inode' or 'new_inode'
+		// that are in cache are now wrong.
+	}
 	if (root_log_pinned) {
 		btrfs_log_new_name(trans, BTRFS_I(old_inode), BTRFS_I(old_dir),
 				   new_dentry->d_parent);
@@ -9656,6 +9675,14 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		goto out_fail;
 	}
 
+	if (fs_info->num_devs == 2 &&
+	    root->use_secondary_dev != dest->use_secondary_dev) {
+		BTRFS_I(old_inode)->root->use_secondary_dev =
+				!dest->use_secondary_dev;
+		// FIXME any subvols beneeath 'old_inode' that are
+		// in cache are now wrong.
+	}
+
 	if (old_inode->i_nlink == 1)
 		BTRFS_I(old_inode)->dir_index = index;
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e008a9ceb827..a246f91b4df4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -522,7 +522,8 @@ static noinline int create_subvol(struct inode *dir,
 	if (ret)
 		goto fail_free;
 
-	ret = get_anon_bdev(&anon_dev);
+	if (fs_info->num_devs == BTRFS_MANY_DEVS)
+		ret = get_anon_bdev(&anon_dev);
 	if (ret < 0)
 		goto fail_free;
 
@@ -729,7 +730,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (!pending_snapshot)
 		return -ENOMEM;
 
-	ret = get_anon_bdev(&pending_snapshot->anon_dev);
+	if (fs_info->num_devs == BTRFS_MANY_DEVS)
+		ret = get_anon_bdev(&pending_snapshot->anon_dev);
 	if (ret < 0)
 		goto free_pending;
 	pending_snapshot->root_item = kzalloc(sizeof(struct btrfs_root_item),
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5f3350e2f7ec..b1aecb834234 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -361,6 +361,7 @@ enum {
 	Opt_discard_mode,
 	Opt_inumbits,
 	Opt_norecovery,
+	Opt_numdevs,
 	Opt_ratio,
 	Opt_rescan_uuid_tree,
 	Opt_skip_balance,
@@ -431,6 +432,7 @@ static const match_table_t tokens = {
 	{Opt_inumbits, "inumbits=%u"},
 	{Opt_nodiscard, "nodiscard"},
 	{Opt_norecovery, "norecovery"},
+	{Opt_numdevs, "numdevs=%s"},
 	{Opt_ratio, "metadata_ratio=%u"},
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_skip_balance, "skip_balance"},
@@ -849,8 +851,35 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				ret = -EINVAL;
 				goto out;
 			}
+			if (intarg == 0 && info->num_devs == 1) {
+				btrfs_err(info,
+					  "inumbits=0 not permitted when numdevs=1");
+				ret = -EINVAL;
+				goto out;
+			}
 			info->inumbits = intarg;
 			break;
+		case Opt_numdevs:
+			if (info->num_devs) {
+				; /* silently ignore attempts to change this */
+			} else if (strcmp(args[0].from, "many") == 0) {
+				info->num_devs = BTRFS_MANY_DEVS;
+			} else if (strcmp(args[0].from, "1") == 0) {
+				if (info->inumbits == 0) {
+					btrfs_err(info,
+"numdevs=1 not permitted with inumbits=0");
+					ret = -EINVAL;
+				}
+				info->num_devs = 1;
+			} else if (strcmp(args[0].from, "2") == 0) {
+				info->num_devs = 2;
+			} else {
+				btrfs_err(info,
+					  "numdevs must be \"1\", \"2\", or \"many\".");
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
 		case Opt_ratio:
 			ret = match_int(&args[0], &intarg);
 			if (ret)
@@ -1559,6 +1588,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
 	seq_printf(seq, ",inumbits=%u", info->inumbits);
+	seq_printf(seq, ",numdevs=%u", info->num_devs);
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	subvol_name = btrfs_get_subvol_name_from_objectid(info,



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/4] VFS/btrfs: add STATX_TREE_ID
  2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
  2021-08-09  3:55 ` [PATCH 4/4] Add "tree" number to "inode" number in various /proc files NeilBrown
@ 2021-08-09  3:55 ` NeilBrown
  2021-08-09  3:55 ` [PATCH 1/4] btrfs: include subvol identifier in inode number if -o inumbits= NeilBrown
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2021-08-09  3:55 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

A new 64bit field is added to the data that can be returned by statx() -
the "tree id".
The tree id serves two needs.
1/ it extends the available inode number space.  While a filesystem
   SHOULD ensure the inode number is unique across the filesystem,
   this is sometimes impractical.  In such situations, 'tree id'
   may be used to guarantee uniqueness.  It can identify a separate
   allocation domain.
   A particular case when separate allocation domains is useful
   is when a directory tree can be effectively "reflink"ed.
   Updating all inode numbers in such a tree is prohibitively expensive.
2/ it can identify a collection of objects that provide a coherent
   "tree" in some locally-defined sense.

This patch uses STATX_TREE_ID to export the subvol id for btrfs.

samples/vfs/test_statx.c is extended to report the treeid.

Also: a new superblock field is added: s_tree_id_bits.  This can store
  the number of significant bits in the reported treeid.  It is
  currently unused, but could be used by overlayfs to determine how
  to add a filesystem number to the treeid to differentiate files in
  different underlying filesystems.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/inode.c          |    4 ++++
 fs/btrfs/super.c          |    1 +
 fs/stat.c                 |    1 +
 include/linux/fs.h        |    2 +-
 include/linux/stat.h      |   13 +++++++++++++
 include/uapi/linux/stat.h |    3 ++-
 samples/vfs/test-statx.c  |    4 +++-
 7 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 30fa64cbe6dc..c878726d090c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9215,6 +9215,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	case BTRFS_MANY_DEVS:
 		stat->dev = BTRFS_I(inode)->root->anon_dev;
 	}
+	if (request_mask & STATX_TREE_ID) {
+		stat->tree_id = BTRFS_I(inode)->root->root_key.objectid;
+		stat->result_mask |= STATX_TREE_ID;
+	}
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b1aecb834234..e6d166150660 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1410,6 +1410,7 @@ static int btrfs_fill_super(struct super_block *sb,
 #endif
 	sb->s_flags |= SB_I_VERSION;
 	sb->s_iflags |= SB_I_CGROUPWB;
+	sb->s_tree_id_bits = 48;
 
 	err = super_setup_bdi(sb);
 	if (err) {
diff --git a/fs/stat.c b/fs/stat.c
index 1fa38bdec1a6..2dd5d3d67793 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -580,6 +580,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_dev_major = MAJOR(stat->dev);
 	tmp.stx_dev_minor = MINOR(stat->dev);
 	tmp.stx_mnt_id = stat->mnt_id;
+	tmp.stx_tree_id = stat->tree_id;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..a777c1b1706a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1482,7 +1482,7 @@ struct super_block {
 
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
-
+	short			s_tree_id_bits;
 	/*
 	 * The next field is for VFS *only*. No filesystems have any business
 	 * even looking at it. You had been warned.
diff --git a/include/linux/stat.h b/include/linux/stat.h
index fff27e603814..08ee409786b3 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,19 @@ struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	/* Treeid can be used to extend the inode number space.  Two inodes
+	 * with different 'tree_id' are different, even if 'ino' is the same
+	 * (though fs should make ino different as often as possible).
+	 * When tree_id is requested and STATX_TREE_ID is set in result_mask,
+	 * 'ino' MUST be unique across the filesystem.  Specifically, two
+	 * open files that report the same dev, ino, and tree_id MUST be
+	 * the same.
+	 * If a directory and an object in that directory have the same dev
+	 * and tree_id, they can be assumed to be in a meaningful tree, though
+	 * the meaning is subject to local interpretation.  The set of inodes
+	 * with a common tree_id is not required to be contiguous.
+	 */
+	u64		tree_id;
 };
 
 #endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1500a0f58041..725cf3f8e873 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -124,7 +124,7 @@ struct statx {
 	__u32	stx_dev_minor;
 	/* 0x90 */
 	__u64	stx_mnt_id;
-	__u64	__spare2;
+	__u64	stx_tree_id;
 	/* 0xa0 */
 	__u64	__spare3[12];	/* Spare space for future expansion */
 	/* 0x100 */
@@ -152,6 +152,7 @@ struct statx {
 #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
+#define STATX_TREE_ID		0x00002000U	/* Want/got stx_treeid and clean stX_ino */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..c1141764fa2e 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -118,6 +118,8 @@ static void dump_statx(struct statx *stx)
 			break;
 		}
 	}
+	if (stx->stx_mask & STATX_TREE_ID)
+		printf(" Tree: %-12llu", (unsigned long long) stx->stx_tree_id);
 	printf("\n");
 
 	if (stx->stx_mask & STATX_MODE)
@@ -218,7 +220,7 @@ int main(int argc, char **argv)
 	struct statx stx;
 	int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
 
-	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+	unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_TREE_ID;
 
 	for (argv++; *argv; argv++) {
 		if (strcmp(*argv, "-F") == 0) {



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/4] Add "tree" number to "inode" number in various /proc files.
  2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
@ 2021-08-09  3:55 ` NeilBrown
  2021-08-09  3:55 ` [PATCH 3/4] VFS/btrfs: add STATX_TREE_ID NeilBrown
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2021-08-09  3:55 UTC (permalink / raw)
  To: Josef Bacik, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

Various /proc files reporting locks, inotify status, or memory mappings
currently report device number and inode node.

These are already "broken" for btrfs as the device number is not one
that is reported by "stat()" (though a program could find a way to map a
file to an entry in /proc/self/mountinfo, and get the device number that
way).

This patch changes all the inode number is those files to "tree:inode"
when the treeid is non-zero.  This it only affect btrfs (at this stage),
and then only when mounted with "-o numdevs=1", as in other cases there
is no value in changing the proc files.

As none of these call ->getattr() to get ino or dev, I have added i_tree
to struct inode so they can get it directly from there.  This isn't
ideal, but is consistent with current code.

Programs that looks for dev:ino based in information from stat(), and
which don't crash on "badly" formatted entries will continue to work as
well as they ever did.

Programs which crash when an entry looks wrong should be fixed anyway.

Programs which correlate a file with /proc/self/mountinfo to find the
"real" device number .... would make me sad.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/inode.c     |    6 ++++++
 fs/inode.c           |    1 +
 fs/locks.c           |   12 +++++++++---
 fs/notify/fdinfo.c   |   19 ++++++++++++++-----
 fs/proc/nommu.c      |   11 ++++++++---
 fs/proc/task_mmu.c   |   17 ++++++++++++-----
 fs/proc/task_nommu.c |   11 ++++++++---
 fs/stat.c            |    1 +
 include/linux/fs.h   |    1 +
 9 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c878726d090c..98ba5f32a2b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5787,6 +5787,8 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
 		 * 'root', and should be nearly unique across the filesystem.
 		 */
 		inode->i_ino ^= args->root->inum_overlay;
+	if (args->root && args->root->fs_info->num_devs == 1)
+		inode->i_tree = args->root->root_key.objectid;
 	BTRFS_I(inode)->location.objectid = args->ino;
 	BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
 	BTRFS_I(inode)->location.offset = 0;
@@ -5876,6 +5878,8 @@ static struct inode *new_simple_dir(struct super_block *s,
 	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
 
 	inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
+	if (root->fs_info->num_devs == 1)
+		inode->i_tree = root->root_key.objectid;
 	/*
 	 * We only need lookup, the rest is read-only and there's no inode
 	 * associated with the dentry
@@ -6425,6 +6429,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	inode->i_ino = objectid;
 	if (objectid != root->inum_overlay)
 		inode->i_ino ^= root->inum_overlay;
+	if (root->fs_info->num_devs == 1)
+		inode->i_tree = root->root_key.objectid;
 
 	if (dir && name) {
 		trace_btrfs_inode_request(dir);
diff --git a/fs/inode.c b/fs/inode.c
index c93500d84264..7f62ac35de02 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -142,6 +142,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_op = &empty_iops;
 	inode->i_fop = &no_open_fops;
 	inode->i_ino = 0;
+	inode->i_tree = 0;
 	inode->__i_nlink = 1;
 	inode->i_opflags = 0;
 	if (sb->s_xattr)
diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1dfe8d8..21b28c019052 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2893,9 +2893,15 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 	}
 	if (inode) {
 		/* userspace relies on this representation of dev_t */
-		seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
-				MAJOR(inode->i_sb->s_dev),
-				MINOR(inode->i_sb->s_dev), inode->i_ino);
+		if (inode->i_tree)
+			seq_printf(f, "%d %02x:%02x:%lu:%lu ", fl_pid,
+				   MAJOR(inode->i_sb->s_dev),
+				   MINOR(inode->i_sb->s_dev),
+				   inode->i_tree, inode->i_ino);
+		else
+			seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
+				   MAJOR(inode->i_sb->s_dev),
+				   MINOR(inode->i_sb->s_dev), inode->i_ino);
 	} else {
 		seq_printf(f, "%d <none>:0 ", fl_pid);
 	}
diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index 57f0d5d9f934..4e8a363d171b 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -90,9 +90,13 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
 		 * used only internally to the kernel.
 		 */
 		u32 mask = mark->mask & IN_ALL_EVENTS;
-		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
-			   inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
-			   mask, mark->ignored_mask);
+		seq_printf(m, "inotify wd:%x ", inode_mark->wd);
+		if (inode->i_tree)
+			seq_printf(m, "ino:%lx:%lx ", inode->i_tree, inode->i_ino);
+		else
+			seq_printf(m, "ino:%lx ", inode->i_ino);
+		seq_printf(m, "sdev:%x mask:%x ignored_mask:%x ",
+			   inode->i_sb->s_dev, mask, mark->ignored_mask);
 		show_mark_fhandle(m, inode);
 		seq_putc(m, '\n');
 		iput(inode);
@@ -120,8 +124,13 @@ static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
 		inode = igrab(fsnotify_conn_inode(mark->connector));
 		if (!inode)
 			return;
-		seq_printf(m, "fanotify ino:%lx sdev:%x mflags:%x mask:%x ignored_mask:%x ",
-			   inode->i_ino, inode->i_sb->s_dev,
+		if (inode->i_tree)
+			seq_printf(m, "fanotify ino:%lx:%lx", inode->i_tree,
+				   inode->i_ino);
+		else
+			seq_printf(m, "fanotify ino:%lx", inode->i_ino);
+		seq_printf(m, " sdev:%x mflags:%x mask:%x ignored_mask:%x ",
+			   inode->i_sb->s_dev,
 			   mflags, mark->mask, mark->ignored_mask);
 		show_mark_fhandle(m, inode);
 		seq_putc(m, '\n');
diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c
index 13452b32e2bd..371caf60d4a4 100644
--- a/fs/proc/nommu.c
+++ b/fs/proc/nommu.c
@@ -31,7 +31,7 @@
  */
 static int nommu_region_show(struct seq_file *m, struct vm_region *region)
 {
-	unsigned long ino = 0;
+	unsigned long ino = 0, tree = 0;
 	struct file *file;
 	dev_t dev = 0;
 	int flags;
@@ -43,11 +43,12 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region)
 		struct inode *inode = file_inode(region->vm_file);
 		dev = inode->i_sb->s_dev;
 		ino = inode->i_ino;
+		tree = inode->i_tree;
 	}
 
 	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
 	seq_printf(m,
-		   "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+		   "%08lx-%08lx %c%c%c%c %08llx %02x:%02x ",
 		   region->vm_start,
 		   region->vm_end,
 		   flags & VM_READ ? 'r' : '-',
@@ -55,7 +56,11 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region)
 		   flags & VM_EXEC ? 'x' : '-',
 		   flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
 		   ((loff_t)region->vm_pgoff) << PAGE_SHIFT,
-		   MAJOR(dev), MINOR(dev), ino);
+		   MAJOR(dev), MINOR(dev));
+	if (tree)
+		seq_printf(m, "%lu:%lu ", tree, ino);
+	else
+		seq_printf(m, "%lu ", ino);
 
 	if (file) {
 		seq_pad(m, ' ');
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..9e6439d7939b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -250,7 +250,8 @@ static int is_stack(struct vm_area_struct *vma)
 static void show_vma_header_prefix(struct seq_file *m,
 				   unsigned long start, unsigned long end,
 				   vm_flags_t flags, unsigned long long pgoff,
-				   dev_t dev, unsigned long ino)
+				   dev_t dev, unsigned long ino,
+				   unsigned long tree)
 {
 	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
 	seq_put_hex_ll(m, NULL, start, 8);
@@ -263,7 +264,12 @@ static void show_vma_header_prefix(struct seq_file *m,
 	seq_put_hex_ll(m, " ", pgoff, 8);
 	seq_put_hex_ll(m, " ", MAJOR(dev), 2);
 	seq_put_hex_ll(m, ":", MINOR(dev), 2);
-	seq_put_decimal_ull(m, " ", ino);
+	if (tree) {
+		seq_put_decimal_ull(m, " ", tree);
+		seq_put_decimal_ull(m, ":", ino);
+	} else {
+		seq_put_decimal_ull(m, " ", ino);
+	}
 	seq_putc(m, ' ');
 }
 
@@ -273,7 +279,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	struct mm_struct *mm = vma->vm_mm;
 	struct file *file = vma->vm_file;
 	vm_flags_t flags = vma->vm_flags;
-	unsigned long ino = 0;
+	unsigned long ino = 0, tree = 0;
 	unsigned long long pgoff = 0;
 	unsigned long start, end;
 	dev_t dev = 0;
@@ -283,12 +289,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 		struct inode *inode = file_inode(vma->vm_file);
 		dev = inode->i_sb->s_dev;
 		ino = inode->i_ino;
+		tree = inode->i_tree;
 		pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	}
 
 	start = vma->vm_start;
 	end = vma->vm_end;
-	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
+	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino, tree);
 
 	/*
 	 * Print the dentry name for named mappings, and a
@@ -934,7 +941,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 	}
 
 	show_vma_header_prefix(m, priv->mm->mmap->vm_start,
-			       last_vma_end, 0, 0, 0, 0);
+			       last_vma_end, 0, 0, 0, 0, 0);
 	seq_pad(m, ' ');
 	seq_puts(m, "[rollup]\n");
 
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a6d21fc0033c..c33d7aad3927 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -145,7 +145,7 @@ static int is_stack(struct vm_area_struct *vma)
 static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long ino = 0;
+	unsigned long ino = 0, tree = 0;
 	struct file *file;
 	dev_t dev = 0;
 	int flags;
@@ -158,12 +158,13 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
 		struct inode *inode = file_inode(vma->vm_file);
 		dev = inode->i_sb->s_dev;
 		ino = inode->i_ino;
+		tree = inode->i_tree;
 		pgoff = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
 	}
 
 	seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
 	seq_printf(m,
-		   "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+		   "%08lx-%08lx %c%c%c%c %08llx %02x:%02x ",
 		   vma->vm_start,
 		   vma->vm_end,
 		   flags & VM_READ ? 'r' : '-',
@@ -171,7 +172,11 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
 		   flags & VM_EXEC ? 'x' : '-',
 		   flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
 		   pgoff,
-		   MAJOR(dev), MINOR(dev), ino);
+		   MAJOR(dev), MINOR(dev));
+	if (tree)
+		seq_printf(m, "%lu:%lu ", ino, tree);
+	else
+		seq_printf(m, "%lu ", ino);
 
 	if (file) {
 		seq_pad(m, ' ');
diff --git a/fs/stat.c b/fs/stat.c
index 2dd5d3d67793..4aa402858f64 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -45,6 +45,7 @@ void generic_fillattr(struct user_namespace *mnt_userns, struct inode *inode,
 {
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = inode->i_ino;
+	stat->tree_id = inode->i_tree;
 	stat->mode = inode->i_mode;
 	stat->nlink = inode->i_nlink;
 	stat->uid = i_uid_into_mnt(mnt_userns, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a777c1b1706a..86dc586c408b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -630,6 +630,7 @@ struct inode {
 
 	/* Stat data, not accessed from path walking */
 	unsigned long		i_ino;
+	unsigned long		i_tree;
 	/*
 	 * Filesystems may only read i_nlink directly.  They shall use the
 	 * following functions for modification:



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] btrfs: add numdevs= mount option.
  2021-08-09  3:55 ` [PATCH 2/4] btrfs: add numdevs= mount option NeilBrown
@ 2021-08-09  7:50   ` kernel test robot
  0 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-08-09  7:50 UTC (permalink / raw)
  To: NeilBrown, Josef Bacik, Chris Mason, David Sterba
  Cc: clang-built-linux, kbuild-all, linux-fsdevel, Linux NFS list,
	Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 5847 bytes --]

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on ext3/fsnotify linus/master v5.14-rc5 next-20210806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Attempt-to-make-progress-with-btrfs-dev-number-strangeness/20210809-120046
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-c001-20210809 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c5c3cdb9c92895a63993cee70d2dd776ff9519c3)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/c5bae87ed5b72b9fd999fa935f477483da001f63
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Attempt-to-make-progress-with-btrfs-dev-number-strangeness/20210809-120046
        git checkout c5bae87ed5b72b9fd999fa935f477483da001f63
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/ioctl.c:740:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (fs_info->num_devs == BTRFS_MANY_DEVS)
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/ioctl.c:742:6: note: uninitialized use occurs here
           if (ret < 0)
               ^~~
   fs/btrfs/ioctl.c:740:2: note: remove the 'if' if its condition is always true
           if (fs_info->num_devs == BTRFS_MANY_DEVS)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/ioctl.c:725:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 warning generated.


vim +740 fs/btrfs/ioctl.c

   716	
   717	static int create_snapshot(struct btrfs_root *root, struct inode *dir,
   718				   struct dentry *dentry, bool readonly,
   719				   struct btrfs_qgroup_inherit *inherit)
   720	{
   721		struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb);
   722		struct inode *inode;
   723		struct btrfs_pending_snapshot *pending_snapshot;
   724		struct btrfs_trans_handle *trans;
   725		int ret;
   726	
   727		if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
   728			return -EINVAL;
   729	
   730		if (atomic_read(&root->nr_swapfiles)) {
   731			btrfs_warn(fs_info,
   732				   "cannot snapshot subvolume with active swapfile");
   733			return -ETXTBSY;
   734		}
   735	
   736		pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL);
   737		if (!pending_snapshot)
   738			return -ENOMEM;
   739	
 > 740		if (fs_info->num_devs == BTRFS_MANY_DEVS)
   741			ret = get_anon_bdev(&pending_snapshot->anon_dev);
   742		if (ret < 0)
   743			goto free_pending;
   744		pending_snapshot->root_item = kzalloc(sizeof(struct btrfs_root_item),
   745				GFP_KERNEL);
   746		pending_snapshot->path = btrfs_alloc_path();
   747		if (!pending_snapshot->root_item || !pending_snapshot->path) {
   748			ret = -ENOMEM;
   749			goto free_pending;
   750		}
   751	
   752		btrfs_init_block_rsv(&pending_snapshot->block_rsv,
   753				     BTRFS_BLOCK_RSV_TEMP);
   754		/*
   755		 * 1 - parent dir inode
   756		 * 2 - dir entries
   757		 * 1 - root item
   758		 * 2 - root ref/backref
   759		 * 1 - root of snapshot
   760		 * 1 - UUID item
   761		 */
   762		ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root,
   763						&pending_snapshot->block_rsv, 8,
   764						false);
   765		if (ret)
   766			goto free_pending;
   767	
   768		pending_snapshot->dentry = dentry;
   769		pending_snapshot->root = root;
   770		pending_snapshot->readonly = readonly;
   771		pending_snapshot->dir = dir;
   772		pending_snapshot->inherit = inherit;
   773	
   774		trans = btrfs_start_transaction(root, 0);
   775		if (IS_ERR(trans)) {
   776			ret = PTR_ERR(trans);
   777			goto fail;
   778		}
   779	
   780		spin_lock(&fs_info->trans_lock);
   781		list_add(&pending_snapshot->list,
   782			 &trans->transaction->pending_snapshots);
   783		spin_unlock(&fs_info->trans_lock);
   784	
   785		ret = btrfs_commit_transaction(trans);
   786		if (ret)
   787			goto fail;
   788	
   789		ret = pending_snapshot->error;
   790		if (ret)
   791			goto fail;
   792	
   793		ret = btrfs_orphan_cleanup(pending_snapshot->snap);
   794		if (ret)
   795			goto fail;
   796	
   797		inode = btrfs_lookup_dentry(d_inode(dentry->d_parent), dentry);
   798		if (IS_ERR(inode)) {
   799			ret = PTR_ERR(inode);
   800			goto fail;
   801		}
   802	
   803		d_instantiate(dentry, inode);
   804		ret = 0;
   805		pending_snapshot->anon_dev = 0;
   806	fail:
   807		/* Prevent double freeing of anon_dev */
   808		if (ret && pending_snapshot->snap)
   809			pending_snapshot->snap->anon_dev = 0;
   810		btrfs_put_root(pending_snapshot->snap);
   811		btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
   812	free_pending:
   813		if (pending_snapshot->anon_dev)
   814			free_anon_bdev(pending_snapshot->anon_dev);
   815		kfree(pending_snapshot->root_item);
   816		btrfs_free_path(pending_snapshot->path);
   817		kfree(pending_snapshot);
   818	
   819		return ret;
   820	}
   821	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32874 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
  2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
                   ` (3 preceding siblings ...)
  2021-08-09  3:55 ` [PATCH 2/4] btrfs: add numdevs= mount option NeilBrown
@ 2021-08-10 20:51 ` Josef Bacik
  2021-08-11 22:13   ` NeilBrown
  4 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2021-08-10 20:51 UTC (permalink / raw)
  To: NeilBrown, Chris Mason, David Sterba
  Cc: linux-fsdevel, Linux NFS list, Btrfs BTRFS

On 8/8/21 11:55 PM, NeilBrown wrote:
> I continue to search for a way forward for btrfs so that its behaviour
> with respect to device numbers and subvols is somewhat coherent.
> 
> This series implements some of the ideas in my "A Third perspective"[1],
> though with changes is various details.
> 
> I introduce two new mount options, which default to
> no-change-in-behaviour.
> 
>   -o inumbits=  causes inode numbers to be more unique across a whole btrfs
>                 filesystem, and is many cases completely unique.  Mounting
>                 with "-i inumbits=56" will resolve the NFS issues that
>                 started me tilting at this particular windmill.
> 
>   -o numdevs=  can reduce the number of distinct devices reported by
>                stat(), either to 2 or to 1.
>                Both ease problems for sites that exhaust their supply of
>                device numbers.
>                '2' allows "du -x" to continue to work, but is otherwise
>                rather strange.
>                '1' breaks the use of "du -x" and similar to examine a
>                single subvol which might have subvol descendants, but
>                provides generally sane behaviour
>                "-o numdevs=1" also forces inumbits to have a useful value.
> 
> I introduce a "tree id" which can be discovered using statx().  Two
> files with the same dev and ino might still be different if the tree-ids
> are different.  Connected files with the same tree-id may be usefully
> considered to be related.
> 
> I also change various /proc files (only when numdevs=1 is used) to
> provide extra information so they are useful with btrfs despite subvols.
> /proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
> The inode number becomes "XX:YY" where XX is the subvol number (tree id)
> and YY is the inode number.
> 
> An alternate might be to report a number which might use up to 128 bits.
> Which is less likely to seriously break code?
> 
> Note that code which ignores badly formatted lines is safe, because it
> will never currently find a match for a btrfs file in these files
> anyway.  The device number they report is never returned in st_dev for
> stat() on any file.
> 
> The audit subsystem and one or two other places report dev/ino and so
> need enhanced, but I haven't tried to address those.
> 
> Various trace points also report dev/ino.  I haven't tried thinking
> about those either.

I think this is a step in the right direction, but I want to figure out a way to 
accomplish this without magical mount points that users must be aware of.

I think the stat() st_dev ship as sailed, we're stuck with that.  However 
Christoph does have a valid point where it breaks the various info spit out by 
/proc.  You've done a good job with the treeid here, but it still makes it 
impossible for somebody to map the st_dev back to the correct mount.

I think we aren't going to solve that problem, at least not with stat().  I 
think with statx() spitting out treeid we have given userspace a way to 
differentiate subvolumes, and so we should fix statx() to spit out the the super 
block device, that way new userspace things can do their appropriate lookup if 
they so choose.

This leaves the problem of nfsd.  Can you just integrate this new treeid into 
nfsd, and use that to either change the ino within nfsd itself, or do something 
similar to what your first patchset did and generate a fsid based on the treeid?

Mount options are messy, and are just going to lead to distro's turning them on 
without understanding what's going on and then we have to support them forever. 
  I want to get this fixed in a way that we all hate the least with as little 
opportunity for confused users to make bad decisions.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
  2021-08-10 20:51 ` [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness Josef Bacik
@ 2021-08-11 22:13   ` NeilBrown
  2021-08-12 13:54     ` Josef Bacik
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2021-08-11 22:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, David Sterba, linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Wed, 11 Aug 2021, Josef Bacik wrote:
> 
> I think this is a step in the right direction, but I want to figure out a way to 
> accomplish this without magical mount points that users must be aware of.

magic mount *options* ???

> 
> I think the stat() st_dev ship as sailed, we're stuck with that.  However 
> Christoph does have a valid point where it breaks the various info spit out by 
> /proc.  You've done a good job with the treeid here, but it still makes it 
> impossible for somebody to map the st_dev back to the correct mount.

The ship might have sailed, but it is not water tight.  And as the world
it round, it can still come back to bite us from behind.
Anything can be transitioned away from, whether it is devfs or 32-bit
time or giving different device numbers to different file-trees.

The linkage between device number and and filesystem is quite strong.
We could modified all of /proc and /sys/ and audit and whatever else to
report the fake device number, but we cannot get the fake device number
into the mount table (without making the mount table unmanageablely
large).  
And if subtrees aren't in the mount-table for the NFS server, I don't
think they should be in the mount-table of the NFS client.  So we cannot
export them to NFS.

I understand your dislike for mount options.  An alternative with
different costs and benefits would be to introduce a new filesystem type
- btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
whatever we decided was best with inode numbers.  How much would you
hate that?

> 
> I think we aren't going to solve that problem, at least not with stat().  I 
> think with statx() spitting out treeid we have given userspace a way to 
> differentiate subvolumes, and so we should fix statx() to spit out the the super 
> block device, that way new userspace things can do their appropriate lookup if 
> they so choose.

I don't think we should normalize having multiple devnums per filesystem
by encoding it in statx().  It *would* make sense to add a btrfs ioctl
which reports the real device number of a file.  Tools that really need
to work with btrfs could use that, but it would always be obvious that
it was an exception.

> 
> This leaves the problem of nfsd.  Can you just integrate this new treeid into 
> nfsd, and use that to either change the ino within nfsd itself, or do something 
> similar to what your first patchset did and generate a fsid based on the treeid?

I would only want nfsd to change the inode number.  I no longer think it
is acceptable for nfsd to report different device number (as I mention
above).
I would want the new inode number to be explicitly provided by the
filesystem.  Whether that is a new export_operation or a new field in
'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
but I can live without that.

On the topic of inode numbers....  I've recently learned that btrfs
never reuses inode (objectid) numbers (except possibly after an
unmount).  Equally it doesn't re-use subvol numbers.  How much does this
contribute to the 64 bits not being enough for subtree+inode?

It would be nice if we could be comfortable limiting the objectid number
to 40 bits and the root.objectid (filetree) number to 24 bits, and
combine them into a 64bit inode number.

If we added a inode number reuse scheme that was suitably performant,
would that make this possible?  That would remove the need for a treeid,
and allow us to use project-id to identify subtrees.

> 
> Mount options are messy, and are just going to lead to distro's turning them on 
> without understanding what's going on and then we have to support them forever. 
>   I want to get this fixed in a way that we all hate the least with as little 
> opportunity for confused users to make bad decisions.  Thanks,

Hence my question: how much do you hate creating a new filesystem type
to fix the problems?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
  2021-08-11 22:13   ` NeilBrown
@ 2021-08-12 13:54     ` Josef Bacik
  2021-08-12 14:06       ` Hugo Mills
  2021-08-12 22:35       ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Josef Bacik @ 2021-08-12 13:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chris Mason, David Sterba, linux-fsdevel, Linux NFS list, Btrfs BTRFS

On 8/11/21 6:13 PM, NeilBrown wrote:
> On Wed, 11 Aug 2021, Josef Bacik wrote:
>>
>> I think this is a step in the right direction, but I want to figure out a way to
>> accomplish this without magical mount points that users must be aware of.
> 
> magic mount *options* ???
> 
>>
>> I think the stat() st_dev ship as sailed, we're stuck with that.  However
>> Christoph does have a valid point where it breaks the various info spit out by
>> /proc.  You've done a good job with the treeid here, but it still makes it
>> impossible for somebody to map the st_dev back to the correct mount.
> 
> The ship might have sailed, but it is not water tight.  And as the world
> it round, it can still come back to bite us from behind.
> Anything can be transitioned away from, whether it is devfs or 32-bit
> time or giving different device numbers to different file-trees.
> 
> The linkage between device number and and filesystem is quite strong.
> We could modified all of /proc and /sys/ and audit and whatever else to
> report the fake device number, but we cannot get the fake device number
> into the mount table (without making the mount table unmanageablely
> large).
> And if subtrees aren't in the mount-table for the NFS server, I don't
> think they should be in the mount-table of the NFS client.  So we cannot
> export them to NFS.
> 
> I understand your dislike for mount options.  An alternative with
> different costs and benefits would be to introduce a new filesystem type
> - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> whatever we decided was best with inode numbers.  How much would you
> hate that?
> 

A lot more ;).

>>
>> I think we aren't going to solve that problem, at least not with stat().  I
>> think with statx() spitting out treeid we have given userspace a way to
>> differentiate subvolumes, and so we should fix statx() to spit out the the super
>> block device, that way new userspace things can do their appropriate lookup if
>> they so choose.
> 
> I don't think we should normalize having multiple devnums per filesystem
> by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> which reports the real device number of a file.  Tools that really need
> to work with btrfs could use that, but it would always be obvious that
> it was an exception.

That's not what I'm saying.  I'm saying that stat() continues to behave the way 
it currently does, for legacy users.

And then for statx() it returns the correct devnum like any other file system, 
with the augmentation of the treeid so that future userspace programs can use 
the treeid to decide if they want to wander into a subvolume.

This way moving forward we have a way to map back to a mount point because 
statx() will return the actual devnum for the mountpoint, and then we can use 
the treeid to be smart about when we wander into a subvolume.

And if we're going to add a treeid, I would actually like to add a parent_treeid 
as well so we could tell if we're a snapshot or just a normal subvolume.

> 
>>
>> This leaves the problem of nfsd.  Can you just integrate this new treeid into
>> nfsd, and use that to either change the ino within nfsd itself, or do something
>> similar to what your first patchset did and generate a fsid based on the treeid?
> 
> I would only want nfsd to change the inode number.  I no longer think it
> is acceptable for nfsd to report different device number (as I mention
> above).
> I would want the new inode number to be explicitly provided by the
> filesystem.  Whether that is a new export_operation or a new field in
> 'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
> but I can live without that.
>

Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you 
could accomplish the same behavior without the mount options.  We add either a 
new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb 
or the fs_type, and then NFS does the inode number magic transformation 
automatically and we are good to go.

> On the topic of inode numbers....  I've recently learned that btrfs
> never reuses inode (objectid) numbers (except possibly after an
> unmount).  Equally it doesn't re-use subvol numbers.  How much does this
> contribute to the 64 bits not being enough for subtree+inode?
> 
> It would be nice if we could be comfortable limiting the objectid number
> to 40 bits and the root.objectid (filetree) number to 24 bits, and
> combine them into a 64bit inode number.
> 
> If we added a inode number reuse scheme that was suitably performant,
> would that make this possible?  That would remove the need for a treeid,
> and allow us to use project-id to identify subtrees.
> 

We had a resuse scheme, we deprecated and deleted it.  I don't want to 
arbitrarily limit objectid's to work around this issue.

>>
>> Mount options are messy, and are just going to lead to distro's turning them on
>> without understanding what's going on and then we have to support them forever.
>>    I want to get this fixed in a way that we all hate the least with as little
>> opportunity for confused users to make bad decisions.  Thanks,
> 
> Hence my question: how much do you hate creating a new filesystem type
> to fix the problems?
> 

I'm still not convinced we can't solve this without adding new options or 
fstypes.  I think flags to indicate that we're special and to use a treeid that 
we stuff into the inode would be a reasonable solution.  That being said I'm a 
little sleep deprived so I could be missing why my plan is a bad one, so I'm 
willing to be convinced that mount options are the solution to this, but I want 
to make sure we're damned certain that's the best way forward.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
  2021-08-12 13:54     ` Josef Bacik
@ 2021-08-12 14:06       ` Hugo Mills
  2021-08-12 22:35       ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2021-08-12 14:06 UTC (permalink / raw)
  To: Josef Bacik
  Cc: NeilBrown, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Thu, Aug 12, 2021 at 09:54:54AM -0400, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> > > 
> > > I think this is a step in the right direction, but I want to figure out a way to
> > > accomplish this without magical mount points that users must be aware of.
> > 
> > magic mount *options* ???
> > 
> > > 
> > > I think the stat() st_dev ship as sailed, we're stuck with that.  However
> > > Christoph does have a valid point where it breaks the various info spit out by
> > > /proc.  You've done a good job with the treeid here, but it still makes it
> > > impossible for somebody to map the st_dev back to the correct mount.
> > 
> > The ship might have sailed, but it is not water tight.  And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> > 
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client.  So we cannot
> > export them to NFS.
> > 
> > I understand your dislike for mount options.  An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers.  How much would you
> > hate that?
> > 
> 
> A lot more ;).
> 
> > > 
> > > I think we aren't going to solve that problem, at least not with stat().  I
> > > think with statx() spitting out treeid we have given userspace a way to
> > > differentiate subvolumes, and so we should fix statx() to spit out the the super
> > > block device, that way new userspace things can do their appropriate lookup if
> > > they so choose.
> > 
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file.  Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
> 
> That's not what I'm saying.  I'm saying that stat() continues to behave the
> way it currently does, for legacy users.
> 
> And then for statx() it returns the correct devnum like any other file
> system, with the augmentation of the treeid so that future userspace
> programs can use the treeid to decide if they want to wander into a
> subvolume.
> 
> This way moving forward we have a way to map back to a mount point because
> statx() will return the actual devnum for the mountpoint, and then we can
> use the treeid to be smart about when we wander into a subvolume.
> 
> And if we're going to add a treeid, I would actually like to add a
> parent_treeid as well so we could tell if we're a snapshot or just a normal
> subvolume.

   Can I make a request to call it something other than a
"parent". There's at least three different usages of "parent" for
three different concepts related to subvolumes in btrfs(*), and it'd
be nice to avoid the inevitable confusion.

(*) 1. "subvolume containing this one",
    2. "subvolume that was snapshotted to make this one", and,
    3. at least informally, "subvolume that was sent/received to make this one"

   Hugo.

[snip to end]

-- 
Hugo Mills             | Reading Mein Kampf won't make you a Nazi. Reading
hugo@... carfax.org.uk | Das Kapital won't make you a communist. But most
http://carfax.org.uk/  | trolls started out with a copy of Lord of the Rings.
PGP: E2AB1DE4          |

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.
  2021-08-12 13:54     ` Josef Bacik
  2021-08-12 14:06       ` Hugo Mills
@ 2021-08-12 22:35       ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2021-08-12 22:35 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Chris Mason, David Sterba, linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Thu, 12 Aug 2021, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> >>
> >> I think this is a step in the right direction, but I want to figure out a way to
> >> accomplish this without magical mount points that users must be aware of.
> > 
> > magic mount *options* ???
> > 
> >>
> >> I think the stat() st_dev ship as sailed, we're stuck with that.  However
> >> Christoph does have a valid point where it breaks the various info spit out by
> >> /proc.  You've done a good job with the treeid here, but it still makes it
> >> impossible for somebody to map the st_dev back to the correct mount.
> > 
> > The ship might have sailed, but it is not water tight.  And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> > 
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client.  So we cannot
> > export them to NFS.
> > 
> > I understand your dislike for mount options.  An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs.  This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers.  How much would you
> > hate that?
> > 
> 
> A lot more ;).
> 
> >>
> >> I think we aren't going to solve that problem, at least not with stat().  I
> >> think with statx() spitting out treeid we have given userspace a way to
> >> differentiate subvolumes, and so we should fix statx() to spit out the the super
> >> block device, that way new userspace things can do their appropriate lookup if
> >> they so choose.
> > 
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx().  It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file.  Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
> 
> That's not what I'm saying.  I'm saying that stat() continues to behave the way 
> it currently does, for legacy users.
> 
> And then for statx() it returns the correct devnum like any other file system, 
> with the augmentation of the treeid so that future userspace programs can use 
> the treeid to decide if they want to wander into a subvolume.

Yes, that is what I thought you were saying.  It implies that the
possibility of a file having two different device numbers becomes
normalised in the API - one returned by stat(), the other by statx()
(presumably in a new field - the FS cannot tell what libc call the
application made).  I don't like that.

> 
> This way moving forward we have a way to map back to a mount point because 
> statx() will return the actual devnum for the mountpoint, and then we can use 
> the treeid to be smart about when we wander into a subvolume.

We already have a way to map back to a mountpoint.  statx reports a
mnt_id with result flag STATX_MNT_ID.  This is the number at the start
of the line in mountinfo.  Hmmm, this isn't in the manpage.  It has been
in the kernel since Linux 5.8.  I'll send a patch for the manpage.

So we could pursue a path where the device-id no longer defines the
filesystem (or mount), but instead it defines some arbitrary grouping of
objects within a filesystem.  So instead of my proposed
   dev-id  /  subtree-id / inode-number
we would have
   dev-id-in-mountinfo / mnt_id / dev-id-in-stat / inode-number

In some ways this would be a smoother path forward - no change to statx,
no new concepts, just formalizing some de-facto concepts.
In other ways it might be rougher - we would need to convince the
community to use the stat() dev-id in all those proc files etc.

I think having the two meanings for a device-id would cause confusion for
quite some years..... but then any change will probably cause confusion.

> 
> And if we're going to add a treeid, I would actually like to add a parent_treeid 
> as well so we could tell if we're a snapshot or just a normal subvolume.

Is this a well-defined concept? Isn't "snapshot" just one possible
use-case for the btrfs functionality of creating a reflink to a subtree?
What happens to the "parent_treeid" reference when that "parent" gets
deleted?

I understand the desire to track this sort of connection, but I wonder
if the filesystem is really the right place to track it.  Maybe having
the tools track it would be better.

> 
> > 
> >>
> >> This leaves the problem of nfsd.  Can you just integrate this new treeid into
> >> nfsd, and use that to either change the ino within nfsd itself, or do something
> >> similar to what your first patchset did and generate a fsid based on the treeid?
> > 
> > I would only want nfsd to change the inode number.  I no longer think it
> > is acceptable for nfsd to report different device number (as I mention
> > above).
> > I would want the new inode number to be explicitly provided by the
> > filesystem.  Whether that is a new export_operation or a new field in
> > 'struct kstat' doesn't really bother me.  I'd *prefer* it to be st_ino,
> > but I can live without that.
> >
> 
> Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you 
> could accomplish the same behavior without the mount options.  We add either a 
> new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb 
> or the fs_type, and then NFS does the inode number magic transformation 
> automatically and we are good to go.

I really don't want nfsd to do the magic transformations.  I want the
filesystem to do those if they need to be done.  I could cope with nfsd
xor-ing some provided number with i_ino, but I wouldn't like nfsd to
have the responsibility of doing the swab64().

> 
> > On the topic of inode numbers....  I've recently learned that btrfs
> > never reuses inode (objectid) numbers (except possibly after an
> > unmount).  Equally it doesn't re-use subvol numbers.  How much does this
> > contribute to the 64 bits not being enough for subtree+inode?
> > 
> > It would be nice if we could be comfortable limiting the objectid number
> > to 40 bits and the root.objectid (filetree) number to 24 bits, and
> > combine them into a 64bit inode number.
> > 
> > If we added a inode number reuse scheme that was suitably performant,
> > would that make this possible?  That would remove the need for a treeid,
> > and allow us to use project-id to identify subtrees.
> > 
> 
> We had a resuse scheme, we deprecated and deleted it.  I don't want to 
> arbitrarily limit objectid's to work around this issue.

These are computers we are working with.  There are always arbitrary
limits.
The syscall interface places an arbitrary limit of 64bits on the
identity of any object in a filesystem.  btrfs clearly doesn't like that
arbitrary limit, and plays games with device number to increase it to a
new arbitrary limit of 84 bits (sort-of).

I'm fully open to the possibility that last year's arbitrary limits are
no longer comfortable and that we might need to push the boundaries.
But I'd rather the justification was a bit stronger than "we cannot be
bothered reusing old inode numbers".

Are you at all aware of any site coming anywhere vaguely close to one trillion
concurrent inodes - maybe even 16 billion?
Or anything close to 16 million concurrent subvolumes?

> 
> >>
> >> Mount options are messy, and are just going to lead to distro's turning them on
> >> without understanding what's going on and then we have to support them forever.
> >>    I want to get this fixed in a way that we all hate the least with as little
> >> opportunity for confused users to make bad decisions.  Thanks,
> > 
> > Hence my question: how much do you hate creating a new filesystem type
> > to fix the problems?
> > 
> 
> I'm still not convinced we can't solve this without adding new options or 
> fstypes.  I think flags to indicate that we're special and to use a treeid that 
> we stuff into the inode would be a reasonable solution.  That being said I'm a 
> little sleep deprived so I could be missing why my plan is a bad one, so I'm 
> willing to be convinced that mount options are the solution to this, but I want 
> to make sure we're damned certain that's the best way forward.  Thanks,

I don't think "best way forward" is the appropriate goal - impossible to
assess.

What we need is a chosen way forward.  Someone - and ultimately that
someone needs to be the BTRFS maintainer team - needs to decide what
breakage they are willing to bear the cost of, and what breakage is
unacceptable to them, and to choose a way to move forward.  I cannot
make that decision for you because I'm just an interested bystander.  Al
Viro and Linus cannot either, though they are in a position to veto some
decisions.

The current choice appears to be "ignore the problem and hope it goes
away", though I appreciate that appearances can be deceiving.

You appear very keen to preserve as much of the status quo as possible.
Given that, I think you really need to push to get all the procfs files
changed to use the same device number as stat - so push the patch which
SUSE has that add inode_get_dev().

https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

(though the change to show_mountinfo() in that patch would need careful consideration).

If that lands, you have a clear way forward, and we can find some
solution for NFSd (and other network filesystems), and for user-space to
use mnt_id.
If you cannot overcome the pushback, then you know you will have to
find another path - make a 64bit inode number unique, or add more bits
to the effective inode number.  Or something.

NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-08-12 22:35 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-09  3:55 [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness NeilBrown
2021-08-09  3:55 ` [PATCH 4/4] Add "tree" number to "inode" number in various /proc files NeilBrown
2021-08-09  3:55 ` [PATCH 3/4] VFS/btrfs: add STATX_TREE_ID NeilBrown
2021-08-09  3:55 ` [PATCH 1/4] btrfs: include subvol identifier in inode number if -o inumbits= NeilBrown
2021-08-09  3:55 ` [PATCH 2/4] btrfs: add numdevs= mount option NeilBrown
2021-08-09  7:50   ` kernel test robot
2021-08-10 20:51 ` [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness Josef Bacik
2021-08-11 22:13   ` NeilBrown
2021-08-12 13:54     ` Josef Bacik
2021-08-12 14:06       ` Hugo Mills
2021-08-12 22:35       ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).