All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: NeilBrown <neilb@suse.de>
Cc: Wang Yugui <wangyugui@e16-tech.com>,
	Christoph Hellwig <hch@infradead.org>,
	Josef Bacik <josef@toxicpanda.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Chuck Lever <chuck.lever@oracle.com>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Wed, 18 Aug 2021 22:19:10 -0400	[thread overview]
Message-ID: <20210819021910.GB29026@hungrycats.org> (raw)
In-Reply-To: <162932318266.9892.13600254282844823374@noble.neil.brown.name>

On Thu, Aug 19, 2021 at 07:46:22AM +1000, NeilBrown wrote:
> On Thu, 19 Aug 2021, Wang Yugui wrote:
> > Hi,
> > 
> > We use  'swab64' to combinate 'subvol id' and 'inode' into 64bit in this
> > patch.
> > 
> > case1:
> > 'subvol id': 16bit => 64K, a little small because the subvol id is
> > always increase?
> > 'inode':	48bit * 4K per node, this is big enough.
> > 
> > case2:
> > 'subvol id': 24bit => 16M,  this is big enough.
> > 'inode':	40bit * 4K per node => 4 PB.  this is a little small?
> 
> I don't know what point you are trying to make with the above.
> 
> > 
> > Is there a way to 'bit-swap' the subvol id, rather the current byte-swap?
> 
> Sure:
>    for (i=0; i<64; i++) {
>         new = (new << 1) | (old & 1)
>         old >>= 1;
>    }
> 
> but would it gain anything significant?
> 
> Remember what the goal is.  Most apps don't care at all about duplicate
> inode numbers - only a few do, and they only care about a few inodes.
> The only bug I actually have a report of is caused by a directory having
> the same inode as an ancestor.  i.e.  in lots of cases, duplicate inode
> numbers won't be noticed.

rsync -H and cpio's hardlink detection can be badly confused.  They will
think distinct files with the same inode number are hardlinks.  This could
be bad if you were making backups (though if you're making backups over
NFS, you are probably doing something that could be done better in a
different way).

> The behaviour of btrfs over NFS RELIABLY causes exactly this behaviour
> of a directory having the same inode number as an ancestor.  The root of
> a subtree will *always* do this.  If we JUST changed the inode numbers
> of the roots of subtrees, then most observed problems would go away.  It
> would change from "trivial to reproduce" to "rarely happens".  The patch
> I actually propose makes it much more unlikely than that.  Even if
> duplicate inode numbers do happen, the chance of them being noticed is
> infinitesimal.  Given that, there is no point in minor tweaks unless
> they can make duplicate inode numbers IMPOSSIBLE.

That's a good argument.  I have a different one with the same conclusion.

40 bit inodes would take about 20 years to collide with 24-bit subvols--if
you are creating an average of 1742 inodes every second.  Also at the
same time you have to be creating a subvol every 37 seconds to occupy
the colliding 25th bit of the subvol ID.  Only the highest inode number
in any subvol counts--if your inode creation is spread out over several
different subvols, you'll need to make inodes even faster.

For reference, my high scores are 17 inodes per second and a subvol
every 595 seconds (averaged over 1 year).  Burst numbers are much higher,
but one has to spend some time _reading_ the files now and then.

I've encountered other btrfs users with two orders of magnitude higher
inode creation rates than mine.  They are barely squeaking under the
20-year line--or they would be, if they were creating snapshots 50 times
faster than they do today.

Use cases that have the highest inode creation rates (like /tmp) tend
to get more specialized storage solutions (like tmpfs).

Cloud fleets do have higher average inode creation rates, but their
filesystems have much shorter lifespans than 20 years, so the delta on
both sides of the ratio cancels out.

If this hack is only used for NFS, it gives us some time to come up
with a better solution.  (On the other hand, we had 14 years already,
and here we are...)

> > If not, maybe it is a better balance if we combinate 22bit subvol id and
> > 42 bit inode?
> 
> This would be better except when it is worse.  We cannot know which will
> happen more often.
> 
> As long as BTRFS allows object-ids and root-ids combined to use more
> than 64 bits there can be no perfect solution.  There are many possible
> solutions that will be close to perfect in practice.  swab64() is the
> simplest that I could think of.  Picking any arbitrary cut-off (22/42,
> 24/40, ...) is unlikely to be better, and could is some circumstances be
> worse.
> 
> My preference would be for btrfs to start re-using old object-ids and
> root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> total number of bits does not exceed 64.  Unfortunately the maintainers
> seem reluctant to even consider this.

It was considered, implemented in 2011, and removed in 2020.  Rationale
is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
deprecation of mount option inode_cache".  It made file creation slower,
and consumed disk space, iops, and memory to run.  Nobody used it.
Newer on-disk data structure versions (free space tree, 2015) didn't
bother implementing inode_cache's storage requirement.

> NeilBrown

  reply	other threads:[~2021-08-19  2:19 UTC|newest]

Thread overview: 127+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-07-27 22:37 ` [PATCH 07/11] exportfs: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 10:13   ` Amir Goldstein
2021-07-29  0:28     ` NeilBrown
2021-07-29  5:27       ` Amir Goldstein
2021-08-06  7:52         ` Miklos Szeredi
2021-08-06  8:08           ` Amir Goldstein
2021-08-06  8:18             ` Miklos Szeredi
2021-07-28 19:17   ` J. Bruce Fields
2021-07-28 22:25     ` NeilBrown
2021-07-27 22:37 ` [PATCH 04/11] VFS: export lookup_mnt() NeilBrown
2021-07-30  0:31   ` Al Viro
2021-07-30  5:33     ` NeilBrown
2021-07-27 22:37 ` [PATCH 01/11] VFS: show correct dev num in mountinfo NeilBrown
2021-07-30  0:25   ` Al Viro
2021-07-30  5:28     ` NeilBrown
2021-07-30  5:54       ` Miklos Szeredi
2021-07-30  6:13         ` NeilBrown
2021-07-30  7:18           ` Miklos Szeredi
2021-07-30  7:33             ` NeilBrown
2021-07-30  7:59               ` Miklos Szeredi
2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
2021-08-02  5:25                   ` Al Viro
2021-08-02  5:40                     ` NeilBrown
2021-08-02  7:54                       ` Amir Goldstein
2021-08-02 13:53                         ` Josef Bacik
2021-08-03 22:29                           ` Qu Wenruo
2021-08-02 14:47                         ` Frank Filz
2021-08-02 21:24                         ` NeilBrown
2021-08-02  7:15                   ` Martin Steigerwald
2021-08-02 21:40                     ` NeilBrown
2021-08-02 12:39                   ` J. Bruce Fields
2021-08-02 20:32                     ` Patrick Goetz
2021-08-02 20:41                       ` J. Bruce Fields
2021-08-02 21:10                     ` NeilBrown
2021-08-02 21:50                       ` J. Bruce Fields
2021-08-02 21:59                         ` NeilBrown
2021-08-02 22:14                           ` J. Bruce Fields
2021-08-02 22:36                             ` NeilBrown
2021-08-03  0:15                               ` J. Bruce Fields
2021-07-27 22:37 ` [PATCH 03/11] VFS: pass lookup_flags into follow_down() NeilBrown
2021-07-27 22:37 ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots NeilBrown
2021-07-28  8:37   ` kernel test robot
2021-07-28  8:37     ` kernel test robot
2021-07-28  8:37   ` [RFC PATCH] btrfs: btrfs_mountpoint_expiry_timeout can be static kernel test robot
2021-07-28  8:37     ` kernel test robot
2021-07-28 13:12   ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots Christian Brauner
2021-07-29  0:43     ` NeilBrown
2021-07-29 14:38       ` Christian Brauner
2021-07-31  6:25   ` [btrfs] 5874902268: xfstests.btrfs.202.fail kernel test robot
2021-07-31  6:25     ` kernel test robot
2021-07-27 22:37 ` [PATCH 06/11] nfsd: include a vfsmount in struct svc_fh NeilBrown
2021-07-27 22:37 ` [PATCH 10/11] btrfs: introduce mapping function from location to inum NeilBrown
2021-07-27 22:37 ` [PATCH 02/11] VFS: allow d_automount to create in-place bind-mount NeilBrown
2021-07-27 22:37 ` [PATCH 09/11] nfsd: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 19:15   ` J. Bruce Fields
2021-07-28 22:29     ` NeilBrown
2021-07-30  0:42   ` Al Viro
2021-07-30  5:43     ` NeilBrown
2021-07-27 22:37 ` [PATCH 08/11] nfsd: change get_parent_attributes() to nfsd_get_mounted_on() NeilBrown
2021-07-27 22:37 ` [PATCH 05/11] VFS: new function: mount_is_internal() NeilBrown
2021-07-28  2:16   ` Al Viro
2021-07-28  3:32     ` NeilBrown
2021-07-30  0:34       ` Al Viro
2021-07-28  2:19 ` [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly Al Viro
2021-07-28  4:58 ` Wang Yugui
2021-07-28  6:04   ` Wang Yugui
2021-07-28  7:01     ` NeilBrown
2021-07-28 12:26       ` Neal Gompa
2021-07-28 19:14         ` J. Bruce Fields
2021-07-29  1:29           ` Zygo Blaxell
2021-07-29  1:43             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-28 22:50         ` NeilBrown
2021-07-29  2:37           ` Zygo Blaxell
2021-07-29  3:36             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-30  2:36                 ` NeilBrown
2021-07-30  5:25                   ` Qu Wenruo
2021-07-30  5:31                     ` Qu Wenruo
2021-07-30  5:53                       ` Amir Goldstein
2021-07-30  6:00                       ` NeilBrown
2021-07-30  6:09                         ` Qu Wenruo
2021-07-30  5:58                     ` NeilBrown
2021-07-30  6:23                       ` Qu Wenruo
2021-07-30  6:53                         ` NeilBrown
2021-07-30  7:09                           ` Qu Wenruo
2021-07-30 18:15                             ` Zygo Blaxell
2021-07-30 15:17                         ` J. Bruce Fields
2021-07-30 15:48                           ` Josef Bacik
2021-07-30 16:25                             ` Forza
2021-07-30 17:43                             ` Zygo Blaxell
2021-07-30  5:28                   ` Amir Goldstein
2021-07-28 13:43       ` g.btrfs
2021-07-29  1:39         ` NeilBrown
2021-07-29  9:28           ` Graham Cobb
2021-07-28  7:06   ` NeilBrown
2021-07-28  9:36     ` Wang Yugui
2021-07-28 19:35 ` J. Bruce Fields
2021-07-28 21:30   ` Josef Bacik
2021-07-30  0:13     ` Al Viro
2021-07-30  6:08       ` NeilBrown
2021-08-13  1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-13 14:55   ` Josef Bacik
2021-08-15  7:39   ` Goffredo Baroncelli
2021-08-15 19:35     ` Roman Mamedov
2021-08-15 21:03       ` Goffredo Baroncelli
2021-08-15 21:53         ` NeilBrown
2021-08-17 19:34           ` Goffredo Baroncelli
2021-08-17 21:39             ` NeilBrown
2021-08-18 17:24               ` Goffredo Baroncelli
2021-08-15 22:17       ` NeilBrown
2021-08-19  8:01         ` Amir Goldstein
2021-08-20  3:21           ` NeilBrown
2021-08-20  6:23             ` Amir Goldstein
2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
2021-08-23  8:17           ` kernel test robot
2021-08-23  8:17             ` kernel test robot
2021-08-18 14:54   ` [PATCH] VFS/BTRFS/NFSD: " Wang Yugui
2021-08-18 21:46     ` NeilBrown
2021-08-19  2:19       ` Zygo Blaxell [this message]
2021-08-20  2:54         ` NeilBrown
2021-08-22 19:29           ` Zygo Blaxell
2021-08-23  5:51             ` NeilBrown
2021-08-23 23:22             ` NeilBrown
2021-08-25  2:06               ` Zygo Blaxell
2021-08-23  0:57         ` Wang Yugui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210819021910.GB29026@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=bfields@fieldses.org \
    --cc=chuck.lever@oracle.com \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=hch@infradead.org \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wangyugui@e16-tech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.