linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "NeilBrown" <neilb@suse.de>
To: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
Cc: "Wang Yugui" <wangyugui@e16-tech.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Josef Bacik" <josef@toxicpanda.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Chris Mason" <clm@fb.com>, "David Sterba" <dsterba@suse.com>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Mon, 23 Aug 2021 15:51:54 +1000	[thread overview]
Message-ID: <162969791499.9892.11536866623369257320@noble.neil.brown.name> (raw)
In-Reply-To: <20210822192917.GF29026@hungrycats.org>

On Mon, 23 Aug 2021, Zygo Blaxell wrote:
> 
> Subvol IDs are not reusable.  They are embedded in shared object ownership
> metadata, and persist for some time after subvols are deleted.

Hmmm...  that's interesting.  Makes some sense too.  I did wonder how
ownership across multiple snapshots was tracked.

> 
> > > > My preference would be for btrfs to start re-using old object-ids and
> > > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> > > > total number of bits does not exceed 64.  Unfortunately the maintainers
> > > > seem reluctant to even consider this.
> > > 
> > > It was considered, implemented in 2011, and removed in 2020.  Rationale
> > > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
> > > deprecation of mount option inode_cache".  It made file creation slower,
> > > and consumed disk space, iops, and memory to run.  Nobody used it.
> > > Newer on-disk data structure versions (free space tree, 2015) didn't
> > > bother implementing inode_cache's storage requirement.
> > 
> > Yes, I saw that.  Providing reliable functional certainly can impact
> > performance and consume disk-space.  That isn't an excuse for not doing
> > it. 
> > I suspect that carefully tuned code could result in typical creation
> > times being unchanged, and mean creation times suffering only a tiny
> > cost.  Using "max+1" when the creation rate is particularly high might
> > be a reasonable part of managing costs.
> > Storage cost need not be worse than the cost of tracking free blocks
> > on the device.
> 
> The cost of _tracking_ free object IDs is trivial compared to the cost
> of _reusing_ an object ID on btrfs.

I hadn't thought of that.

> 
> If btrfs doesn't reuse object numbers, btrfs can append new objects
> to the last partially filled leaf.  If there are shared metadata pages
> (i.e. snapshots), btrfs unshares a handful of pages once, and then future
> writes use densely packed new pages and delayed allocation without having
> to read anything.
> 
> If btrfs reuses object numbers, the filesystem has to pack new objects
> into random previously filled metadata leaf nodes, so there are a lot
> of read-modify-writes scattered over old metadata pages, which spreads
> the working set around and reduces cache usage efficiency (i.e. uses
> more RAM).  If there are snapshots, each shared page that is modified
> for the first time after the snapshot comes with two-orders-of-magnitude
> worst-case write multipliers.

I don't really follow that .... but I'll take your word for it for now.

> 
> The two-algorithm scheme (switching from "reuse freed inode" to "max+1"
> under load) would be forced into the "max+1" mode half the time by a
> daily workload of alternating git checkouts and builds.  It would save
> only one bit of inode namespace over the lifetime of the filesystem.
> 
> > "Nobody used it" is odd.  It implies it would have to be explicitly
> > enabled, and all it would provide anyone is sane behaviour.  Who would
> > imagine that to be an optional extra.
> 
> It always had to be explicitly enabled.  It was initially a workaround
> for 32-bit ino_t that was limiting a few users, but ino_t got better
> and the need for inode_cache went away.
> 
> NFS (particularly NFSv2) might be the use case inode_cache has been
> waiting for.  btrfs has an i_version field for NFSv4, so it's not like
> there's no precedent for adding features in btrfs to support NFS.

NFSv2 is not worth any effort.  NFSv4 is.  NFSv3 ... some, but not a lot.

> 
> On the other hand, the cost of ino_cache gets worse with snapshots,
> and the benefit in practice takes years to decades to become relevant.
> Users who are exporting snapshots over NFS are likely to be especially
> averse to using inode_cache.

That's the real killer.  Everything will work fine for years until it
doesn't.  And once it doesn't ....  what do you do?

Thanks for lot for all this background info.  I've found it to be very
helpful for my general understanding.

Thanks,
NeilBrown

  reply	other threads:[~2021-08-23  5:52 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-07-27 22:37 ` [PATCH 07/11] exportfs: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 10:13   ` Amir Goldstein
2021-07-29  0:28     ` NeilBrown
2021-07-29  5:27       ` Amir Goldstein
2021-08-06  7:52         ` Miklos Szeredi
2021-08-06  8:08           ` Amir Goldstein
2021-08-06  8:18             ` Miklos Szeredi
2021-07-28 19:17   ` J. Bruce Fields
2021-07-28 22:25     ` NeilBrown
2021-07-27 22:37 ` [PATCH 04/11] VFS: export lookup_mnt() NeilBrown
2021-07-30  0:31   ` Al Viro
2021-07-30  5:33     ` NeilBrown
2021-07-27 22:37 ` [PATCH 01/11] VFS: show correct dev num in mountinfo NeilBrown
2021-07-30  0:25   ` Al Viro
2021-07-30  5:28     ` NeilBrown
2021-07-30  5:54       ` Miklos Szeredi
2021-07-30  6:13         ` NeilBrown
2021-07-30  7:18           ` Miklos Szeredi
2021-07-30  7:33             ` NeilBrown
2021-07-30  7:59               ` Miklos Szeredi
2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
2021-08-02  5:25                   ` Al Viro
2021-08-02  5:40                     ` NeilBrown
2021-08-02  7:54                       ` Amir Goldstein
2021-08-02 13:53                         ` Josef Bacik
2021-08-03 22:29                           ` Qu Wenruo
2021-08-02 14:47                         ` Frank Filz
2021-08-02 21:24                         ` NeilBrown
2021-08-02  7:15                   ` Martin Steigerwald
2021-08-02 21:40                     ` NeilBrown
2021-08-02 12:39                   ` J. Bruce Fields
2021-08-02 20:32                     ` Patrick Goetz
2021-08-02 20:41                       ` J. Bruce Fields
2021-08-02 21:10                     ` NeilBrown
2021-08-02 21:50                       ` J. Bruce Fields
2021-08-02 21:59                         ` NeilBrown
2021-08-02 22:14                           ` J. Bruce Fields
2021-08-02 22:36                             ` NeilBrown
2021-08-03  0:15                               ` J. Bruce Fields
2021-07-27 22:37 ` [PATCH 03/11] VFS: pass lookup_flags into follow_down() NeilBrown
2021-07-27 22:37 ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots NeilBrown
2021-07-28  8:37   ` kernel test robot
2021-07-28  8:37   ` [RFC PATCH] btrfs: btrfs_mountpoint_expiry_timeout can be static kernel test robot
2021-07-28 13:12   ` [PATCH 11/11] btrfs: use automount to bind-mount all subvol roots Christian Brauner
2021-07-29  0:43     ` NeilBrown
2021-07-29 14:38       ` Christian Brauner
2021-07-31  6:25   ` [btrfs] 5874902268: xfstests.btrfs.202.fail kernel test robot
2021-07-27 22:37 ` [PATCH 06/11] nfsd: include a vfsmount in struct svc_fh NeilBrown
2021-07-27 22:37 ` [PATCH 10/11] btrfs: introduce mapping function from location to inum NeilBrown
2021-07-27 22:37 ` [PATCH 02/11] VFS: allow d_automount to create in-place bind-mount NeilBrown
2021-07-27 22:37 ` [PATCH 09/11] nfsd: Allow filehandle lookup to cross internal mount points NeilBrown
2021-07-28 19:15   ` J. Bruce Fields
2021-07-28 22:29     ` NeilBrown
2021-07-30  0:42   ` Al Viro
2021-07-30  5:43     ` NeilBrown
2021-07-27 22:37 ` [PATCH 08/11] nfsd: change get_parent_attributes() to nfsd_get_mounted_on() NeilBrown
2021-07-27 22:37 ` [PATCH 05/11] VFS: new function: mount_is_internal() NeilBrown
2021-07-28  2:16   ` Al Viro
2021-07-28  3:32     ` NeilBrown
2021-07-30  0:34       ` Al Viro
2021-07-28  2:19 ` [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly Al Viro
2021-07-28  4:58 ` Wang Yugui
2021-07-28  6:04   ` Wang Yugui
2021-07-28  7:01     ` NeilBrown
2021-07-28 12:26       ` Neal Gompa
2021-07-28 19:14         ` J. Bruce Fields
2021-07-29  1:29           ` Zygo Blaxell
2021-07-29  1:43             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-28 22:50         ` NeilBrown
2021-07-29  2:37           ` Zygo Blaxell
2021-07-29  3:36             ` NeilBrown
2021-07-29 23:20               ` Zygo Blaxell
2021-07-30  2:36                 ` NeilBrown
2021-07-30  5:25                   ` Qu Wenruo
2021-07-30  5:31                     ` Qu Wenruo
2021-07-30  5:53                       ` Amir Goldstein
2021-07-30  6:00                       ` NeilBrown
2021-07-30  6:09                         ` Qu Wenruo
2021-07-30  5:58                     ` NeilBrown
2021-07-30  6:23                       ` Qu Wenruo
2021-07-30  6:53                         ` NeilBrown
2021-07-30  7:09                           ` Qu Wenruo
2021-07-30 18:15                             ` Zygo Blaxell
2021-07-30 15:17                         ` J. Bruce Fields
2021-07-30 15:48                           ` Josef Bacik
2021-07-30 16:25                             ` Forza
2021-07-30 17:43                             ` Zygo Blaxell
2021-07-30  5:28                   ` Amir Goldstein
2021-07-28 13:43       ` g.btrfs
2021-07-29  1:39         ` NeilBrown
2021-07-29  9:28           ` Graham Cobb
2021-07-28  7:06   ` NeilBrown
2021-07-28  9:36     ` Wang Yugui
2021-07-28 19:35 ` J. Bruce Fields
2021-07-28 21:30   ` Josef Bacik
2021-07-30  0:13     ` Al Viro
2021-07-30  6:08       ` NeilBrown
2021-08-13  1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-13 14:55   ` Josef Bacik
2021-08-15  7:39   ` Goffredo Baroncelli
2021-08-15 19:35     ` Roman Mamedov
2021-08-15 21:03       ` Goffredo Baroncelli
2021-08-15 21:53         ` NeilBrown
2021-08-17 19:34           ` Goffredo Baroncelli
2021-08-17 21:39             ` NeilBrown
2021-08-18 17:24               ` Goffredo Baroncelli
2021-08-15 22:17       ` NeilBrown
2021-08-19  8:01         ` Amir Goldstein
2021-08-20  3:21           ` NeilBrown
2021-08-20  6:23             ` Amir Goldstein
2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
2021-08-18 14:54   ` [PATCH] VFS/BTRFS/NFSD: " Wang Yugui
2021-08-18 21:46     ` NeilBrown
2021-08-19  2:19       ` Zygo Blaxell
2021-08-20  2:54         ` NeilBrown
2021-08-22 19:29           ` Zygo Blaxell
2021-08-23  5:51             ` NeilBrown [this message]
2021-08-23 23:22             ` NeilBrown
2021-08-25  2:06               ` Zygo Blaxell
2021-08-23  0:57         ` Wang Yugui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=162969791499.9892.11536866623369257320@noble.neil.brown.name \
    --to=neilb@suse.de \
    --cc=bfields@fieldses.org \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=chuck.lever@oracle.com \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=hch@infradead.org \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wangyugui@e16-tech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).