From: "NeilBrown" <neilb@suse.de>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Josef Bacik" <josef@toxicpanda.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Chris Mason" <clm@fb.com>, "David Sterba" <dsterba@suse.com>,
	linux-nfs@vger.kernel.org, "Wang Yugui" <wangyugui@e16-tech.com>,
	"Ulli Horlacher" <framstag@rus.uni-stuttgart.de>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
Date: Tue, 20 Jul 2021 10:02:48 +1000
Message-ID: <162673936876.4136.15592386101064503795@noble.neil.brown.name>
In-Reply-To: <20210719154907.GA28482@fieldses.org>

On Tue, 20 Jul 2021, J. Bruce Fields wrote:
> On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> > On Fri, 16 Jul 2021, Josef Bacik wrote:
> > > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > > >> Because there's no alternative.  We need a way to tell userspace they've
> > > >> wandered into a different inode namespace.  There's no argument that what
> > > >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> > > >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> > > >> simple and straightforward.  I'm open to alternatives, but there have been 0
> > > >> workable alternatives proposed in the last decade of complaining about it.
> > > > 
> > > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > > that it is properly reported.   Suggested many times and ignored all
> > > > > the time because it requires a bit of work.
> > > > 
> > > 
> > > You keep telling me this but forgetting that I did all this work when you 
> > > originally suggested it.  The problem I ran into was the automount stuff 
> > > requires that we have a completely different superblock for every vfsmount. 
> > > This is fine for things like nfs or samba where the automount literally points 
> > > to a completely different mount, but doesn't work for btrfs where it's on the 
> > > same file system.  If you have 1000 subvolumes and run sync() you're going to 
> > > write the superblock 1000 times for the same file system.  You are going to 
> > > reclaim inodes on the same file system 1000 times.  You are going to reclaim 
> > > > dcache on the same filesystem 1000 times.  You are also going to pin 1000 
> > > dentries/inodes into memory whenever you wander into these things because the 
> > > super is going to hold them open.
> > > 
> > > This is not a workable solution.  It's not a matter of simply tying into 
> > > existing infrastructure, we'd have to completely rework how the VFS deals with 
> > > this stuff in order to be reasonable.  And when I brought this up to Al he told 
> > > me I was insane and we absolutely had to have a different SB for every vfsmount, 
> > > which means we can't use vfsmount for this, which means we don't have any other 
> > > options.  Thanks,
> > 
> > When I was first looking at this, I thought that separate vfsmnts
> > and auto-mounting was the way to go "just like NFS".  NFS still shares a
> > > lot between the multiple superblocks - certainly it shares the same
> > connection to the server.
> > 
> > But I dropped the idea when Bruce pointed out that nfsd is not set up to
> > export auto-mounted filesystems.
> 
> > Yes.  I wish it was....  But we'd need some way to look up a
> not-currently-mounted filesystem by filehandle:
> 
> > It needs to be able to find a
> > filesystem given a UUID (extracted from a filehandle), and it does this
> > by walking through the mount table to find one that matches.  So unless
> > all btrfs subvols were mounted all the time (which I wouldn't propose),
> > it would need major work to fix.
> > 
> > NFSv4 describes the fsid as having a "major" and "minor" component.
> > We've never treated these as having an important meaning - just extra
> > bits to encode uniqueness in.  Maybe we should have used "major" for the
> > vfsmnt, and kept "minor" for the subvol.....
> 
> So nfsd would use the "major" ID to find the parent export, and then
> btrfs would use the "minor" ID to identify the subvolume?

Maybe, though I don't think it would be really useful - just a
thought-bubble.
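
To make the idea in the exchange above concrete, here is a minimal
sketch of a two-step lookup in which "major" selects the export and
"minor" selects the subvolume inside it.  NFSv4 does define the fsid as
a (major, minor) pair of 64-bit values, but everything else here is
hypothetical: the Fsid type, the EXPORTS and SUBVOLS tables, the paths
and IDs, and the resolve() helper are invented for illustration and do
not correspond to anything in nfsd or btrfs.

```python
from typing import NamedTuple, Optional, Tuple


class Fsid(NamedTuple):
    """Illustrative stand-in for an NFSv4 fsid (two 64-bit values)."""
    major: int  # assumed meaning: which export (vfsmnt) this is
    minor: int  # assumed meaning: which subvolume inside that export


# Hypothetical tables, invented for this sketch only.
EXPORTS = {1: "/srv/pool"}                           # major -> export path
SUBVOLS = {1: {256: "home", 257: "home-snapshot"}}   # major -> minor -> name


def resolve(fsid: Fsid) -> Optional[Tuple[str, str]]:
    """Return (export path, subvolume name), or None if either step fails."""
    # Step 1: the server side uses "major" to find the parent export.
    export = EXPORTS.get(fsid.major)
    if export is None:
        return None
    # Step 2: the filesystem uses "minor" to identify the subvolume.
    subvol = SUBVOLS.get(fsid.major, {}).get(fsid.minor)
    if subvol is None:
        return None
    return export, subvol
```

With these made-up tables, resolve(Fsid(1, 256)) yields the pair
("/srv/pool", "home"), and an unknown major or minor yields None.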

As the spec doesn't define any behaviour of these two numbers, there is
no point trying to impose any.
But (as described in another email) I think we do need to clearly
differentiate between "volume" and "subvolume" in the Linux API.
We cannot really use "different mount point" to mean "different volume"
as bind mounts broke that model long ago.

I think that "different st_dev" means "different subvolume" is a core
requirement as many applications assume that.  So the question is "how
to determine if two objects in different subvolumes are still in the
same volume".  This is something that nfsd needs to know.
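
The st_dev assumption mentioned above is the check that many userspace
tools effectively perform.  A minimal sketch, using only os.stat() from
the Python standard library (the helper name same_st_dev is made up for
this example):

```python
import os
import sys


def same_st_dev(path_a: str, path_b: str) -> bool:
    """True if both paths report the same st_dev via stat(2)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Usage: python3 same_dev.py PATH_A PATH_B
    a, b = sys.argv[1], sys.argv[2]
    if same_st_dev(a, b):
        print("same st_dev")
    else:
        # A different st_dev alone cannot say whether the two objects
        # still belong to the same volume; that is the open question.
        print("different st_dev")
```

On btrfs each subvolume reports its own st_dev, so two paths in
different subvolumes of one filesystem compare as different here, just
as two paths on unrelated filesystems would.  The comparison cannot
distinguish those two cases.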

NeilBrown

