Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
       [not found]         ` <163038594541.7591.11109978693705593957@noble.neil.brown.name>
@ 2021-09-01  7:20           ` Christoph Hellwig
  2021-09-01 15:22             ` J. Bruce Fields
  2021-09-02  4:06             ` NeilBrown
  0 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2021-09-01  7:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, J. Bruce Fields, Chuck Lever, linux-nfs,
	Josef Bacik, linux-fsdevel

On Tue, Aug 31, 2021 at 02:59:05PM +1000, NeilBrown wrote:
> Making the change purely in btrfs is simply not possible.  There is no
> way for btrfs to provide nfsd with a different inode number.  To move
> the bulk of the change into btrfs code we would need - at the very least
> - some way for nfsd to provide the filehandle when requesting stat
> information.  We would also need to provide a reference filehandle when
> requesting a dentry->filehandle conversion.  Cluttering the
> export_operations like that just for btrfs doesn't seem like the right
> balance.  I agree that cluttering kstat is not ideal, but it was a case
> of choosing the minimum change for the maximum effect.

So you're papering over a btrfs bug by piling up cludges in the nsdd
code that has not business even knowing about this btrfs bug, while
leaving other users of inodes numbers and file handles broken?

If you only care about file handles:  this is what the export operations
are for.  If you care about inode numbers:  well, it is up to btrfs
to generate uniqueue inode numbers.  It currently doesn't do that, and
no amount of papering over that in nfsd is going to fix the issue.

If XORing a little more entropy into the inode number is a good enough
band aid (and I strongly disagree with that), do it inside btrfs for
every place they report the inode number.  There is nothing NFS-specific
about that.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-01  7:20           ` [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export Christoph Hellwig
@ 2021-09-01 15:22             ` J. Bruce Fields
  2021-09-02  4:14               ` NeilBrown
  2021-09-02  7:11               ` Christoph Hellwig
  2021-09-02  4:06             ` NeilBrown
  1 sibling, 2 replies; 19+ messages in thread
From: J. Bruce Fields @ 2021-09-01 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Chuck Lever, linux-nfs, Josef Bacik, linux-fsdevel

On Wed, Sep 01, 2021 at 08:20:06AM +0100, Christoph Hellwig wrote:
> On Tue, Aug 31, 2021 at 02:59:05PM +1000, NeilBrown wrote:
> > Making the change purely in btrfs is simply not possible.  There is no
> > way for btrfs to provide nfsd with a different inode number.  To move
> > the bulk of the change into btrfs code we would need - at the very least
> > - some way for nfsd to provide the filehandle when requesting stat
> > information.  We would also need to provide a reference filehandle when
> > requesting a dentry->filehandle conversion.  Cluttering the
> > export_operations like that just for btrfs doesn't seem like the right
> > balance.  I agree that cluttering kstat is not ideal, but it was a case
> > of choosing the minimum change for the maximum effect.
> 
> So you're papering over a btrfs bug by piling up cludges in the nsdd
> code that has not business even knowing about this btrfs bug, while
> leaving other users of inodes numbers and file handles broken?
> 
> If you only care about file handles:  this is what the export operations
> are for.  If you care about inode numbers:  well, it is up to btrfs
> to generate uniqueue inode numbers.  It currently doesn't do that, and
> no amount of papering over that in nfsd is going to fix the issue.
> 
> If XORing a little more entropy

It's stronger than "a little more entropy".  We know enough about how
the numbers being XOR'd grow to know that collisions are only going to
happen in some extreme use cases.  (If I understand correctly.)

> into the inode number is a good enough band aid (and I strongly
> disagree with that), do it inside btrfs for every place they report
> the inode number.  There is nothing NFS-specific about that.

Neil tried something like that:

	https://lore.kernel.org/linux-nfs/162761259105.21659.4838403432058511846@noble.neil.brown.name/

	"The patch below, which is just a proof-of-concept, changes
	btrfs to report a uniform st_dev, and different (64bit) st_ino
	in different subvols."

(Though actually you're proposing keeping separate st_dev?)

I looked back through a couple threads to try to understand why we
couldn't do that (on new filesystems, with a mkfs option to choose new
or old behavior) and still don't understand.  But the threads are long.

There are objections to a new mount option (which seem obviously wrong;
this should be a persistent feature of the on-disk filesystem).

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-01  7:20           ` [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export Christoph Hellwig
  2021-09-01 15:22             ` J. Bruce Fields
@ 2021-09-02  4:06             ` NeilBrown
  2021-09-02  7:16               ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-09-02  4:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, J. Bruce Fields, Chuck Lever, linux-nfs,
	Josef Bacik, linux-fsdevel

On Wed, 01 Sep 2021, Christoph Hellwig wrote:
> On Tue, Aug 31, 2021 at 02:59:05PM +1000, NeilBrown wrote:
> > Making the change purely in btrfs is simply not possible.  There is no
> > way for btrfs to provide nfsd with a different inode number.  To move
> > the bulk of the change into btrfs code we would need - at the very least
> > - some way for nfsd to provide the filehandle when requesting stat
> > information.  We would also need to provide a reference filehandle when
> > requesting a dentry->filehandle conversion.  Cluttering the
> > export_operations like that just for btrfs doesn't seem like the right
> > balance.  I agree that cluttering kstat is not ideal, but it was a case
> > of choosing the minimum change for the maximum effect.
> 
> So you're papering over a btrfs bug by piling up cludges in the nsdd
> code that has not business even knowing about this btrfs bug, while
> leaving other users of inodes numbers and file handles broken?
> 
> If you only care about file handles:  this is what the export operations
> are for.  If you care about inode numbers:  well, it is up to btrfs
> to generate uniqueue inode numbers.  It currently doesn't do that, and
> no amount of papering over that in nfsd is going to fix the issue.
> 
> If XORing a little more entropy into the inode number is a good enough
> band aid (and I strongly disagree with that), do it inside btrfs for
> every place they report the inode number.  There is nothing NFS-specific
> about that.
> 

Hi Christoph,
 I have to say that I struggle with some of these conversations with
 you.
 I don't know if it is deliberate on your part, or inadvertent, or
 purely in my imagination, but your attitude often seems combative.  I
 find that to be a disincentive to continuing to engage, which I need to
 work hard to overcome.  If I'm misunderstanding you, I apologise and
 simply ask that you do what you can to compensate for my apparent
 sensitivity.

 Your attitude seems to be that this is a btrfs problem and must be
 fixed in btrfs.  I agree about the source of the problem - specifically:
  Commit 3394e1607eaf ("Btrfs: Give each subvol and snapshot their own anonymous devid")
 took a wrong turn.  But I don't think we can completely isolate any
 part of the kernel, and we need to work together to solve problems that
 affect us, no matter the cause.  Similarly our code needs to work
 together.

 Highlighting the various problems with the proposed solution doesn't
 help a lot - they are fairly obvious.  Proposing solutions would be
 much more helpful, and I have no doubt that your different experience
 and perspective could help me see things that I have missed.   Any help
 that you can provide would certainly be appreciated.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-01 15:22             ` J. Bruce Fields
@ 2021-09-02  4:14               ` NeilBrown
  2021-09-05 16:07                 ` J. Bruce Fields
  2021-09-02  7:11               ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-09-02  4:14 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christoph Hellwig, Chuck Lever, linux-nfs, Josef Bacik, linux-fsdevel

On Thu, 02 Sep 2021, J. Bruce Fields wrote:
> I looked back through a couple threads to try to understand why we
> couldn't do that (on new filesystems, with a mkfs option to choose new
> or old behavior) and still don't understand.  But the threads are long.
> 
> There are objections to a new mount option (which seem obviously wrong;
> this should be a persistent feature of the on-disk filesystem).

I hadn't thought much (if at all) about a persistent filesystem feature
flag.  I'll try that now.

There are two features of interest.  One is completely unique inode
numbers, the other is reporting different st_dev for different
subvolumes.  I think these need to be kept separate, though the second
would depend on the first.  They would be similar to my "inumbits" and
"numdevs" mount options, though with less flexibility.  I think that
they would need strong semantics to be acceptable - "mostly unique"
isn't really acceptable once we are changing the on-disk data.

The "unique inode numbers" bit (UIN) would require that file object-ids
fit in some number of bits (maybe 40) and that subvolume numbers fit in
the remaining bits (24) and would then combine them together for the
inode number.  This could obviously be set at mkfs time.  Could it be
set on an unmounted filesystem?

The "single-dev" flag (SD) could be toggled any time that UIN was set,
and mkfs would default it on if UIN was selected.

If UIN was in effect, then creating a subvol beyond the permitted max
would have to fail.  24 bits is small enough that we would probably want
a warning of impending doom - maybe at 23 bits? The current 48bits
doesn't need that.
Similarly creating an inode beyond 40bits would have to fail.  This is
probably more problematic and so might need more warnings.  Do we want a
warning each time any subvol crosses some limit?  If not we would need a
flag for each warning.

What should a sysadmin do when they see the warning? If 40 bit an
unacceptable limit of the total number of inodes in a subvol, or is it
only a problem because of btrfs' practice of never reusing object-ids?

Backup-and-restore would compact object-ids, but would be a big cost.
Off-line reindexing would be cheaper (does anyone else remember using
"renum" programs with BASIC??).  Online lazy re-indexing might be
possible if the inode number was maintained separately from the
object-id and an atomic "switch which inode number to use" could be done
at mount time.

Setting UIN on an existing filesystem would require checking that only
24bit are used for subvolumes (easy) and that only 40 were usgd for
objects in any individual subvolume (presumably that would require
checking all subvolumes, which might take a little while, but shouldn't
take more than a few minutes.

Doing this would break any indexes that might be created over files, and
would probably upset any active NFS mounts, and would likely have other
problems.  Se it would need to be a well-documented step with clear
rewards.

An alternative to renumbering would be to maintain file-ids and
subvolume-ids which are separate from the object-id.  Apparently reusing
subvolume object-ids is not possible and reusing file object-ids is
quite costly.  If the file-id were separate from the object-id, these
problems would vanish.

This would require extra space in the inode (there are several reserved
u64s, so that isn't a problem) and space in each directory entry (might
be more of a problem).  It would also require some way to keep track of
used (or unused) id numbers.  This avoids the cost of renumbering, by
spreading it out over every creation.  I suspect the average
inode-creation overhead could be kept quite low, but not quite zero.

I believe that some code *knows* that the root of any btrfs subvolumes
has inode number 256.  systemd seems to use this.  I have no idea what
else might depend on inode numbers in some way.

I suspect that if we tried to roll out a change like this, either almost
no-one would use it (if it wasn't the default), or things would start
breaking (if it was).  I'm not against breaking things, but we need to
be sure there is a solution for fixing them, and I'm certainly not up to
doing that myself.

So yes - I think that using a mkfs option would open up other avenues
for a solution.  There would still be a lot of work to find something
that continues to meet everyone's needs.

The advantage of an nfsd-focusses solution is that we can have working
code today with minimal down-sides.  I'm certainly not prepared to go
digging through btrfs code to determine how to implement a btrfs-only
solution without strong buy-in from btrfs maintainers.

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-01 15:22             ` J. Bruce Fields
  2021-09-02  4:14               ` NeilBrown
@ 2021-09-02  7:11               ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2021-09-02  7:11 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christoph Hellwig, NeilBrown, Chuck Lever, linux-nfs,
	Josef Bacik, linux-fsdevel

On Wed, Sep 01, 2021 at 11:22:51AM -0400, J. Bruce Fields wrote:
> It's stronger than "a little more entropy".  We know enough about how
> the numbers being XOR'd grow to know that collisions are only going to
> happen in some extreme use cases.  (If I understand correctly.)

Do we know that a malicious attacker can't reproduce the collisions?
Because that is the case to worry about.

> > into the inode number is a good enough band aid (and I strongly
> > disagree with that), do it inside btrfs for every place they report
> > the inode number.  There is nothing NFS-specific about that.
> 
> Neil tried something like that:
> 
> 	https://lore.kernel.org/linux-nfs/162761259105.21659.4838403432058511846@noble.neil.brown.name/
> 
> 	"The patch below, which is just a proof-of-concept, changes
> 	btrfs to report a uniform st_dev, and different (64bit) st_ino
> 	in different subvols."
> 
> (Though actually you're proposing keeping separate st_dev?)

No, I'm not suggestion to keep a separate st_dev in that case.  So the
above scheme looks like the most reasonable (or least unreasonable) of
the approaches I've seen so far.  I have to admit I've only noticed it
now given how deep it was hidden in a thread that I only followed bit
while on vacation.

> I looked back through a couple threads to try to understand why we
> couldn't do that (on new filesystems, with a mkfs option to choose new
> or old behavior) and still don't understand.  But the threads are long.
> 
> There are objections to a new mount option (which seem obviously wrong;
> this should be a persistent feature of the on-disk filesystem).

Yes.  Anything like this needs to be persisted.  But a mount option
might still be a reasonable way to set that persistent flag.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-02  4:06             ` NeilBrown
@ 2021-09-02  7:16               ` Christoph Hellwig
  2021-09-02  7:53                 ` Miklos Szeredi
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-09-02  7:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, J. Bruce Fields, Chuck Lever, linux-nfs,
	Josef Bacik, linux-fsdevel

On Thu, Sep 02, 2021 at 02:06:54PM +1000, NeilBrown wrote:
> Hi Christoph,
>  I have to say that I struggle with some of these conversations with
>  you.
>  I don't know if it is deliberate on your part, or inadvertent, or
>  purely in my imagination, but your attitude often seems combative.  I
>  find that to be a disincentive to continuing to engage, which I need to
>  work hard to overcome.  If I'm misunderstanding you, I apologise and
>  simply ask that you do what you can to compensate for my apparent
>  sensitivity.

I would not call it combative.  It is sticking to the points and the red
lines.

>  Your attitude seems to be that this is a btrfs problem and must be
>  fixed in btrfs.

Yes.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-02  7:16               ` Christoph Hellwig
@ 2021-09-02  7:53                 ` Miklos Szeredi
  2021-09-02 14:16                   ` Frank Filz
  0 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2021-09-02  7:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, J. Bruce Fields, Chuck Lever, Linux NFS list,
	Josef Bacik, linux-fsdevel

On Thu, 2 Sept 2021 at 09:18, Christoph Hellwig <hch@infradead.org> wrote:

> >  Your attitude seems to be that this is a btrfs problem and must be
> >  fixed in btrfs.
>
> Yes.

st_ino space issues affect overlayfs as well.   The two problems are
not the same, but do share some characteristics.  And this limitation
will likely come up again in the future.

I suspect the long term solution would involve introducing new
userspace API (variable length inode numbers) and deprecating st_ino.
E.g. make stat return an error if the inode number doesn't fit into
st_ino and add a statx extension to return the full number.  But this
would be a long process...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-02  7:53                 ` Miklos Szeredi
@ 2021-09-02 14:16                   ` Frank Filz
  2021-09-02 23:02                     ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Frank Filz @ 2021-09-02 14:16 UTC (permalink / raw)
  To: 'Miklos Szeredi', 'Christoph Hellwig'
  Cc: 'NeilBrown', 'J. Bruce Fields',
	'Chuck Lever', 'Linux NFS list',
	'Josef Bacik',
	linux-fsdevel

> On Thu, 2 Sept 2021 at 09:18, Christoph Hellwig <hch@infradead.org> wrote:
> 
> > >  Your attitude seems to be that this is a btrfs problem and must be
> > > fixed in btrfs.
> >
> > Yes.
> 
> st_ino space issues affect overlayfs as well.   The two problems are
> not the same, but do share some characteristics.  And this limitation will likely
> come up again in the future.
> 
> I suspect the long term solution would involve introducing new userspace API
> (variable length inode numbers) and deprecating st_ino.
> E.g. make stat return an error if the inode number doesn't fit into st_ino and add
> a statx extension to return the full number.  But this would be a long process...

But then what do we do where fileid in NFS is only 64 bits?

The solution of giving each subvol a separate fsid is the only real solution to enlarging the NFS fileid space, however that has downsides on the client side.

Frank


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-02 14:16                   ` Frank Filz
@ 2021-09-02 23:02                     ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2021-09-02 23:02 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Miklos Szeredi', 'Christoph Hellwig',
	'J. Bruce Fields', 'Chuck Lever',
	'Linux NFS list', 'Josef Bacik',
	linux-fsdevel

On Fri, 03 Sep 2021, Frank Filz wrote:
> > On Thu, 2 Sept 2021 at 09:18, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > >  Your attitude seems to be that this is a btrfs problem and must be
> > > > fixed in btrfs.
> > >
> > > Yes.
> > 
> > st_ino space issues affect overlayfs as well.   The two problems are
> > not the same, but do share some characteristics.  And this limitation will likely
> > come up again in the future.
> > 
> > I suspect the long term solution would involve introducing new userspace API
> > (variable length inode numbers) and deprecating st_ino.
> > E.g. make stat return an error if the inode number doesn't fit into st_ino and add
> > a statx extension to return the full number.  But this would be a long process...
> 
> But then what do we do where fileid in NFS is only 64 bits?

Hence my suggestion that user-space should move to using the filehandle.

Two file handles being different doesn't guarantee that the two objects
are different, but the problems caused by incorrectly assuming two
things are different are much less than the problems caused by
incorrectly assuming two things are the same.

> 
> The solution of giving each subvol a separate fsid is the only real
> solution to enlarging the NFS fileid space, however that has downsides
> on the client side.

And this doesn't help overlayfs...  

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-02  4:14               ` NeilBrown
@ 2021-09-05 16:07                 ` J. Bruce Fields
  2021-09-06  1:29                   ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2021-09-05 16:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Chuck Lever, linux-nfs, Josef Bacik, linux-fsdevel

On Thu, Sep 02, 2021 at 02:14:17PM +1000, NeilBrown wrote:
> On Thu, 02 Sep 2021, J. Bruce Fields wrote:
> > I looked back through a couple threads to try to understand why we
> > couldn't do that (on new filesystems, with a mkfs option to choose new
> > or old behavior) and still don't understand.  But the threads are long.
> > 
> > There are objections to a new mount option (which seem obviously wrong;
> > this should be a persistent feature of the on-disk filesystem).
> 
> I hadn't thought much (if at all) about a persistent filesystem feature
> flag.  I'll try that now.
> 
> There are two features of interest.  One is completely unique inode
> numbers, the other is reporting different st_dev for different
> subvolumes.  I think these need to be kept separate, though the second
> would depend on the first.  They would be similar to my "inumbits" and
> "numdevs" mount options, though with less flexibility.  I think that
> they would need strong semantics to be acceptable - "mostly unique"
> isn't really acceptable once we are changing the on-disk data.

I don't quite follow that.

Also the "on-disk data" here is literally just one more flag bit in some
superblock field, right?

> I believe that some code *knows* that the root of any btrfs subvolumes
> has inode number 256.  systemd seems to use this.  I have no idea what
> else might depend on inode numbers in some way.

Looking.  Ugh, yes, there's  abtrfs_might_be_subvol that takes a struct
stat and returns:

        return S_ISDIR(st->st_mode) && st->st_ino == 256;

I wonder why it does that?  Are there situations where all it has is a
file descriptor (so it can't easily compare st_dev with the parent?)
And if you NFS-export and wanted to answer the same question on the
client side, I wonder what you'd do.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-05 16:07                 ` J. Bruce Fields
@ 2021-09-06  1:29                   ` NeilBrown
  2021-09-11 14:12                     ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-09-06  1:29 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christoph Hellwig, Chuck Lever, linux-nfs, Josef Bacik, linux-fsdevel

On Mon, 06 Sep 2021, J. Bruce Fields wrote:
> On Thu, Sep 02, 2021 at 02:14:17PM +1000, NeilBrown wrote:
> > On Thu, 02 Sep 2021, J. Bruce Fields wrote:
> > > I looked back through a couple threads to try to understand why we
> > > couldn't do that (on new filesystems, with a mkfs option to choose new
> > > or old behavior) and still don't understand.  But the threads are long.
> > > 
> > > There are objections to a new mount option (which seem obviously wrong;
> > > this should be a persistent feature of the on-disk filesystem).
> > 
> > I hadn't thought much (if at all) about a persistent filesystem feature
> > flag.  I'll try that now.
> > 
> > There are two features of interest.  One is completely unique inode
> > numbers, the other is reporting different st_dev for different
> > subvolumes.  I think these need to be kept separate, though the second
> > would depend on the first.  They would be similar to my "inumbits" and
> > "numdevs" mount options, though with less flexibility.  I think that
> > they would need strong semantics to be acceptable - "mostly unique"
> > isn't really acceptable once we are changing the on-disk data.
> 
> I don't quite follow that.

I agree it is a bit of a leap.

> 
> Also the "on-disk data" here is literally just one more flag bit in some
> superblock field, right?

Maybe.  I *could* be just one bit.
But even "just one bit" is, I think, more of a support commitement than
adding a mount option.
Mount options are fairly obvious to the user.  super-blocks not as much.
So "just one bit" might still be "one more question" than the supoort
people need to ask when handling a problem report.

When I wrote that I was thinking about how I would be comfortable with
if I were a btrfs maintainer.  And I don't think I'd like to spend and
on-disk change and only gain a "mostly harmless" solution.

Christoph's comment about possible vulnerabilities are probably part of
this.  I think that over NFS, concern about a user being able to
synthesise an inode number conflict is probably "mountain out of mole
hill" territory.  However for local access, I cannot convince myself
that it won't be a problem.  I can imagine (incautiously written)
auditing scans getting confused, and while auditing over NFS doesn't
make much sense, auditing locally does.

> 
> > I believe that some code *knows* that the root of any btrfs subvolumes
> > has inode number 256.  systemd seems to use this.  I have no idea what
> > else might depend on inode numbers in some way.
> 
> Looking.  Ugh, yes, there's  abtrfs_might_be_subvol that takes a struct
> stat and returns:
> 
>         return S_ISDIR(st->st_mode) && st->st_ino == 256;
> 
> I wonder why it does that?  Are there situations where all it has is a
> file descriptor (so it can't easily compare st_dev with the parent?)
> And if you NFS-export and wanted to answer the same question on the
> client side, I wonder what you'd do.

There are also a few references to BTRFS_FIRST_FREE_OBJECTID which is
256.

Uses seem to include:
 - managing quotas, which fits with my idea that subvols are like
   project-quota trees.
 - optionally preventing "rm -r" from removing subvols
 - some switching to/from "readonly" which I cannot follow
 - some special handling of user home-directories when they are
   subvols

These are probably reasonable and do point to subvols being a little bit
like separate filesytems.  These would break if we changed local inode
numbers. 

The project-quota management and the read-only setting are not available
via NFS so changing the inode number seen that way is not likely to
matter as much.  In any case, detecting "256" is only useful if you can
also detect "is btrfs", and you cannot do that of NFS.

Once upon a time ext[234] had a set of inode flags and xfs separately
had a bunch of inode flags.  These are now unified (to a degree) in
'struct fsattr' accessed by FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.

btrfs supports that interface, but doesn't appear to have extended it
for subvol-specific things - preferring to create btrfs-specific ioctls
instead.   Maybe they weren't designed to be extensible enough.

Maybe what we really need is for a bunch of diverse filesystem
developers to get together and agree on some new common interface for
subvolume management, including coming up with some sort of definition
of what a subvolume "is".

Until that happens (and the new interfaces are implemented and widely
used) I can only see two possible solutions to the current
NFS-export-of-btrfs problem:

1/ change nfsd to export a different inode number to the one btrfs uses
   (or maybe a different fsid, but that is problematic in other ways)
2/ change userspace to check filehandles and not assume two things are
   the same if their filehandles are different.

Maybe I should write a patch for fts_read() and see how much glibc folk
will hate it.

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-06  1:29                   ` NeilBrown
@ 2021-09-11 14:12                     ` Amir Goldstein
  2021-09-13  0:43                       ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2021-09-11 14:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

> Maybe what we really need is for a bunch of diverse filesystem
> developers to get together and agree on some new common interface for
> subvolume management, including coming up with some sort of definition
> of what a subvolume "is".

Neil,

Seeing that LSF/MM is not expected to gather in the foreseen future, would
you like to submit this as a topic for discussion in LPC Filesystem MC [1]?
I know this is last minute, but we've just extended the CFP deadline
until Sep 15 (MC is on Sep 21), so if you post a proposal, I think we will
be able to fit this session in the final schedule.

Granted, I don't know how many of the stakeholders plan to attend
the LPC Filesystem MC, but at least Josef should be there ;)

I do have one general question about the expected behavior -
In his comment to the LWN article [2], Josef writes:

"The st_dev thing is unfortunate, but again is the result of a lack of
interfaces.
 Very early on we had problems with rsync wandering into snapshots and
 copying loads of stuff. Find as well would get tripped up.
 The way these tools figure out if they've wandered into another file system
 is if the st_dev is different..."

If your plan goes through to export the main btrfs filesystem and
subvolumes as a uniform st_dev namespace to the NFS client,
what's to stop those old issues from remerging on NFS exported btrfs?

IOW, the user experience you are trying to solve is inability of 'find'
to traverse the unified btrfs namespace, but Josef's comment indicates
that some users were explicitly unhappy from 'find' trying to traverse
into subvolumes to begin with.

So is there really a globally expected user experience?
If not, then I really don't see how an nfs export option can be avoided.

Thanks,
Amir.

[1] https://www.linuxplumbersconf.org/event/11/page/104-accepted-microconferences#cont-filesys
[2] https://lwn.net/Articles/867509/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-11 14:12                     ` Amir Goldstein
@ 2021-09-13  0:43                       ` NeilBrown
  2021-09-13 10:04                         ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-09-13  0:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

On Sun, 12 Sep 2021, Amir Goldstein wrote:
> > Maybe what we really need is for a bunch of diverse filesystem
> > developers to get together and agree on some new common interface for
> > subvolume management, including coming up with some sort of definition
> > of what a subvolume "is".
> 
> Neil,
> 
> Seeing that LSF/MM is not expected to gather in the foreseen future, would
> you like to submit this as a topic for discussion in LPC Filesystem MC [1]?
> I know this is last minute, but we've just extended the CFP deadline
> until Sep 15 (MC is on Sep 21), so if you post a proposal, I think we will
> be able to fit this session in the final schedule.

Thanks for the suggestion.  Maybe that is a good idea...  But I don't
personally find face-to-face interactions particularly useful - though
other people obviously do.  I need thinking time after receiving new
ideas, so I can be sure that I understand them properly.  Face-to-face
doesn't allow me that thinking time.

So: no, I won't be proposing anything for LPC.

> 
> Granted, I don't know how many of the stakeholders plan to attend
> the LPC Filesystem MC, but at least Josef should be there ;)
> 
> I do have one general question about the expected behavior -
> In his comment to the LWN article [2], Josef writes:
> 
> "The st_dev thing is unfortunate, but again is the result of a lack of
> interfaces.
>  Very early on we had problems with rsync wandering into snapshots and
>  copying loads of stuff. Find as well would get tripped up.
>  The way these tools figure out if they've wandered into another file system
>  is if the st_dev is different..."
> 
> If your plan goes through to export the main btrfs filesystem and
> subvolumes as a uniform st_dev namespace to the NFS client,
> what's to stop those old issues from remerging on NFS exported btrfs?

That comment from Josef was interesting.... It doesn't align with
Commit 3394e1607eaf ("Btrfs: Give each subvol and snapshot their own anonymous devid")
when Chris Mason introduced the per-subvol device number with the
justification that:
    Each subvolume has its own private inode number space, and so we need
    to fill in different device numbers for each subvolume to avoid confusing
    applications.

But I understand that history can be messy and maybe there were several
justifications of which Josef remembers one and Chris reported
another.

If rsync did, in fact, wander into subvols and didn't get put off by the
duplicate inode numbers (like 'find' does), then it would still do that
when accessing btrfs over NFS.  This has always been the case.  Chris'
"fix" only affected local access, it didn't change NFS access at all.

> 
> IOW, the user experience you are trying to solve is inability of 'find'
> to traverse the unified btrfs namespace, but Josef's comment indicates
> that some users were explicitly unhappy from 'find' trying to traverse
> into subvolumes to begin with.

I believe that even 12 years ago, find would have complained if it saw a
directory with the same inode as an ancestor.  Chris's fix wouldn't
prevent find from entering in that case, because it wouldn't enter
anyway.

> 
> So is there really a globally expected user experience?

No.  Everybody wants what they want.  There is some overlap, not no
guarantees.  That is the unavoidable consequence of ignoring standards
when implementing functionality.

> If not, then I really don't see how an nfs export option can be avoided.

And I really don't see how an nfs export option would help...  Different
people within and organisation and using the same export might have
different expectations.

Thanks,
NeilBrown

> 
> Thanks,
> Amir.
> 
> [1] https://www.linuxplumbersconf.org/event/11/page/104-accepted-microconferences#cont-filesys
> [2] https://lwn.net/Articles/867509/
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-13  0:43                       ` NeilBrown
@ 2021-09-13 10:04                         ` Amir Goldstein
  2021-09-13 22:59                           ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2021-09-13 10:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

> > I do have one general question about the expected behavior -
> > In his comment to the LWN article [2], Josef writes:
> >
> > "The st_dev thing is unfortunate, but again is the result of a lack of
> > interfaces.
> >  Very early on we had problems with rsync wandering into snapshots and
> >  copying loads of stuff. Find as well would get tripped up.
> >  The way these tools figure out if they've wandered into another file system
> >  is if the st_dev is different..."
> >
> > If your plan goes through to export the main btrfs filesystem and
> > subvolumes as a uniform st_dev namespace to the NFS client,
> > what's to stop those old issues from remerging on NFS exported btrfs?
>
> That comment from Josef was interesting.... It doesn't align with
> Commit 3394e1607eaf ("Btrfs: Give each subvol and snapshot their own anonymous devid")
> when Chris Mason introduced the per-subvol device number with the
> justification that:
>     Each subvolume has its own private inode number space, and so we need
>     to fill in different device numbers for each subvolume to avoid confusing
>     applications.
>
> But I understand that history can be messy and maybe there were several
> justifications of which Josef remembers one and Chris reported
> another.
>

I don't see a contradiction between the two reasons.
Reporting two different objects with the same st_ino;st_dev is a problem
and so is rsync wandering into subvolumes is another problem.

Separate st_dev solves the first problem and leaves the behavior
or rsync in the hands of the user (i.e. rsync --one-file-system).

> If rsync did, in fact, wander into subvols and didn't get put off by the
> duplicate inode numbers (like 'find' does), then it would still do that
> when accessing btrfs over NFS.  This has always been the case.  Chris'
> "fix" only affected local access, it didn't change NFS access at all.
>

Right, so the right fix IMO would be to provide similar semantics
to the NFS client, like your first patch set tried to do.

> >
> > IOW, the user experience you are trying to solve is inability of 'find'
> > to traverse the unified btrfs namespace, but Josef's comment indicates
> > that some users were explicitly unhappy from 'find' trying to traverse
> > into subvolumes to begin with.
>
> I believe that even 12 years ago, find would have complained if it saw a
> directory with the same inode as an ancestor.  Chris's fix wouldn't
> prevent find from entering in that case, because it wouldn't enter
> anyway.
>
> >
> > So is there really a globally expected user experience?
>
> No.  Everybody wants what they want.  There is some overlap, not no
> guarantees.  That is the unavoidable consequence of ignoring standards
> when implementing functionality.
>
> > If not, then I really don't see how an nfs export option can be avoided.
>
> And I really don't see how an nfs export option would help...  Different
> people within and organisation and using the same export might have
> different expectations.

That's true.
But if admin decides to export a specific btrfs mount as a non-unified
filesystem, then NFS clients can decide whether ot not to auto-mount the
exported subvolumes and different users on the client machine can decide
if they want to rsync or rsync --one-file-system, just as they would with
local btrfs.

And maybe I am wrong, but I don't see how the decision on whether to
export a non-unified btrfs can be made a btrfs option or a nfsd global
option, that's why I ended up with export option.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-13 10:04                         ` Amir Goldstein
@ 2021-09-13 22:59                           ` NeilBrown
  2021-09-14  5:45                             ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-09-13 22:59 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

On Mon, 13 Sep 2021, Amir Goldstein wrote:
> 
> Right, so the right fix IMO would be to provide similar semantics
> to the NFS client, like your first patch set tried to do.
> 

Like every other approach, this sounds good and sensible ...  until
you examine the details.

For NFSv3 (RFC1813) this would be a protocol violation.
Section 3.3.3 (LOOKUP) says:
  A server will not allow a LOOKUP operation to cross a mountpoint to
  the root of a different filesystem, even if the filesystem is
  exported.

The filesystem is represented by the fsid, so this implies that the fsid
of an object reported by LOOKUP must be the same as the fsid of the
directory used in the LOOKUP.

Linux NFS does allow this restriction to be bypassed with the "crossmnt"
export option.  Maybe if crossmnt were given it would be defensible to
change the fsid - if crossmnt is not given, we leave the current
behaviour.  Note that this is a hack and while it is extremely useful,
it does not produce a seemly experience.  You can get exactly the same
problems with "find" - just not as uniformly (mounting with "-o noac"
makes them uniform).

For NFSv4, we need to provide a "mounted-on" fileid for any mountpoint. 
btrfs doesn't have a mounted-on fileid that can be used.  We can fake
something that might work reasonably well - but it would be fake.  (but
then ... btrfs already provided bogus information in getdents when there
is a subvol root in the directory).

But these are relatively minor.  The bigger problem is /proc/mounts.  If
btrfs maintainers were willing to have every active subvolume appear in
/proc/mounts, then I would be happy to fiddle the NFS fsid and allow
every active NFS/btrfs subvolume to appear in /proc/mounts on the NFS
client.  But they aren't.  So I am not.

> > And I really don't see how an nfs export option would help...  Different
> > people within and organisation and using the same export might have
> > different expectations.
> 
> That's true.
> But if admin decides to export a specific btrfs mount as a non-unified
> filesystem, then NFS clients can decide whether ot not to auto-mount the
> exported subvolumes and different users on the client machine can decide
> if they want to rsync or rsync --one-file-system, just as they would with
> local btrfs.
> 
> And maybe I am wrong, but I don't see how the decision on whether to
> export a non-unified btrfs can be made a btrfs option or a nfsd global
> option, that's why I ended up with export option.

Just because a btrfs option and global nfsd option are bad, that doesn't
mean an export option must be good.  It needs to be presented and
defended on its own merits.

My current opinion (and I must admit I am feeling rather jaded about the
whole thing), is that while btrfs is a very interesting and valuable
experiment in fs design, it contains core mistakes that cannot be
incrementally fixed.  It should be marked as legacy with all current
behaviour declared as intentional and not subject to change.  This would
make way for a new "betrfs" which was designed based on all that we have
learned.  It would use the same code base, but present a more coherent
interface.  Exactly what that interface would be has yet to be decided,
but we would not be bound to maintain anything just because btrfs
supports it.

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-13 22:59                           ` NeilBrown
@ 2021-09-14  5:45                             ` Amir Goldstein
  2021-09-20 22:09                               ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2021-09-14  5:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

On Tue, Sep 14, 2021 at 1:59 AM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 13 Sep 2021, Amir Goldstein wrote:
> >
> > Right, so the right fix IMO would be to provide similar semantics
> > to the NFS client, like your first patch set tried to do.
> >
>
> Like every other approach, this sounds good and sensible ...  until
> you examine the details.
>
> For NFSv3 (RFC1813) this would be a protocol violation.
> Section 3.3.3 (LOOKUP) says:
>   A server will not allow a LOOKUP operation to cross a mountpoint to
>   the root of a different filesystem, even if the filesystem is
>   exported.
>
> The filesystem is represented by the fsid, so this implies that the fsid
> of an object reported by LOOKUP must be the same as the fsid of the
> directory used in the LOOKUP.
>
> Linux NFS does allow this restriction to be bypassed with the "crossmnt"
> export option.  Maybe if crossmnt were given it would be defensible to
> change the fsid - if crossmnt is not given, we leave the current
> behaviour.  Note that this is a hack and while it is extremely useful,
> it does not produce a seemly experience.  You can get exactly the same
> problems with "find" - just not as uniformly (mounting with "-o noac"
> makes them uniform).
>

I don't understand why we would need to talk about NFSv3.
This btrfs export issue has been with us for a while.
I see no reason to address it for old protocols if we can address
it with a new protocol with better support for the concept of fsid hierarchy.

> For NFSv4, we need to provide a "mounted-on" fileid for any mountpoint.
> btrfs doesn't have a mounted-on fileid that can be used.  We can fake
> something that might work reasonably well - but it would be fake.  (but
> then ... btrfs already provided bogus information in getdents when there
> is a subvol root in the directory).
>

That seems easy to solve by passing some flag to ->encode_fh()
or if the behavior is persistent in btrfs by some mkfs/module/mount option
then btrfs_encode_fh() will always encode the subvol root inode as
resident of the parent tree-id, because nfsd anyway does not ->encode_fh()
for export roots, right?

> But these are relatively minor.  The bigger problem is /proc/mounts.  If
> btrfs maintainers were willing to have every active subvolume appear in
> /proc/mounts, then I would be happy to fiddle the NFS fsid and allow
> every active NFS/btrfs subvolume to appear in /proc/mounts on the NFS
> client.  But they aren't.  So I am not.
>

I don't understand why you need to tie the two together.
I would suggest:
1. Export different fsid's per subvols to NFSv4 based on statx()
exported tree-id
2. NFS client side uses user configuration to determine which subvols
to auto-mount
3. [optional] Provide a way to configure btrfs using mkfs/module/mount option
    to behave locally the same as the NFS client, which will allow
user configuration
    to determine with subvols to auto-mount locally

I admit that my understanding of the full picture is limited, but I don't
understand why #3 is a strict dependency for #1 and #2.

> > > And I really don't see how an nfs export option would help...  Different
> > > people within and organisation and using the same export might have
> > > different expectations.
> >
> > That's true.
> > But if admin decides to export a specific btrfs mount as a non-unified
> > filesystem, then NFS clients can decide whether ot not to auto-mount the
> > exported subvolumes and different users on the client machine can decide
> > if they want to rsync or rsync --one-file-system, just as they would with
> > local btrfs.
> >
> > And maybe I am wrong, but I don't see how the decision on whether to
> > export a non-unified btrfs can be made a btrfs option or a nfsd global
> > option, that's why I ended up with export option.
>
> Just because a btrfs option and global nfsd option are bad, that doesn't
> mean an export option must be good.  It needs to be presented and
> defended on its own merits.
>
> My current opinion (and I must admit I am feeling rather jaded about the
> whole thing), is that while btrfs is a very interesting and valuable
> experiment in fs design, it contains core mistakes that cannot be
> incrementally fixed.  It should be marked as legacy with all current
> behaviour declared as intentional and not subject to change.  This would
> make way for a new "betrfs" which was designed based on all that we have
> learned.  It would use the same code base, but present a more coherent
> interface.  Exactly what that interface would be has yet to be decided,
> but we would not be bound to maintain anything just because btrfs
> supports it.
>

There is no need for a new driver name (like ext3=>ext4)
Both ext4 and xfs have features that can be determined in mkfs time.
This user experience change does not involve on-disk format changes,
so it is a much easier case, because at least technically, there is nothing
preventing an administrator from turning the user experience feature
on and off with proper care of the consequences.

Which brings me to another point.
This discussion presents several technical challenges and you have
been very creative in presenting technical solutions, but I think that
the nature of the problem is more on the administrative side.

I see this as an unfortunate flaw in our design process, when
filesystem developers have long discussions about issues where
some of the material stakeholders (i.e. administrators) are not in the loop.
But I do not have very good ideas on how to address this flaw.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-09-14  5:45                             ` Amir Goldstein
@ 2021-09-20 22:09                               ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2021-09-20 22:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever,
	Linux NFS Mailing List, Josef Bacik, linux-fsdevel, Theodore Tso,
	Jan Kara

On Tue, 14 Sep 2021, Amir Goldstein wrote:
> On Tue, Sep 14, 2021 at 1:59 AM NeilBrown <neilb@suse.de> wrote:
> >
> > On Mon, 13 Sep 2021, Amir Goldstein wrote:
> > >
> > > Right, so the right fix IMO would be to provide similar semantics
> > > to the NFS client, like your first patch set tried to do.
> > >
> >
> > Like every other approach, this sounds good and sensible ...  until
> > you examine the details.
> >
> > For NFSv3 (RFC1813) this would be a protocol violation.
> > Section 3.3.3 (LOOKUP) says:
> >   A server will not allow a LOOKUP operation to cross a mountpoint to
> >   the root of a different filesystem, even if the filesystem is
> >   exported.
> >
> > The filesystem is represented by the fsid, so this implies that the fsid
> > of an object reported by LOOKUP must be the same as the fsid of the
> > directory used in the LOOKUP.
> >
> > Linux NFS does allow this restriction to be bypassed with the "crossmnt"
> > export option.  Maybe if crossmnt were given it would be defensible to
> > change the fsid - if crossmnt is not given, we leave the current
> > behaviour.  Note that this is a hack and while it is extremely useful,
> > it does not produce a seemly experience.  You can get exactly the same
> > problems with "find" - just not as uniformly (mounting with "-o noac"
> > makes them uniform).
> >
> 
> I don't understand why we would need to talk about NFSv3.
> This btrfs export issue has been with us for a while.
> I see no reason to address it for old protocols if we can address
> it with a new protocol with better support for the concept of fsid hierarchy.
> 
> > For NFSv4, we need to provide a "mounted-on" fileid for any mountpoint.
> > btrfs doesn't have a mounted-on fileid that can be used.  We can fake
> > something that might work reasonably well - but it would be fake.  (but
> > then ... btrfs already provided bogus information in getdents when there
> > is a subvol root in the directory).
> >
> 
> That seems easy to solve by passing some flag to ->encode_fh()
> or if the behavior is persistent in btrfs by some mkfs/module/mount option
> then btrfs_encode_fh() will always encode the subvol root inode as
> resident of the parent tree-id, because nfsd anyway does not ->encode_fh()
> for export roots, right?

->encode_fh has nothing to do with getting the mounted-on fileid.
With a normal mount point, there are two inodes, one in each vfsmount.
We can call ->getattr to get kstat info including the inode number.
nfsd does that for the underlying vfsmnt/inode to get the mounted-on
fileid.  What should it do for btrfs "subvols"?

> 
> > But these are relatively minor.  The bigger problem is /proc/mounts.  If
> > btrfs maintainers were willing to have every active subvolume appear in
> > /proc/mounts, then I would be happy to fiddle the NFS fsid and allow
> > every active NFS/btrfs subvolume to appear in /proc/mounts on the NFS
> > client.  But they aren't.  So I am not.
> >
> 
> I don't understand why you need to tie the two together.

Because they are the same thing.
The most concrete reason is that any name that appears in /proc/mounts is
public.  People understand that when they mount filesystems.  People
don't need to understand that when creating private subvols.
There is anecdotal evidence that people might expect subvol paths to be
private.  If they then access those subvols via NFS, the names suddenly
become public.

> I would suggest:
> 1. Export different fsid's per subvols to NFSv4 based on statx()
> exported tree-id
> 2. NFS client side uses user configuration to determine which subvols
> to auto-mount

That is a non-started.  Subvols currently don't need mounting, they
transparently appear.  Requiring client-side configuration would be a
major cost for some users.

> 3. [optional] Provide a way to configure btrfs using mkfs/module/mount option
>     to behave locally the same as the NFS client, which will allow
> user configuration
>     to determine with subvols to auto-mount locally
> 
> I admit that my understanding of the full picture is limited, but I don't
> understand why #3 is a strict dependency for #1 and #2.
> 
> > > > And I really don't see how an nfs export option would help...  Different
> > > > people within and organisation and using the same export might have
> > > > different expectations.
> > >
> > > That's true.
> > > But if admin decides to export a specific btrfs mount as a non-unified
> > > filesystem, then NFS clients can decide whether ot not to auto-mount the
> > > exported subvolumes and different users on the client machine can decide
> > > if they want to rsync or rsync --one-file-system, just as they would with
> > > local btrfs.
> > >
> > > And maybe I am wrong, but I don't see how the decision on whether to
> > > export a non-unified btrfs can be made a btrfs option or a nfsd global
> > > option, that's why I ended up with export option.
> >
> > Just because a btrfs option and global nfsd option are bad, that doesn't
> > mean an export option must be good.  It needs to be presented and
> > defended on its own merits.
> >
> > My current opinion (and I must admit I am feeling rather jaded about the
> > whole thing), is that while btrfs is a very interesting and valuable
> > experiment in fs design, it contains core mistakes that cannot be
> > incrementally fixed.  It should be marked as legacy with all current
> > behaviour declared as intentional and not subject to change.  This would
> > make way for a new "betrfs" which was designed based on all that we have
> > learned.  It would use the same code base, but present a more coherent
> > interface.  Exactly what that interface would be has yet to be decided,
> > but we would not be bound to maintain anything just because btrfs
> > supports it.
> >
> 
> There is no need for a new driver name (like ext3=>ext4)
> Both ext4 and xfs have features that can be determined in mkfs time.
> This user experience change does not involve on-disk format changes,
> so it is a much easier case, because at least technically, there is nothing
> preventing an administrator from turning the user experience feature
> on and off with proper care of the consequences.
> 
> Which brings me to another point.
> This discussion presents several technical challenges and you have
> been very creative in presenting technical solutions, but I think that
> the nature of the problem is more on the administrative side.
> 
> I see this as an unfortunate flaw in our design process, when
> filesystem developers have long discussions about issues where
> some of the material stakeholders (i.e. administrators) are not in the loop.
> But I do not have very good ideas on how to address this flaw.

I agree this is more than a technical question.  I don't see it as
particularly an admin issue, because non-admin users can create subvols.

I see it as a conceptual problem.  What is a "subvol"? What do we want
it to be.  Does it make sense for the subvol namespace to align with the
filesystem namespace?
Subvols are more than directories, but less than filesystems.  How can
be best characterise them and thing about them? Are they directories
with extra features, or filesystems with some limitation (and some extra
features)?  Or are they something completely new?
What sort of identity information do applications *need* about files an
filesystems and how can we best provide that within the context of
existing APIs?

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
@ 2021-08-23  8:17           ` kernel test robot
  0 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2021-08-23  8:17 UTC (permalink / raw)
  To: NeilBrown, Chris Mason, David Sterba, Christoph Hellwig,
	Josef Bacik, J. Bruce Fields, Chuck Lever
  Cc: clang-built-linux, kbuild-all, Roman Mamedov,
	Goffredo Baroncelli, Alexander Viro, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5472 bytes --]

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on nfs/linux-next]
[also build test ERROR on hch-configfs/for-next linus/master v5.14-rc7 next-20210820]
[cannot apply to kdave/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/BTRFS-NFSD-provide-more-unique-inode-number-for-btrfs-export/20210823-120718
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
config: hexagon-randconfig-r045-20210822 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 79b55e5038324e61a3abf4e6a9a949c473edd858)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e99ff00e4055532e35c592b50809761d82f87595
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/BTRFS-NFSD-provide-more-unique-inode-number-for-btrfs-export/20210823-120718
        git checkout e99ff00e4055532e35c592b50809761d82f87595
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
                   if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
                                                            ^
>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
>> fs/nfsd/nfsfh.c:593:44: error: use of undeclared identifier 'BTRFS_SUPER_MAGIC'
   3 errors generated.


vim +/BTRFS_SUPER_MAGIC +593 fs/nfsd/nfsfh.c

   557	
   558	__be32
   559	fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
   560		   struct svc_fh *ref_fh)
   561	{
   562		/* ref_fh is a reference file handle.
   563		 * if it is non-null and for the same filesystem, then we should compose
   564		 * a filehandle which is of the same version, where possible.
   565		 * Currently, that means that if ref_fh->fh_handle.fh_version == 0xca
   566		 * Then create a 32byte filehandle using nfs_fhbase_old
   567		 *
   568		 */
   569	
   570		struct inode * inode = d_inode(dentry);
   571		dev_t ex_dev = exp_sb(exp)->s_dev;
   572		u8 options = 0;
   573	
   574		dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
   575			MAJOR(ex_dev), MINOR(ex_dev),
   576			(long) d_inode(exp->ex_path.dentry)->i_ino,
   577			dentry,
   578			(inode ? inode->i_ino : 0));
   579	
   580		/* Choose filehandle version and fsid type based on
   581		 * the reference filehandle (if it is in the same export)
   582		 * or the export options.
   583		 */
   584		set_version_and_fsid_type(fhp, exp, ref_fh);
   585	
   586		/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
   587		fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
   588	
   589		if (ref_fh && ref_fh->fh_export == exp) {
   590			options = ref_fh->fh_handle.fh_options;
   591		} else {
   592			/* Set options as needed */
 > 593			if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
   594				options |= NFSD_FH_OPTION_INO_UNIQUIFY;
   595		}
   596	
   597		if (ref_fh == fhp)
   598			fh_put(ref_fh);
   599	
   600		if (fhp->fh_locked || fhp->fh_dentry) {
   601			printk(KERN_ERR "fh_compose: fh %pd2 not initialized!\n",
   602			       dentry);
   603		}
   604		if (fhp->fh_maxsize < NFS_FHSIZE)
   605			printk(KERN_ERR "fh_compose: called with maxsize %d! %pd2\n",
   606			       fhp->fh_maxsize,
   607			       dentry);
   608	
   609		fhp->fh_dentry = dget(dentry); /* our internal copy */
   610		fhp->fh_export = exp_get(exp);
   611	
   612		if (fhp->fh_handle.fh_version == 0xca) {
   613			/* old style filehandle please */
   614			memset(&fhp->fh_handle.fh_base, 0, NFS_FHSIZE);
   615			fhp->fh_handle.fh_size = NFS_FHSIZE;
   616			fhp->fh_handle.ofh_dcookie = 0xfeebbaca;
   617			fhp->fh_handle.ofh_dev =  old_encode_dev(ex_dev);
   618			fhp->fh_handle.ofh_xdev = fhp->fh_handle.ofh_dev;
   619			fhp->fh_handle.ofh_xino =
   620				ino_t_to_u32(d_inode(exp->ex_path.dentry)->i_ino);
   621			fhp->fh_handle.ofh_dirino = ino_t_to_u32(parent_ino(dentry));
   622			if (inode)
   623				_fh_update_old(dentry, exp, &fhp->fh_handle);
   624		} else {
   625			fhp->fh_handle.fh_size =
   626				key_len(fhp->fh_handle.fh_fsid_type) + 4;
   627			fhp->fh_handle.fh_options = options;
   628	
   629			mk_fsid(fhp->fh_handle.fh_fsid_type,
   630				fhp->fh_handle.fh_fsid,
   631				ex_dev,
   632				d_inode(exp->ex_path.dentry)->i_ino,
   633				exp->ex_fsid, exp->ex_uuid);
   634	
   635			if (inode)
   636				_fh_update(fhp, exp, dentry);
   637			if (fhp->fh_handle.fh_fileid_type == FILEID_INVALID) {
   638				fh_put(fhp);
   639				return nfserr_opnotsupp;
   640			}
   641		}
   642	
   643		return 0;
   644	}
   645	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29470 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
  2021-08-15 22:17       ` NeilBrown
@ 2021-08-23  4:05         ` NeilBrown
  2021-08-23  8:17           ` kernel test robot
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2021-08-23  4:05 UTC (permalink / raw)
  To: Chris Mason, David Sterba, Christoph Hellwig, Josef Bacik,
	J. Bruce Fields, Chuck Lever
  Cc: Roman Mamedov, Goffredo Baroncelli, Alexander Viro,
	linux-fsdevel, linux-nfs, linux-btrfs


BTRFS does not provide unique inode numbers across a filesystem.
It only provide unique inode numbers within a subvolume and
uses synthetic device numbers for different subvolumes to ensure
uniqueness for device+inode.

nfsd cannot use these varying synthetic device numbers.  If nfsd were to
synthesise different stable filesystem ids to give to the client, that
would cause subvolumes to appear in the mount table on the client, even
though they don't appear in the mount table on the server.  Also, NFSv3
doesn't support changing the filesystem id without a new explicit mount
on the client (this is partially supported in practice, but violates the
protocol specification and has problems in some edge cases).

So currently, the roots of all subvolumes report the same inode number
in the same filesystem to NFS clients and tools like 'find' notice that
a directory has the same identity as an ancestor, and so refuse to
enter that directory.

This patch allows btrfs (or any filesystem) to provide a 64bit number
that can be xored with the inode number to make the number more unique.
Rather than the client being certain to see duplicates, with this patch
it is possible but extremely rare.

The number that btrfs provides is a swab64() version of the subvolume
identifier.  This has most entropy in the high bits (the low bits of the
subvolume identifer), while the inode has most entropy in the low bits.
The result will always be unique within a subvolume, and will almost
always be unique across the filesystem.

If an upgrade of the NFS server caused all inode numbers in an exportfs
BTRFS filesystem to appear to the client to change, the client may not
handle this well.  The Linux client will cause any open files to become
'stale'.  If the mount point changed inode number, the whole mount would
become inaccessible.

To avoid this, an unused byte in the filehandle (fh_auth) has been
repurposed as "fh_options".  (The use of #defines make fh_flags a
problematic choice).  The new behaviour of uniquifying inode number is
only activated when this bit is set.

NFSD will only set this bit in filehandles it reports if the filehandle
of the parent (provided by the client) contains the bit, or if
 - the filehandle for the parent is not provided or is for a different
   export and
 - the filehandle refers to a BTRFS filesystem.

Thus if you have a BTRFS filesystem originally mounted from a server
without this patch, the flag will never be set and the current behaviour
will continue.  Only once you re-mount the filesystem (or the filesystem
is re-auto-mounted) will the inode numbers change.  When that happens,
it is likely that the filesystem st_dev number seen on the client will
change anyway.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/inode.c                |  4 ++++
 fs/nfsd/nfs3xdr.c               | 15 ++++++++++++++-
 fs/nfsd/nfs4xdr.c               |  7 ++++---
 fs/nfsd/nfsfh.c                 | 13 +++++++++++--
 fs/nfsd/nfsfh.h                 | 22 ++++++++++++++++++++++
 fs/nfsd/xdr3.h                  |  2 ++
 include/linux/stat.h            | 18 ++++++++++++++++++
 include/uapi/linux/nfsd/nfsfh.h | 18 ++++++++++++------
 8 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0117d867ecf8..989fdf2032d5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
 
+	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
+		stat->ino_uniquifier =
+			swab64(BTRFS_I(inode)->root->root_key.objectid);
+
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
 	inode_bytes = inode_get_bytes(inode);
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..19d14f11f79a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 {
 	struct user_namespace *userns = nfsd_user_namespace(rqstp);
 	__be32 *p;
+	u64 ino;
 	u64 fsid;
 
 	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
@@ -377,7 +378,8 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 	p = xdr_encode_hyper(p, fsid);
 
 	/* fileid */
-	p = xdr_encode_hyper(p, stat->ino);
+	ino = nfsd_uniquify_ino(fhp, stat);
+	p = xdr_encode_hyper(p, ino);
 
 	p = encode_nfstime3(p, &stat->atime);
 	p = encode_nfstime3(p, &stat->mtime);
@@ -1151,6 +1153,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
 	if (xdr_stream_encode_item_present(xdr) < 0)
 		return false;
 	/* fileid */
+	if (!resp->dir_have_uniquifier) {
+		struct kstat stat;
+		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
+			resp->dir_ino_uniquifier =
+				nfsd_ino_uniquifier(&resp->fh, &stat);
+		else
+			resp->dir_ino_uniquifier = 0;
+		resp->dir_have_uniquifier = true;
+	}
+	if (resp->dir_ino_uniquifier != ino)
+		ino ^= resp->dir_ino_uniquifier;
 	if (xdr_stream_encode_u64(xdr, ino) < 0)
 		return false;
 	/* name */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5ed894ceebb0 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3114,10 +3114,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 					fhp->fh_handle.fh_size);
 	}
 	if (bmval0 & FATTR4_WORD0_FILEID) {
+		u64 ino = nfsd_uniquify_ino(fhp, &stat);
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
 			goto out_resource;
-		p = xdr_encode_hyper(p, stat.ino);
+		p = xdr_encode_hyper(p, ino);
 	}
 	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
 		p = xdr_reserve_space(xdr, 8);
@@ -3274,7 +3275,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
-                	goto out_resource;
+			goto out_resource;
 		/*
 		 * Get parent's attributes if not ignoring crossmount
 		 * and this is the root of a cross-mounted filesystem.
@@ -3284,7 +3285,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			err = get_parent_attributes(exp, &parent_stat);
 			if (err)
 				goto out_nfserr;
-			ino = parent_stat.ino;
+			ino = nfsd_uniquify_ino(fhp, &parent_stat);
 		}
 		p = xdr_encode_hyper(p, ino);
 	}
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c475d2271f9c..e97ed957a379 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -172,7 +172,7 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
 
 		if (--data_left < 0)
 			return error;
-		if (fh->fh_auth_type != 0)
+		if ((fh->fh_options & ~NFSD_FH_OPTION_ALL) != 0)
 			return error;
 		len = key_len(fh->fh_fsid_type) / 4;
 		if (len == 0)
@@ -569,6 +569,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 
 	struct inode * inode = d_inode(dentry);
 	dev_t ex_dev = exp_sb(exp)->s_dev;
+	u8 options = 0;
 
 	dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
 		MAJOR(ex_dev), MINOR(ex_dev),
@@ -585,6 +586,14 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	/* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
 	fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
 
+	if (ref_fh && ref_fh->fh_export == exp) {
+		options = ref_fh->fh_handle.fh_options;
+	} else {
+		/* Set options as needed */
+		if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
+			options |= NFSD_FH_OPTION_INO_UNIQUIFY;
+	}
+
 	if (ref_fh == fhp)
 		fh_put(ref_fh);
 
@@ -615,7 +624,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
 	} else {
 		fhp->fh_handle.fh_size =
 			key_len(fhp->fh_handle.fh_fsid_type) + 4;
-		fhp->fh_handle.fh_auth_type = 0;
+		fhp->fh_handle.fh_options = options;
 
 		mk_fsid(fhp->fh_handle.fh_fsid_type,
 			fhp->fh_handle.fh_fsid,
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 6106697adc04..1144a98c2951 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -84,6 +84,28 @@ enum fsid_source {
 };
 extern enum fsid_source fsid_source(const struct svc_fh *fhp);
 
+enum nfsd_fh_options {
+	NFSD_FH_OPTION_INO_UNIQUIFY = 1,	/* BTRFS only */
+
+	NFSD_FH_OPTION_ALL = 1
+};
+
+static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp,
+				      const struct kstat *stat)
+{
+	if (fhp->fh_handle.fh_options & NFSD_FH_OPTION_INO_UNIQUIFY)
+		return stat->ino_uniquifier;
+	return 0;
+}
+
+static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp,
+				    const struct kstat *stat)
+{
+	u64 u = nfsd_ino_uniquifier(fhp, stat);
+	if (u != stat->ino)
+		return stat->ino ^ u;
+	return stat->ino;
+}
 
 /*
  * This might look a little large to "inline" but in all calls except
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 933008382bbe..d9b6c8314bbb 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -179,6 +179,8 @@ struct nfsd3_readdirres {
 	struct xdr_buf		dirlist;
 	struct svc_fh		scratch;
 	struct readdir_cd	common;
+	u64			dir_ino_uniquifier;
+	bool			dir_have_uniquifier;
 	unsigned int		cookie_offset;
 	struct svc_rqst *	rqstp;
 
diff --git a/include/linux/stat.h b/include/linux/stat.h
index fff27e603814..0f3f74d302f8 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,24 @@ struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	/*
+	 * BTRFS does not provide unique inode numbers within a filesystem,
+	 * depending on a synthetic 'dev' to provide uniqueness.
+	 * NFSd cannot make use of this 'dev' number so clients often see
+	 * duplicate inode numbers.
+	 * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem
+	 * has created a great many inodes.
+	 * It puts another number in ino_uniquifier which:
+	 * - has most entropy in the high bits
+	 * - is different precisely when 'dev' is different
+	 * - is stable across unmount/remount
+	 * NFSd can xor this with 'ino' to get a substantially more unique
+	 * number for reporting to the client.
+	 * The ino_uniquifier for a directory can reasonably be applied
+	 * to inode numbers reported by the readdir filldir callback.
+	 * It is NOT currently exported to user-space.
+	 */
+	u64		ino_uniquifier;
 };
 
 #endif
diff --git a/include/uapi/linux/nfsd/nfsfh.h b/include/uapi/linux/nfsd/nfsfh.h
index 427294dd56a1..59311df4b476 100644
--- a/include/uapi/linux/nfsd/nfsfh.h
+++ b/include/uapi/linux/nfsd/nfsfh.h
@@ -38,11 +38,17 @@ struct nfs_fhbase_old {
  * The file handle starts with a sequence of four-byte words.
  * The first word contains a version number (1) and three descriptor bytes
  * that tell how the remaining 3 variable length fields should be handled.
- * These three bytes are auth_type, fsid_type and fileid_type.
+ * These three bytes are options, fsid_type and fileid_type.
  *
  * All four-byte values are in host-byte-order.
  *
- * The auth_type field is deprecated and must be set to 0.
+ * The options field (previously auth_type) can be used when nfsd behaviour
+ * needs to change in a non-compatible way, usually for some specific
+ * filesystem.  Options should only be set in filehandles for filesystems which
+ * need them.
+ * Current values:
+ *   1  -  BTRFS only.  Cause stat->ino_uniquifier to be used to improve inode
+ *         number uniqueness.
  *
  * The fsid_type identifies how the filesystem (or export point) is
  *    encoded.
@@ -67,7 +73,7 @@ struct nfs_fhbase_new {
 	union {
 		struct {
 			__u8		fb_version_aux;	/* == 1, even => nfs_fhbase_old */
-			__u8		fb_auth_type_aux;
+			__u8		fb_options_aux;
 			__u8		fb_fsid_type_aux;
 			__u8		fb_fileid_type_aux;
 			__u32		fb_auth[1];
@@ -76,7 +82,7 @@ struct nfs_fhbase_new {
 		};
 		struct {
 			__u8		fb_version;	/* == 1, even => nfs_fhbase_old */
-			__u8		fb_auth_type;
+			__u8		fb_options;
 			__u8		fb_fsid_type;
 			__u8		fb_fileid_type;
 			__u32		fb_auth_flex[]; /* flexible-array member */
@@ -106,11 +112,11 @@ struct knfsd_fh {
 
 #define	fh_version		fh_base.fh_new.fb_version
 #define	fh_fsid_type		fh_base.fh_new.fb_fsid_type
-#define	fh_auth_type		fh_base.fh_new.fb_auth_type
+#define	fh_options		fh_base.fh_new.fb_options
 #define	fh_fileid_type		fh_base.fh_new.fb_fileid_type
 #define	fh_fsid			fh_base.fh_new.fb_auth_flex
 
 /* Do not use, provided for userspace compatiblity. */
-#define	fh_auth			fh_base.fh_new.fb_auth
+#define	fh_auth			fh_base.fh_new.fb_options
 
 #endif /* _UAPI_LINUX_NFSD_FH_H */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-20 22:11 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <162995209561.7591.4202079352301963089@noble.neil.brown.name>
     [not found] ` <162995778427.7591.11743795294299207756@noble.neil.brown.name>
     [not found]   ` <YSkQ31UTVDtBavOO@infradead.org>
     [not found]     ` <163010550851.7591.9342822614202739406@noble.neil.brown.name>
     [not found]       ` <YSnhHl0HDOgg07U5@infradead.org>
     [not found]         ` <163038594541.7591.11109978693705593957@noble.neil.brown.name>
2021-09-01  7:20           ` [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export Christoph Hellwig
2021-09-01 15:22             ` J. Bruce Fields
2021-09-02  4:14               ` NeilBrown
2021-09-05 16:07                 ` J. Bruce Fields
2021-09-06  1:29                   ` NeilBrown
2021-09-11 14:12                     ` Amir Goldstein
2021-09-13  0:43                       ` NeilBrown
2021-09-13 10:04                         ` Amir Goldstein
2021-09-13 22:59                           ` NeilBrown
2021-09-14  5:45                             ` Amir Goldstein
2021-09-20 22:09                               ` NeilBrown
2021-09-02  7:11               ` Christoph Hellwig
2021-09-02  4:06             ` NeilBrown
2021-09-02  7:16               ` Christoph Hellwig
2021-09-02  7:53                 ` Miklos Szeredi
2021-09-02 14:16                   ` Frank Filz
2021-09-02 23:02                     ` NeilBrown
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-08-13  1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-15  7:39   ` Goffredo Baroncelli
2021-08-15 19:35     ` Roman Mamedov
2021-08-15 22:17       ` NeilBrown
2021-08-23  4:05         ` [PATCH v2] BTRFS/NFSD: " NeilBrown
2021-08-23  8:17           ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).