* How to cope with subvolumes and snapshots on multi-user systems?
@ 2023-11-28  7:49 Donald Buczek
  2023-11-29 21:43 ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-11-28  7:49 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Stefan Krueger

Hi,

I'm happy bcachefs finally made it to mainline. Kudos to Kent!

As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.

The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:

     buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
     buczek@dose:/scratch/local3$ mkdir vol1/dir1
     buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
     buczek@dose:/scratch/local3$ ls -li vol1/
     total 0
     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
     1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
     buczek@dose:/scratch/local3$ ls -li vol1/snp1/
     total 0
     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
     buczek@dose:/scratch/local3$ find .
     .
     ./vol1
     find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
     ./vol1/dir1
     buczek@dose:/scratch/local3$

We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
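
To illustrate the failure mode, here is a minimal sketch (in C, not our actual tooling) of the (st_dev, st_ino) bookkeeping such tools typically do; with the snapshot above, vol1/dir1 and vol1/snp1/dir1 would wrongly be treated as hardlinks of a single file:

    /* Minimal sketch, not our actual code: the classic hardlink
     * detection that non-unique inode numbers break.  A walker
     * remembers (st_dev, st_ino) pairs and treats a repeated pair
     * as a hardlink of an already-visited file. */
    #include <sys/stat.h>
    #include <stdbool.h>

    struct seen { dev_t dev; ino_t ino; };

    static bool already_seen(const struct seen *tab, int n,
                             const struct stat *st)
    {
            for (int i = 0; i < n; i++)
                    if (tab[i].dev == st->st_dev && tab[i].ino == st->st_ino)
                            return true;    /* assumed hardlink - skipped */
            return false;
    }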

Questions:

- Would it be a workaround to make creation of subvolumes and snapshots privileged operations?

- If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?

Thanks

   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-28  7:49 How to cope with subvolumes and snapshots on multi-user systems? Donald Buczek
@ 2023-11-29 21:43 ` Kent Overstreet
  2023-11-30  7:35   ` Donald Buczek
  2023-11-30 20:36   ` NeilBrown
  0 siblings, 2 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-11-29 21:43 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-bcachefs, Stefan Krueger, Neil Brown

On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> Hi,
> 
> I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> 
> As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
> 
> The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
> 
>     buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
>     buczek@dose:/scratch/local3$ mkdir vol1/dir1
>     buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
>     buczek@dose:/scratch/local3$ ls -li vol1/
>     total 0
>     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>     1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
>     buczek@dose:/scratch/local3$ ls -li vol1/snp1/
>     total 0
>     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>     buczek@dose:/scratch/local3$ find .
>     .
>     ./vol1
>     find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
>     ./vol1/dir1
>     buczek@dose:/scratch/local3$
> 
> We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
> 
> Questions:
> 
> - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
> 
> - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?

I'm considering writing an "integer identifiers considered harmful"
paper. I haven't dug into it yet - just skimmed the lwn coverage - but
adding in the overlayfs issue makes this problem sound fundamentally
unsolvable given the current constraints.

Recognizing subvolume boundaries: we don't have anything for this yet,
and given what happened with btrfs I'm not sure we want to go that
route.

We could (probably) expose unique inode numbers, filesystem wide, but at
a cost. We'd have to restrict ourselves internally to 32 bit inode
numbers, and then report inode numbers that include the subvolume ID as
the high 32 bits.
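
Just to sketch the encoding (the idea only, nothing like this is
implemented):

    #include <stdint.h>

    /* Sketch: internal inode numbers restricted to 32 bits, subvolume
     * ID reported in the high half of the userspace-visible number. */
    static uint64_t report_inum(uint32_t subvol, uint32_t inum)
    {
            return ((uint64_t) subvol << 32) | inum;
    }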

That won't leave enough bits to shard inode numbers at create time based
on CPU id, which isn't the _most_ important optimization, but would hurt
to lose.

Going forward, we really need - as you allude to - better userspace
APIs. Inode number probably can't be an integer anymore, it needs to be
a string if it's going to be able to do all the things we want it to do.


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-29 21:43 ` Kent Overstreet
@ 2023-11-30  7:35   ` Donald Buczek
  2023-11-30  7:39     ` Kent Overstreet
  2023-11-30 20:36   ` NeilBrown
  1 sibling, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-11-30  7:35 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, Stefan Krueger, Neil Brown

On 11/29/23 22:43, Kent Overstreet wrote:
> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
>> Hi,
>>
>> I'm happy bcachefs finally made it to mainline. Kudos to Kent!
>>
>> As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
>>
>> The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
>>
>>      buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
>>      buczek@dose:/scratch/local3$ mkdir vol1/dir1
>>      buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
>>      buczek@dose:/scratch/local3$ ls -li vol1/
>>      total 0
>>      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>>      1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
>>      buczek@dose:/scratch/local3$ ls -li vol1/snp1/
>>      total 0
>>      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>>      buczek@dose:/scratch/local3$ find .
>>      .
>>      ./vol1
>>      find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
>>      ./vol1/dir1
>>      buczek@dose:/scratch/local3$
>>
>> We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
>>
>> Questions:
>>
>> - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
>>
>> - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
> 
> I'm considering writing an "integer identifiers considered harmful"
> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> adding in the overlayfs issue makes this problem sound fundamentally
> unsolvable given the current constraints.
> 
> Recognizing subvolume boundaries: we don't have anything for this yet,
> and given what happened with btrfs I'm not sure we want to go that
> route.

A fabricated fsid - I think that is what btrfs does - wouldn't help us either. The mentioned tools would just stop at the assumed mount points.

> We could (probably) expose unique inode numbers, filesystem wide, but at
> a cost. We'd have to restrict ourselves internally to 32 bit inode
> numbers, and then report inode numbers that include the subvolume ID as
> the high 32 bits.
> 
> That won't leave enough bits to shard inode numbers at create time based
> on CPU id, which isn't the _most_ important optimization, but would hurt
> to lose.
> 
> Going forward, we really need - as you allude to - better userspace
> APIs. Inode number probably can't be an integer anymore, it needs to be
> a string if it's going to be able to do all the things we want it to do.

Should be a uuid then. Expose inode uuid (and/or snapshot uuid) via xattr? Anyway, I know people more competent than me are thinking about the problem.

What do you think about the suggestion to make creation of snapshots (and possibly volumes, not sure about that) a privileged operation (optional, of course)? I know multi-user is niche nowadays, but to me it looks like a rather cheap workaround.

Thanks

   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-30  7:35   ` Donald Buczek
@ 2023-11-30  7:39     ` Kent Overstreet
  2023-11-30 20:37       ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-11-30  7:39 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-bcachefs, Stefan Krueger, Neil Brown

On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> On 11/29/23 22:43, Kent Overstreet wrote:
> > On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > > Hi,
> > > 
> > > I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> > > 
> > > As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
> > > 
> > > The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
> > > 
> > >      buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> > >      buczek@dose:/scratch/local3$ mkdir vol1/dir1
> > >      buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> > >      buczek@dose:/scratch/local3$ ls -li vol1/
> > >      total 0
> > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > >      1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> > >      buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> > >      total 0
> > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > >      buczek@dose:/scratch/local3$ find .
> > >      .
> > >      ./vol1
> > >      find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> > >      ./vol1/dir1
> > >      buczek@dose:/scratch/local3$
> > > 
> > > We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
> > > 
> > > Questions:
> > > 
> > > - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
> > > 
> > > - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
> > 
> > I'm considering writing an "integer identifiers considered harmful"
> > paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> > adding in the overlayfs issue makes this problem sound fundamentally
> > unsolvable given the current constraints.
> > 
> > Recognizing subvolume boundaries: we don't have anything for this yet,
> > and given what happened with btrfs I'm not sure we want to go that
> > route.
> 
> A fabricated fsid - I think that is what btrfs does - wouldn't help
> us either. The mentioned tools would just stop at the assumed mount
> points.

*nod* - nor for NFS.

> > We could (probably) expose unique inode numbers, filesystem wide, but at
> > a cost. We'd have to restrict ourselves internally to 32 bit inode
> > numbers, and then report inode numbers that include the subvolume ID as
> > the high 32 bits.
> > 
> > That won't leave enough bits to shard inode numbers at create time based
> > on CPU id, which isn't the _most_ important optimization, but would hurt
> > to lose.
> > 
> > Going forward, we really need - as you allude to - better userspace
> > APIs. Inode number probably can't be an integer anymore, it needs to be
> > a string if it's going to be able to do all the things we want it to do.
> 
> Should be a uuid then. Expose inode uuid (and/or snapshot uuid) via
> xattr? Anyway, I know people more competent than me are thinking
> about the problem.

Exposing the subvolume ID via an xattr would be easy, if that works for
you.
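
Userspace would then do something like this (a sketch only - the
xattr name here is made up, nothing is implemented yet):

    /* Sketch only: "bcachefs.subvolume" is a hypothetical xattr name. */
    #include <sys/types.h>
    #include <sys/xattr.h>
    #include <stdio.h>

    static int print_subvol(const char *path)
    {
            char buf[32];
            ssize_t len = getxattr(path, "bcachefs.subvolume",
                                   buf, sizeof(buf) - 1);
            if (len < 0)
                    return -1;      /* no such xattr: not a subvolume root */
            buf[len] = '\0';
            printf("%s: subvolume %s\n", path, buf);
            return 0;
    }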

> What do you think about the suggestion to make creation of snapshots
> (and possibly volumes, not sure about that) a privileged operation
> (optional, of course)? I know multi-user is niche nowadays, but to
> me it looks like a rather cheap workaround.

Yeah, that's not an unreasonable ask.


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-29 21:43 ` Kent Overstreet
  2023-11-30  7:35   ` Donald Buczek
@ 2023-11-30 20:36   ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: NeilBrown @ 2023-11-30 20:36 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Thu, 30 Nov 2023, Kent Overstreet wrote:
> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > Hi,
> > 
> > I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> > 
> > As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
> > 
> > The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
> > 
> >     buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> >     buczek@dose:/scratch/local3$ mkdir vol1/dir1
> >     buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> >     buczek@dose:/scratch/local3$ ls -li vol1/
> >     total 0
> >     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >     1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> >     buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> >     total 0
> >     1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >     buczek@dose:/scratch/local3$ find .
> >     .
> >     ./vol1
> >     find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> >     ./vol1/dir1
> >     buczek@dose:/scratch/local3$
> > 
> > We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
> > 
> > Questions:
> > 
> > - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
> > 
> > - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
> 
> I'm considering writing an "integer identifiers considered harmful"
> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> adding in the overlayfs issue makes this problem sound fundamentally
> unsolvable given the current constraints.
> 
> Recognizing subvolume boundaries: we don't have anything for this yet,
> and given what happened with btrfs I'm not sure we want to go that
> route.
> 
> We could (probably) expose unique inode numbers, filesystem wide, but at
> a cost. We'd have to restrict ourselves internally to 32 bit inode
> numbers, and then report inode numbers that include the subvolume ID as
> the high 32 bits.
> 
> That won't leave enough bits to shard inode numbers at create time based
> on CPU id, which isn't the _most_ important optimization, but would hurt
> to lose.
> 
> Going forward, we really need - as you allude to - better userspace
> APIs. Inode number probably can't be an integer anymore, it needs to be
> a string if it's going to be able to do all the things we want it to do.
> 

If you want to give up on POSIX compatibility - and with it any hope
that your fs will be used - then using a string as the inode identifier
might work.  But in the real world, inodes have 64 bit numbers.

The only credible alternative is to use the fhandle as a primary
identifier.  You still need to support inode numbers, and they still
need to be stable and to be as unique as you can possibly make them
across the filesystem.

My preference would be to make the inode number reported to userspace be
a strong hash of the fhandle (which absolutely must be unique - but can be
quite large).  Then systemic clashes - like the one which results in find
misdetecting filesystem loops - will be extremely unlikely.

For code that really, really needs to detect if two files are the
same (tar being the obvious example), we should encourage the use of
name_to_handle_at().
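
A sketch of that (error handling mostly elided; when walking a single
filesystem the returned mount id can be ignored):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>

    static struct file_handle *get_handle(const char *path, int *mnt)
    {
            struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

            if (!fh)
                    return NULL;
            fh->handle_bytes = MAX_HANDLE_SZ;
            if (name_to_handle_at(AT_FDCWD, path, fh, mnt, 0)) {
                    free(fh);
                    return NULL;
            }
            return fh;
    }

    /* Compare two paths by file handle instead of by st_ino. */
    static bool same_file(const char *a, const char *b)
    {
            int mnt_a, mnt_b;
            struct file_handle *ha = get_handle(a, &mnt_a);
            struct file_handle *hb = get_handle(b, &mnt_b);
            bool same = ha && hb &&
                    ha->handle_bytes == hb->handle_bytes &&
                    ha->handle_type  == hb->handle_type &&
                    !memcmp(ha->f_handle, hb->f_handle, ha->handle_bytes);

            free(ha);
            free(hb);
            return same;
    }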

Experience with NFSv2 and exclusive opens shows that if a filesystem
breaks something in a way that still works most of the time but
occasionally fails randomly, then people will still use it (provided
it offers real benefits, of course) and they will find a work-around
to avoid the breakage.  So thanks to NFSv2, people stopped using
O_CREAT|O_EXCL as a locking mechanism, and did things with hard
links instead.

NeilBrown


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-30  7:39     ` Kent Overstreet
@ 2023-11-30 20:37       ` NeilBrown
  2023-12-04 10:47         ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-11-30 20:37 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Thu, 30 Nov 2023, Kent Overstreet wrote:
> On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> > On 11/29/23 22:43, Kent Overstreet wrote:
> > > On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> > > > Hi,
> > > > 
> > > > I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> > > > 
> > > > As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
> > > > 
> > > > The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
> > > > 
> > > >      buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> > > >      buczek@dose:/scratch/local3$ mkdir vol1/dir1
> > > >      buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> > > >      buczek@dose:/scratch/local3$ ls -li vol1/
> > > >      total 0
> > > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > > >      1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> > > >      buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> > > >      total 0
> > > >      1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> > > >      buczek@dose:/scratch/local3$ find .
> > > >      .
> > > >      ./vol1
> > > >      find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> > > >      ./vol1/dir1
> > > >      buczek@dose:/scratch/local3$
> > > > 
> > > > We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
> > > > 
> > > > Questions:
> > > > 
> > > > - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
> > > > 
> > > > - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
> > > 
> > > I'm considering writing an "integer identifiers considered harmful"
> > > paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> > > adding in the overlayfs issue makes this problem sound fundamentally
> > > unsolvable given the current constraints.
> > > 
> > > Recognizing subvolume boundaries: we don't have anything for this yet,
> > > and given what happened with btrfs I'm not sure we want to go that
> > > route.
> > 
> > A fabricated fsid - I think that is what btrfs does - wouldn't help
> > us either. The mentioned tools would just stop at the assumed mount
> > points.
> 
> *nod* - nor for NFS.
> 
> > > We could (probably) expose unique inode numbers, filesystem wide, but at
> > > a cost. We'd have to restrict ourselves internally to 32 bit inode
> > > numbers, and then report inode numbers that include the subvolume ID as
> > > the high 32 bits.
> > > 
> > > That won't leave enough bits to shard inode numbers at create time based
> > > on CPU id, which isn't the _most_ important optimization, but would hurt
> > > to lose.
> > > 
> > > Going forward, we really need - as you allude to - better userspace
> > > APIs. Inode number probably can't be an integer anymore, it needs to be
> > > a string if it's going to be able to do all the things we want it to do.
> > 
> > Should be a uuid then. Expose inode uuid (and/or snapshot uuid) via
> > xattr? Anyway, I know people more competent than me are thinking
> > about the problem.
> 
> Exposing the subvolume ID via an xattr would be easy, if that works for
> you.

I think that would be a bad choice.  Stick with the filehandle.  It is
an existing interface supported by many filesystems and it works over
NFS.

NeilBrown


> 
> > What do you think about the suggestion to make creation of snapshots
> > (and possibly volumes, not sure about that) a privileged operation
> > (optional, of course)? I know multi-user is niche nowadays, but to
> > me it looks like a rather cheap workaround.
> 
> Yeah, that's not an unreasonable ask.
> 



* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-11-30 20:37       ` NeilBrown
@ 2023-12-04 10:47         ` Donald Buczek
  2023-12-04 22:45           ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-04 10:47 UTC (permalink / raw)
  To: NeilBrown, Kent Overstreet; +Cc: linux-bcachefs, Stefan Krueger



On 11/30/23 21:37, NeilBrown wrote:
> On Thu, 30 Nov 2023, Kent Overstreet wrote:
>> On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
>>> On 11/29/23 22:43, Kent Overstreet wrote:
>>>> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
>>>>> Hi,
>>>>>
>>>>> I'm happy bcachefs finally made it to mainline. Kudos to Kent!
>>>>>
>>>>> As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
>>>>>
>>>>> The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
>>>>>
>>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
>>>>>       buczek@dose:/scratch/local3$ mkdir vol1/dir1
>>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
>>>>>       buczek@dose:/scratch/local3$ ls -li vol1/
>>>>>       total 0
>>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>>>>>       1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
>>>>>       buczek@dose:/scratch/local3$ ls -li vol1/snp1/
>>>>>       total 0
>>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
>>>>>       buczek@dose:/scratch/local3$ find .
>>>>>       .
>>>>>       ./vol1
>>>>>       find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
>>>>>       ./vol1/dir1
>>>>>       buczek@dose:/scratch/local3$
>>>>>
>>>>> We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
>>>>>
>>>>> Questions:
>>>>>
>>>>> - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
>>>>>
>>>>> - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
>>>>
>>>> I'm considering writing an "integer identifiers considered harmful"
>>>> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
>>>> adding in the overlayfs issue makes this problem sound fundamentally
>>>> unsolvable given the current constraints.
>>>>
>>>> Recognizing subvolume boundaries: we don't have anything for this yet,
>>>> and given what happened with btrfs I'm not sure we want to go that
>>>> route.
>>>
>>> A fabricated fsid - I think that is what btrfs does - wouldn't help
>>> us either. The mentioned tools would just stop at the assumed mount
>>> points.
>>
>> *nod* - nor for NFS.
>>
>>>> We could (probably) expose unique inode numbers, filesystem wide, but at
>>>> a cost. We'd have to restrict ourselves internally to 32 bit inode
>>>> numbers, and then report inode numbers that include the subvolume ID as
>>>> the high 32 bits.
>>>>
>>>> That won't leave enough bits to shard inode numbers at create time based
>>>> on CPU id, which isn't the _most_ important optimization, but would hurt
>>>> to lose.
>>>>
>>>> Going forward, we really need - as you allude to - better userspace
>>>> APIs. Inode number probably can't be an integer anymore, it needs to be
>>>> a string if it's going to be able to do all the things we want it to do.
>>>
>>> Should be a uuid then. Expose inode uuid (and/or snapshot uuid) via
>>> xattr? Anyway, I know people more competent than me are thinking
>>> about the problem.
>>
>> Exposing the subvolume ID via an xattr would be easy, if that works for
>> you.
> 
> I think that would be a bad choice.  Stick with the filehandle.  It is
> an existing interface supported by many filesystems and it works over
> NFS.


Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?  I'm not sure if it is a good idea to stick filesystem-specific meta-information (like a subvolume ident) into special bits of opaque numbers (whether inode number or filehandle) at all.

The information could also be made available via specialized ioctl(s) as an alternative to xattr.

One disadvantage of embedding in the filehandle, an xattr or an ioctl is that userspace would need an extra syscall for each directory just to see whether it is a subvolume or a snapshot, which most of the time it is not. Maybe even an otherwise unnecessary open is needed to get a consistent view across multiple information-retrieving syscalls. If only statx() were extendable.

You don't have this problem with a crafted fsid. Then creating a subvolume or a snapshot could be seen as an implied mount. Maybe that's not the worst solution.

Best

   Donald

> 
> NeilBrown
> 
> 
>>
>>> What do you think about the suggestion to make creation of snapshots
>>> (and possibly volumes, not sure about that) a privileged operation
>>> (optional, of course)? I know multi-user is niche nowadays, but to
>>> me it looks like a rather cheap workaround.
>>
>> Yeah, that's not an unreasonable ask.
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-04 10:47         ` Donald Buczek
@ 2023-12-04 22:45           ` NeilBrown
  2023-12-05 21:35             ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-04 22:45 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger

On Mon, 04 Dec 2023, Donald Buczek wrote:
> 
> On 11/30/23 21:37, NeilBrown wrote:
> > On Thu, 30 Nov 2023, Kent Overstreet wrote:
> >> On Thu, Nov 30, 2023 at 08:35:12AM +0100, Donald Buczek wrote:
> >>> On 11/29/23 22:43, Kent Overstreet wrote:
> >>>> On Tue, Nov 28, 2023 at 08:49:23AM +0100, Donald Buczek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm happy bcachefs finally made it to mainline. Kudos to Kent!
> >>>>> 
> >>>>> As far as I know, there is an ongoing discussion about the problems of non-unique inode numbers exposed by snapshots, and no real conclusion yet on how filesystems should expose snapshots.
> >>>>> 
> >>>>> The current behavior is kind of a showstopper for us, because we are still running multi-user systems. The problem is that any unprivileged user can create subvolumes and snapshots:
> >>>>>
> >>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume create vol1
> >>>>>       buczek@dose:/scratch/local3$ mkdir vol1/dir1
> >>>>>       buczek@dose:/scratch/local3$ bcachefs subvolume snapshot vol1/snp1
> >>>>>       buczek@dose:/scratch/local3$ ls -li vol1/
> >>>>>       total 0
> >>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >>>>>       1476413180 drwxrwxr-x 3 buczek buczek 0 Nov 20 15:01 snp1
> >>>>>       buczek@dose:/scratch/local3$ ls -li vol1/snp1/
> >>>>>       total 0
> >>>>>       1342189197 drwxrwxr-x 2 buczek buczek 0 Nov 20 15:01 dir1
> >>>>>       buczek@dose:/scratch/local3$ find .
> >>>>>       .
> >>>>>       ./vol1
> >>>>>       find: File system loop detected; ‘./vol1/snp1’ is part of the same file system loop as ‘./vol1’.
> >>>>>       ./vol1/dir1
> >>>>>       buczek@dose:/scratch/local3$
> >>>>>
> >>>>> We have a few tools which walk over the filesystem (backup, mirror, accounting), and these are just not prepared for non-unique inode numbers aside from regular hardlinks. I'm concerned that experiments by unprivileged users could make important tools fail.
> >>>>>
> >>>>> Questions:
> >>>>>
> >>>>> - Would it be a workaround to make creation of subvolumes and snapshots privileged operations?
> >>>>>
> >>>>> - If we want to evolve our tools: What is the best way for userspace to recognize subvolumes and snapshots and tell them apart from ordinary directories?
> >>>>
> >>>> I'm considering writing an "integer identifiers considered harmful"
> >>>> paper. I haven't dug into it yet - just skimmed the lwn coverage - but
> >>>> adding in the overlayfs issue makes this problem sound fundamentally
> >>>> unsolvable given the current constraints.
> >>>>
> >>>> Recognizing subvolume boundaries: we don't have anything for this yet,
> >>>> and given what happened with btrfs I'm not sure we want to go that
> >>>> route.
> >>>
> >>> A fabricated fsid - I think that is what btrfs does - wouldn't help
> >>> us either. The mentioned tools would just stop at the assumed mount
> >>> points.
> >>
> >> *nod* - nor for NFS.
> >>
> >>>> We could (probably) expose unique inode numbers, filesystem wide, but at
> >>>> a cost. We'd have to restrict ourselves internally to 32 bit inode
> >>>> numbers, and then report inode numbers that include the subvolume ID as
> >>>> the high 32 bits.
> >>>>
> >>>> That won't leave enough bits to shard inode numbers at create time based
> >>>> on CPU id, which isn't the _most_ important optimization, but would hurt
> >>>> to lose.
> >>>>
> >>>> Going forward, we really need - as you allude to - better userspace
> >>>> APIs. Inode number probably can't be an integer anymore, it needs to be
> >>>> a string if it's going to be able to do all the things we want it to do.
> >>>
> >>> Should be a uuid then. Expose inode uuid (and/or snapshot uuid) via
> >>> xattr? Anyway, I know people more competent than me are thinking
> >>> about the problem.
> >>
> >> Exposing the subvolume ID via an xattr would be easy, if that works for
> >> you.
> > 
> > I think that would be a bad choice.  Stick with the filehandle.  It is
> > an existing interface supported by many filesystems and it works over
> > NFS.
> 
> 
> Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?
> I'm not sure if it is a good idea to stick filesystem-specific
> meta-information (like a subvolume ident) into special bits of opaque
> numbers (whether inode number or filehandle) at all. 

Yes, a filehandle is opaque.  Digging around inside it would be wrong.

My understanding was that your primary need was not to know where
subvolumes were, but to know when two files were different even though
they had the same inode number.  filehandles can be used to tell if two
files are different with total reliability.

> 
> The information could also be made available via specialized ioctl(s)
> as an alternative to xattr.

ioctl would be worse than xattr.  statx extension would be best if you
need something that filehandle cannot provide.

> 
> One disadvantage of embedding in the filehandle, an xattr or an ioctl
> is that userspace would need an extra syscall for each directory just
> to see whether it is a subvolume or a snapshot, which most of the time
> it is not.  Maybe even an otherwise unnecessary open is needed to get
> a consistent view across multiple information-retrieving syscalls.  If
> only statx() were extendable.

uhmm... statx is extendable...
Adding a STATX_ATTR_SUBVOLUME_ROOT attribute flag shouldn't be too
controversial, if a clear need and use-case were presented.
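
I.e. something like this in the walker (a sketch - the flag and its
value are hypothetical, nothing is merged):

    #define _GNU_SOURCE
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdbool.h>

    #define STATX_ATTR_SUBVOLUME_ROOT 0x00400000    /* hypothetical */

    static bool is_subvol_root(int dirfd, const char *name)
    {
            struct statx stx;

            if (statx(dirfd, name, AT_SYMLINK_NOFOLLOW,
                      STATX_BASIC_STATS, &stx))
                    return false;
            /* only trust the bit if the filesystem reports it */
            return (stx.stx_attributes_mask & STATX_ATTR_SUBVOLUME_ROOT) &&
                   (stx.stx_attributes      & STATX_ATTR_SUBVOLUME_ROOT);
    }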

> 
> You don't have this problem with a crafted fsid.  Then creating a
> subvolume or a snapshot could be seen as an implied mount.  Maybe
> that's not the worst solution. 

No, probably not the *worst*, but not far from it.  It is a really bad
non-solution.

Have a look at
 https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

This is a monstrosity that SUSE needs to carry so that the fake fsid
that btrfs generates for subvolumes is used consistently.  There are
various interfaces in the kernel that use fsid/inode - some just for
error reporting, others for more substantial things.  In mainline linux
these do the wrong thing for btrfs subvolumes.

Please don't copy this mistake into bcachefs.

Thanks,
NeilBrown

> 
> Best
> 
>    Donald
> 
> > 
> > NeilBrown
> > 
> > 
> >>
> >>> What do you think about the suggestion to make creation of snapshots
> >>> (and possibly volumes, not sure about that) a privileged operation
> >>> (optional, of course)? I know multi-user is niche nowadays, but to
> >>> me it looks like a rather cheap workaround.
> >>
> >> Yeah, that's not an unreasonable ask.
> >>
> > 
> 
> -- 
> Donald Buczek
> buczek@molgen.mpg.de
> Tel: +49 30 8413 1433
> 



* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-04 22:45           ` NeilBrown
@ 2023-12-05 21:35             ` Donald Buczek
  2023-12-05 22:01               ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-05 21:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger

On 12/4/23 23:45, NeilBrown wrote:

>> Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?
>> I'm not sure if it is a good idea to stick filesystem-specific
>> meta-information (like a subvolume ident) into special bits of opaque
>> numbers (whether inode number or filehandle) at all.
> 
> Yes, a filehandle is opaque.  Digging around inside it would be wrong.
> 
> My understanding was that your primary need was not to know where
> subvolumes were, but to know when two files were different even though
> they had the same inode number.  filehandles can be used to tell if two
> files are different with total reliability.

You are right, these are two orthogonal requests and they were not clearly separated in this thread, thus the misunderstanding.

I think unique identifiers for different objects are a must. I think in bcachefs the filehandles returned by name_to_handle_at() are already unique, so this could be used by userspace to work around duplicate inode numbers right away. Even the hashing you proposed could be done in userspace if it wants to fold the data into some existing format with inode numbers.
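
E.g. a sketch of such userspace folding (FNV-1a picked arbitrarily as the example hash):

    #include <stdint.h>

    /* Sketch: fold an opaque file handle into a 64-bit pseudo inode
     * number for tools that insist on inode-number-shaped keys. */
    static uint64_t handle_to_ino(const unsigned char *f_handle,
                                  unsigned int bytes)
    {
            uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */

            for (unsigned int i = 0; i < bytes; i++) {
                    h ^= f_handle[i];
                    h *= 0x100000001b3ULL;          /* FNV prime */
            }
            return h;
    }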

>> The information could also be made available via specialized ioctl(s)
>> as an alternative to xattr.
> 
> ioctl would be worse than xattr.  statx extension would be best if you
> need something that filehandle cannot provide.
> 
>>
>> One disadvantage of embedding in the filehandle, an xattr or an ioctl
>> is that userspace would need an extra syscall for each directory just
>> to see whether it is a subvolume or a snapshot, which most of the time
>> it is not.  Maybe even an otherwise unnecessary open is needed to get
>> a consistent view across multiple information-retrieving syscalls.  If
>> only statx() were extendable.
> 
> uhmm... statx is extendable...
> Adding a STATX_ATTR_SUBVOLUME_ROOT attribute flag shouldn't be too
> controversial, if a clear need and use-case were presented.

When the inode numbers are made unique one way or the other, but tools like backup or mirror have no way to identify snapshots, they would just copy the snapshots, removing the deduplication on the destination. If they had a way to figure out that they are about to walk into a new volume, they could instead do something more sensible, like stop and cry for help, or whatever is explicitly configured.

statx() would be very nice, because you can avoid additional system calls. I wish we could get the file_handle in the same call, but I assume that cannot be added to statx() because of its potentially big and unknown length. However, I assume the combination of volume identifier and inode number, which would both be available from one statx() call, is also unique, so this should do, too.

Best

   Donald

>> You don't have this problem with a crafted fsid.  Then creating a
>> subvolume or a snapshot could be seen as an implied mount.  Maybe
>> that's not the worst solution.
> 
> No, probably not the *worst*, but not far from it.  It is a really bad
> non-solution.
> 
> Have a look at
>   https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev
> 
> This is a monstrosity that SUSE needs to carry so that the fake fsid
> that btrfs generates for subvolumes is used consistently.  There are
> various interfaces in the kernel that use fsid/inode - some just for
> error reporting, others for more substantial things.  In mainline linux
> these do the wrong thing for btrfs subvolumes.
> 
> Please don't copy this mistake into bcachefs.
> 
> Thanks,
> NeilBrown
> 
>>
>> Best
>>
>>     Donald
>>
>>>
>>> NeilBrown
>>>
>>>
>>>>
>>>>> What do you think about the suggestion to make creation of snapshots
>>>>> (and possibly volumes, not sure about that) a privileged operation
>>>>> (optional, of course)? I know multi-user is niche nowadays, but to
>>>>> me it looks like a rather cheap workaround.
>>>>
>>>> Yeah, that's not an unreasonable ask.
>>>>
>>>
>>
>> -- 
>> Donald Buczek
>> buczek@molgen.mpg.de
>> Tel: +49 30 8413 1433
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-05 21:35             ` Donald Buczek
@ 2023-12-05 22:01               ` NeilBrown
  2023-12-07 11:53                 ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-05 22:01 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger

On Wed, 06 Dec 2023, Donald Buczek wrote:
> On 12/4/23 23:45, NeilBrown wrote:
> 
> >> Hmm, isn't a (struct file_handle) filehandle supposed to be opaque?
> >> I'm not sure if it is a good idea to stick filesystem-specific
> >> meta-information (like a subvolume ident) into special bits of opaque
> >> numbers (whether inode number or filehandle) at all.
> > 
> > Yes, a filehandle is opaque.  Digging around inside it would be wrong.
> > 
> > My understanding was that your primary need was not to know where
> > subvolumes were, but to know when two files were different even though
> > they had the same inode number.  filehandles can be used to tell if two
> > files are different with total reliability.
> 
> You are right, these are two orthogonal requests and they were not
> clearly separated in this thread, thus the misunderstanding. 
> 
> I think unique identifiers for different objects are a must.  I think
> in bcachefs the filehandles returned by name_to_handle_at() are
> already unique, so this could be used by userspace to work around
> duplicate inode numbers right away.  Even the hashing you proposed
> could be done in userspace if it wants to fold the data into some
> existing format with inode numbers.
> 
> >> The information could also be made available via specialized ioctl(s)
> >> as an alternative to xattr.
> > 
> > ioctl would be worse than xattr.  statx extension would be best if you
> > need something that filehandle cannot provide.
> > 
> >>
> >> One disadvantage of embedding in the filehandle, an xattr or an ioctl
> >> is that userspace would need an extra syscall for each directory just
> >> to see whether it is a subvolume or a snapshot, which most of the time
> >> it is not.  Maybe even an otherwise unnecessary open is needed to get
> >> a consistent view across multiple information-retrieving syscalls.  If
> >> only statx() were extendable.
> > 
> > uhmm... statx is extendable...
> > Adding a STATX_ATTR_SUBVOLUME_ROOT attribute flag shouldn't be too
> > controversial, if a clear need and use-case were presented.
> 
> When the inode numbers are made unique one way or the other, but tools
> like backup or mirror have no way to identify snapshots, they would
> just copy the snapshots, removing the deduplication on the destination.
> If they had a way to figure out that they are about to walk into a
> new volume, they could instead do something more sensible, like stop
> and cry for help, or whatever is explicitly configured.
> 
> statx() would be very nice, because you can avoid additional system
> calls.  I wish we could get the file_handle in the same call, but I
> assume that cannot be added to statx() because of its potentially
> big and unknown length.  However, I assume the combination of volume
> identifier and inode number, which would both be available from one
> statx() call, is also unique, so this should do, too.

The file handle is limited in size to 128 bytes.  The current statx
structure doesn't have that much spare space, but I don't see a problem
with having it return a larger structure if that is explicitly
requested.  Of course I don't know what other developers might think of
that.

Extending statx to return a "root-of-subvol" flag or possibly a
"subvol-id" u64 should be uncontroversial and would allow backup tools
and similar to easily ignore subvols.  This would need to be driven by
the bcachefs team of course.

Thanks,
NeilBrown


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-05 22:01               ` NeilBrown
@ 2023-12-07 11:53                 ` Donald Buczek
  2023-12-08  1:16                   ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-07 11:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger

On 12/5/23 23:01, NeilBrown wrote:
> On Wed, 06 Dec 2023, Donald Buczek wrote:
>
>> statx() would be very nice, because you can avoid additional system
>> calls.  I wish we could get the file_handle in the same call, but I
>> assume that cannot be added to statx() because of its potentially
>> big and unknown length.  However, I assume the combination of volume
>> identifier and inode number, which would both be available from one
>> statx() call, is also unique, so this should do, too.
> 

> The file handle is limited in size to 128 bytes.  The current statx
> structure doesn't have that much spare space, but I don't see a problem
> with having it return a larger structure if that is explicitly
> requested.  Of course I don't know what other developers might think of
> that.
> 
> Extending statx to return a "root-of-subvol" flag or possibly a
> "subvol-id" u64 should be uncontroversial and would allow backup tools
> and similar to easily ignore subvols.  This would need to be driven by
> the bcachefs team of course.

I'd vote for the subvol-id. I think with bcachefs a u32 would do, but it could be returned as u64 for future growth or other filesystems.

128 additional bytes of space for (any) filehandle seems like a lot, and it is more than the current 96 bytes of reserve in `struct statx`. I don't know if a user-provided pointer field (optional, pointing to another 128-byte user buffer) would be acceptable at all.

It's not really needed for bcachefs (its file_handle is just 16 bytes anyway: u64 inode, u32 volume, u32 generation) or for my usecase (don't blindly walk into snapshots). But other tools might profit from the ability to get any file_handle with statx(), e.g. a userspace NFS server, or software with an external file registration database.
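
(For illustration, my understanding of that 16-byte payload - a sketch, field names made up:)

    #include <stdint.h>

    /* Sketch of the bcachefs handle payload as I understand it -
     * field names are made up: */
    struct bch_fhandle_guess {
            uint64_t inode;         /* inode number */
            uint32_t subvol;        /* subvolume ID */
            uint32_t generation;    /* inode generation */
    };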

Best
  Donald

> 
> Thanks,
> NeilBrown

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433



* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-07 11:53                 ` Donald Buczek
@ 2023-12-08  1:16                   ` NeilBrown
  2023-12-08  1:37                     ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-08  1:16 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger

On Thu, 07 Dec 2023, Donald Buczek wrote:
> 
> I'd vote for the subvol-id.  I think with bcachefs a u32 would do,
> but it could be returned as u64 for future growth or other filesystems.

Probably a good choice.  Though I'd like to re-emphasise that even doing
this doesn't remove the need to make sure inode numbers are as unique as
they can possibly be made.  There are multiple places in /proc (for
example) where a file is identified by fsid and inode number.  It isn't
really practical to insert a volume-id in there (though maybe we could
duplicate all those interfaces....).

Inode numbers would need to be absolutely unique within a volume, and
mostly unique within a filesystem.  Xor-ing the internal inode number
with a hash of the volume number is one way to achieve this.
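
A sketch of what I mean (the particular mixer doesn't matter, any
decent 64-bit hash will do):

    #include <stdint.h>

    static uint64_t hash64(uint64_t x)
    {
            x *= 0x9e3779b97f4a7c15ULL;     /* Fibonacci hashing constant */
            return x ^ (x >> 32);
    }

    /* Unique within a volume, very unlikely to clash across volumes. */
    static uint64_t report_inum(uint64_t inum, uint32_t subvol)
    {
            uint64_t ino = inum ^ hash64(subvol);

            return ino ? ino : 1;   /* avoid reserved inode number 0 */
    }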

I.e. while enabling processes that use the stat() family of functions
is clearly important, it does not solve all the problems caused by the
addition of a volume/subvol level between "filesystem" and "inode".

Thanks,
NeilBrown



* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-08  1:16                   ` NeilBrown
@ 2023-12-08  1:37                     ` Kent Overstreet
  2023-12-08  2:13                       ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-08  1:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Fri, Dec 08, 2023 at 12:16:53PM +1100, NeilBrown wrote:
> On Thu, 07 Dec 2023, Donald Buczek wrote:
> > 
> > I'd vote for the subvol-id.  I think with bcachefs a u32 would do,
> > but it could be returned as u64 for future growth or other filesystems.
> 
> Probably a good choice.  Though I'd like to re-emphasise that even doing
> this doesn't remove the need to make sure inode numbers are as unique as
> they can possibly be made.  There are multiple places in /proc (for
> example) where a file is identified by fsid and inode number.  It isn't
> really practical to insert a volume-id in there (though maybe we could
> duplicate all those interfaces....).
> 
> Inode numbers would need to be absolutely unique within a volume, and
> mostly unique within a filesystem.  Xor-ing the internal inode number
> with a hash of the volume number is one way to achieve this.
> 
> I.e. while enabling processes that use the stat() family of functions
> is clearly important, it does not solve all the problems caused by the
> addition of a volume/subvol level between "filesystem" and "inode".

So, I'm hoping we can start pushing the NFS file handle as a standard
interface for unique file ID, but since we clearly need something
interim/for existing programs:

xoring in the subvolume ID sounds like it shouldn't have any real
downside, do we want to make that the default?

Another possibility: we do have the ability to restrict inode numbers to
32 bits, I've been contemplating using this to report the subvol ID in
the top 32 bits. Not my preferred approach, since we currently use the
top bits of the inode number for sharding by cpu ID and this would be
incompatible with that option, but sounds like it might be worth adding.

For plumbing in an option for how to include the subvolume ID in the
inode number - we'll add it to fs/bcachefs/opts.h as an OPT_STR() option
(i.e. an enum), reserve a few bits in the superblock for it, specify in
opts.h when it can be set (at mount, runtime, format time).

statx sounds like a no brainer. I don't have the bandwidth to get into
that right now, but: inode_inum().subvol is the bcachefs interface for
getting it. (Not bi_subvolume in the on disk inode, that's only set on
subvolume roots).

How does that sound on the bcachefs side, Neil?


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-08  1:37                     ` Kent Overstreet
@ 2023-12-08  2:13                       ` NeilBrown
  2023-12-08  2:49                         ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-08  2:13 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Fri, 08 Dec 2023, Kent Overstreet wrote:
> On Fri, Dec 08, 2023 at 12:16:53PM +1100, NeilBrown wrote:
> > On Thu, 07 Dec 2023, Donald Buczek wrote:
> > > 
> > > I'd vote for the subvol-id.  I think with bcachefs a u32 would do,
> > > but it could be returned as u64 for future growth or other filesystems.
> > 
> > Probably a good choice.  Though I'd like to re-emphasise that even doing
> > this doesn't remove the need to make sure inode numbers are as unique as
> > they can possibly be made.  There are multiple places in /proc (for
> > example) where a file is identified by fsid and inode number.  It isn't
> > really practical to insert a volume-id in there (though maybe we could
> > duplicate all those interfaces....).
> > 
> > Inode numbers would need to be absolutely unique within a volume, and
> > mostly unique within a filesystem.  Xor-ing the internal inode number
> > with a hash of the volume number is one way to achieve this.
> > 
> > I.e. while enabling processes that use the stat() family of functions
> > is clearly important, it does not solve all the problems caused by the
> > addition of a volume/subvol level between "filesystem" and "inode".
> 
> So, I'm hoping we can start pushing the NFS file handle as a standard
> interface for unique file ID, but since we clearly need something
> interim/for existing programs:
> 
> xoring in the subvolume ID sounds like it shouldn't have any real
> downside, do we want to make that the default?

The only downside is that it does not provide 100% uniqueness so it
could give people a false sense of security.
You would need to be careful that the xor doesn't result in any reserved
inode numbers - 0 in particular.  That isn't hard though.
If you are going to do it, then don't even make it an option - always do
it.

> 
> Another possibility: we do have the ability to restrict inode numbers to
> 32 bits, I've been contemplating using this to report the subvol ID in
> the top 32 bits. Not my preferred approach, since we currently use the
> top bits of the inode number for sharding by cpu ID and this would be
> incompatible with that option, but sounds like it might be worth adding.

I think this is far-and-away the best option - for bcachefs.  All the
problems disappear.
Can you really not use other bits to shard by cpu ID?

For btrfs, it would probably be better to stick with 64bit inode numbers
so that bcachefs and btrfs have similar problems and can present a joint
front in trying to solve them.

(The only reason btrfs cannot use just 32bits for inode numbers is that
 they never re-use inode numbers.  !?!?!?! )

> 
> For plumbing in an option for how to include the subvolume ID in the
> inode number - we'll add it to fs/bcachefs/opts.h as an OPT_STR() option
> (i.e. an enum), reserve a few bits in the superblock for it, specify in
> opts.h when it can be set (at mount, runtime, format time).
> 
> statx sounds like a no brainer. I don't have the bandwidth to get into
> that right now, but: inode_inum().subvol is the bcachefs interface for
> getting it. (Not bi_subvolume in the on disk inode, that's only set on
> subvolume roots).
> 
> How does that sound on the bcachefs side, Neil?
> 

I'm against making things optional, and don't think sharding (or not
reusing inode numbers) is a good excuse to cause Posix incompatibility. 
But other than that, it makes sense.

Thanks,
NeilBrown


* Re: How to cope with subvolumes and snapshots on multi-user systems?
  2023-12-08  2:13                       ` NeilBrown
@ 2023-12-08  2:49                         ` Kent Overstreet
  2023-12-08 11:34                           ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-08  2:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Fri, Dec 08, 2023 at 01:13:48PM +1100, NeilBrown wrote:
> On Fri, 08 Dec 2023, Kent Overstreet wrote:
> > On Fri, Dec 08, 2023 at 12:16:53PM +1100, NeilBrown wrote:
> > > On Thu, 07 Dec 2023, Donald Buczek wrote:
> > > > 
> > > > I'd vote for the subvol-id.  I think with bcachefs a u32 would do,
> > > > but it could be returned as u64 for future growth or other filesystems.
> > > 
> > > Probably a good choice.  Though I'd like to re-emphasise that even doing
> > > this doesn't remove the need to make sure inode numbers are as unique as
> > > they can possibly be made.  There are multiple places in /proc (for
> > > example) where a file is identified by fsid and inode number.  It isn't
> > > really practical to insert a volume-id in there (though maybe we could
> > > duplicate all those interfaces....).
> > > 
> > > Inode numbers would need to be absolutely unique within a volume, and
> > > mostly unique within a filesystem.  Xor-ing the internal inode number
> > > with a hash of the volume number is one way to achieve this.
> > > 
> > > I.e. while enabling processes that use the stat() family of functions
> > > is clearly important, it does not solve all the problems caused by the
> > > addition of a volume/subvol level between "filesystem" and "inode".
> > 
> > So, I'm hoping we can start pushing the NFS file handle as a standard
> > interface for unique file ID, but since we clearly need something
> > interim/for existing programs:
> > 
> > xoring in the subvolume ID sounds like it shouldn't have any real
> > downside, do we want to make that the default?
> 
> The only downside is that it does not provide 100% uniqueness so it
> could give people a false sense of security.
> You would need to be careful that the xor doesn't result in any reserved
> inode numbers - 0 in particular.  That isn't hard though.
> If you are going to do it, then don't even make it an option - always do
> it.

Well, see below.

> > Another possibility: we do have the ability to restrict inode numbers to
> > 32 bits, I've been contemplating using this to report the subvol ID in
> > the top 32 bits. Not my preferred approach, since we currently use the
> > top bits of the inode number for sharding by cpu ID and this would be
> > incompatible with that option, but sounds like it might be worth adding.
> 
> I think this is far-and-away the best option - for bcachefs.  All the
> problems disappear.
> Can you really not use other bits to shard by cpu ID?

Hmm. Actually doing the math, maybe we can make this work...

We really only need 6 or 7 bits out of the inode number for sharding;
then 20-32 bits (nobody's going to have a billion snapshots; a million
is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
bits for actually allocating inodes out of.
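
To make the arithmetic concrete, a minimal sketch of the kind of
packing being discussed here - the field widths and names are
illustrative only, not the actual bcachefs layout:

    /* Illustrative split of a 64-bit inode number into shard,
     * subvolume and per-subvolume allocation fields.  Widths are
     * examples from this discussion, not what bcachefs uses. */
    #include <stdint.h>

    #define SHARD_BITS   7
    #define SUBVOL_BITS  20
    #define INUM_BITS    (64 - SHARD_BITS - SUBVOL_BITS)    /* 37 */

    static inline uint64_t pack_ino(uint64_t shard, uint64_t subvol,
                                    uint64_t inum)
    {
            return (shard << (SUBVOL_BITS + INUM_BITS)) |
                   (subvol << INUM_BITS) |
                   inum;
    }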

That'll be enough for the vast, vast majority of users, but exceeding
that limit is already something we're technically capable of: we're
currently seeing filesystems well over 100 TB, petabyte range expected
as fsck gets more optimized and online fsck comes.

So I can't bake in a limit like that, we need to keep our options open
:)

> For btrfs, it would probably be better to stick with 64bit inode numbers
> so that bcachefs and btrfs have similar problems and can present a joint
> front in trying to solve them.
> 
> (The only reason btrfs cannot use just 32bits for inode numbers is that
>  they never re-use inode numbers.  !?!?!?! )

That is a !?!?!?. bcachefs certainly can, and it's not hard, we leave
around an inode_generation key when we delete an inode.

> I'm against making things optional, and don't think sharding (or not
> reusing inode numbers) is a good excuse to cause Posix incompatibility. 
> But other than that, it makes sense.

I just don't think posix compatibility is realistically possible in all
situations we've got coming - especially with overlayfs to consider.
People are also starting to want to do fun things with container
filesystems, that's becoming a hot topic as well and depending on how
it's done the problem of combining inode numbers from multiple
filesystems into one coherent "view" may come up again.

So I think we also need to be designing something that we know is going
to work and relaxing our constraints a bit.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: How to cope with subvolumes and snapshots on muti-user systems?
  2023-12-08  2:49                         ` Kent Overstreet
@ 2023-12-08 11:34                           ` Donald Buczek
  2023-12-08 20:02                             ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-08 11:34 UTC (permalink / raw)
  To: Kent Overstreet, NeilBrown; +Cc: linux-bcachefs, Stefan Krueger

On 12/8/23 03:49, Kent Overstreet wrote:

> We really only need 6 or 7 bits out of the inode number for sharding;
> then 20-32 bits (nobody's going to have a billion snapshots; a million
> is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> bits for actually allocating inodes out of.
> 
> That'll be enough for the vast, vast majority of users, but exceeding
> that limit is already something we're technically capable of: we're
> currently seeing filesystems well over 100 TB, petabyte range expected
> as fsck gets more optimized and online fsck comes.

30 bits would not be enough even today:

buczek@done:~$ df -i /amd/done/C/C8024
Filesystem         Inodes     IUsed      IFree IUse% Mounted on
/dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024

So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).

And if the idea to produce unique inode numbers by hashing the filehandle into 64 bits is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bits is already over 1% [1].

[1] https://en.wikipedia.org/wiki/Birthday_problem

    from math import exp
    n = 618857441    # objects (inodes in use on the fs above)
    d = 2**64        # possible 64-bit hash values
    print(1 - exp(-(n**2) / (2*d)))

0.010327121831036457

D.


> So I can't bake in a limit like that, we need to keep our options open
> :)
> 
>> For btrfs, it would probably be better to stick with 64bit inode numbers
>> so that bcachefs and btrfs have similar problems and can present a joint
>> front in trying to solve them.
>>
>> (The only reason btrfs cannot use just 32bits for inode numbers is that
>>  they never re-use inode numbers.  !?!?!?! )
> 
> That is a !?!?!?. bcachefs certainly can, and it's not hard, we leave
> around an inode_generation key when we delete an inode.
> 
>> I'm against making things optional, and don't think sharding (or not
>> reusing inode numbers) is a good excuse to cause Posix incompatibility. 
>> But other than that, it makes sense.
> 
> I just don't think posix compatibility is realistically possible in all
> situations we've got coming - especially with overlayfs to consider.
> People are also starting to want to do fun things with container
> filesystems, that's becoming a hot topic as well and depending on how
> it's done the problem of combining inode numbers from multiple
> filesystems into one coherent "view" may come up again.
> 
> So I think we also need to be designing something that we know is going
> to work and relaxing our constraints a bit.

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: How to cope with subvolumes and snapshots on muti-user systems?
  2023-12-08 11:34                           ` Donald Buczek
@ 2023-12-08 20:02                             ` Kent Overstreet
  2023-12-11 22:43                               ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-08 20:02 UTC (permalink / raw)
  To: Donald Buczek; +Cc: NeilBrown, linux-bcachefs, Stefan Krueger

On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> On 12/8/23 03:49, Kent Overstreet wrote:
> 
> > We really only need 6 or 7 bits out of the inode number for sharding;
> > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > bits for actually allocating inodes out of.
> > 
> > That'll be enough for the vast, vast majority of users, but exceeding
> > that limit is already something we're technically capable of: we're
> > currently seeing filesystems well over 100 TB, petabyte range expected
> > as fsck gets more optimized and online fsck comes.
> 
> 30 bits would not be enough even today:
> 
> buczek@done:~$ df -i /amd/done/C/C8024
> Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> 
> So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> 
> And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].

Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
a unique identifier; time to start looking at how to extend statx.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: How to cope with subvolumes and snapshots on muti-user systems?
  2023-12-08 20:02                             ` Kent Overstreet
@ 2023-12-11 22:43                               ` NeilBrown
  2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
                                                   ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-11 22:43 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Donald Buczek, linux-bcachefs, Stefan Krueger

On Sat, 09 Dec 2023, Kent Overstreet wrote:
> On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > On 12/8/23 03:49, Kent Overstreet wrote:
> > 
> > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > bits for actually allocating inodes out of.
> > > 
> > > That'll be enough for the vast, vast majority of users, but exceeding
> > > that limit is already something we're technically capable of: we're
> > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > as fsck gets more optimized and online fsck comes.
> > 
> > 30 bits would not be enough even today:
> > 
> > buczek@done:~$ df -i /amd/done/C/C8024
> > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > 
> > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).

only 30 bits though.  So it is a long way before you use all 32 bits.
How many volumes do you have?

> > 
> > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> 
> Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> a unique identifier; time to start looking at how to extend statx.
> 

64 should be plenty...

If you have 32 bits for free allocation, and 7 bits for sharding across
128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
not the full 500 billion for 39 bits, but if you actually spread the
load over all the shards, then certainly tens of billions.

If you use 22 bits for volume number and 42 bits for inodes in a volume,
then you can spend 7 on sharding and still have room for 55 of Donald's
filesystems to be allocated by each CPU.

And if Donald only needs thousands of volumes, not millions, then he
could configure for a whole lot more headroom.

In fact, if you use the 64 bits of the vfs inode number by filling in bits from
the fs-inode number from one end, and bits from the volume number from
the other end, then you don't need to pre-configure how the 64 bits are
shared.
You record inum-bits and volnum bits in the filesystem metadata, and
increase either as needed.  Once the sum hits 64, you start returning
ENOSPC for new files or new volumes.
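
A rough sketch of that two-ends scheme (the names are invented for
illustration; inum_bits and vol_bits would live in the filesystem
metadata):

    /* Sketch: inode bits fill from bit 0 upward, volume bits from
     * bit 63 downward; once the two would meet, report ENOSPC.
     * Assumes 0 < inum_bits < 64 and inum_bits + vol_bits <= 64. */
    #include <stdint.h>
    #include <errno.h>

    static int pack_vfs_ino(unsigned inum_bits, unsigned vol_bits,
                            uint64_t inum, uint64_t vol, uint64_t *out)
    {
            if (inum_bits + vol_bits > 64)
                    return -ENOSPC;        /* 64 bits exhausted */
            if (inum >> inum_bits || (vol_bits && vol >> vol_bits))
                    return -ENOSPC;        /* caller grows *_bits first */
            *out = inum;
            if (vol_bits)
                    *out |= vol << (64 - vol_bits);
            return 0;
    }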

There will come a day when 64 bits is not enough for inodes in a single
filesystem.  Today is not that day.

NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 22:43                               ` NeilBrown
@ 2023-12-11 23:32                                 ` Kent Overstreet
  2023-12-11 23:53                                   ` NeilBrown
  2023-12-12  0:25                                   ` David Howells
  2023-12-11 23:40                                 ` David Howells
  2023-12-13 12:43                                 ` How to cope with subvolumes and snapshots on muti-user systems? Donald Buczek
  2 siblings, 2 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-11 23:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > 
> > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > bits for actually allocating inodes out of.
> > > > 
> > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > that limit is already something we're technically capable of: we're
> > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > as fsck gets more optimized and online fsck comes.
> > > 
> > > 30 bits would not be enough even today:
> > > 
> > > buczek@done:~$ df -i /amd/done/C/C8024
> > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > 
> > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> 
> only 30 bits though.  So it is a long way before you use all 32 bits.
> How many volumes do you have?
> 
> > > 
> > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > 
> > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > a unique identifier; time to start looking at how to extend statx.
> > 
> 
> 64 should be plenty...
> 
> If you have 32 bits for free allocation, and 7 bits for sharding across
> 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> not the full 500 billion for 39 bits, but if you actually spread the
> load over all the shards, then certainly tens of billions.
> 
> If you use 22 bits for volume number and 42 bits for inodes in a volume,
> then you can spend 7 on sharding and still have room for 55 of Donald's
> filesystems to be allocated by each CPU.
> 
> And if Donald only needs thousands of volumes, not millions, then he
> could configure for a whole lot more headroom.
> 
> In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> the fs-inode number from one end, and bits from the volume number from
> the other end, then you don't need to pre-configure how the 64 bits are
> shared.
> You record inum-bits and volnum bits in the filesystem metadata, and
> increase either as needed.  Once the sum hits 64, you start returning
> ENOSPC for new files or new volumes.
> 
> There will come a day when 64 bits is not enough for inodes in a single
> filesystem.  Today is not that day.

Except filesystems are growing all the time: that leaves almost no room
for growth and then we're back in the world where users had to guess how
many inodes they were going to need in their filesystem; and if we put
this off now we're just kicking the can down the road until it
becomes really pressing and urgent to solve.

No, we need to come up with something better.

I was chatting a bit with David Howells on IRC about this, and floated
adding the file handle to statx. It looks like there's enough space
reserved to make this feasible - probably going with a fixed maximum
size of 128-256 bits.

Thoughts?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 22:43                               ` NeilBrown
  2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
@ 2023-12-11 23:40                                 ` David Howells
  2023-12-12 20:59                                   ` Kent Overstreet
  2023-12-13 12:43                                 ` How to cope with subvolumes and snapshots on muti-user systems? Donald Buczek
  2 siblings, 1 reply; 92+ messages in thread
From: David Howells @ 2023-12-11 23:40 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: dhowells, NeilBrown, Donald Buczek, linux-bcachefs,
	Stefan Krueger, linux-fsdevel

Kent Overstreet <kent.overstreet@linux.dev> wrote:

> I was chatting a bit with David Howells on IRC about this, and floated
> adding the file handle to statx. It looks like there's enough space
> reserved to make this feasible - probably going with a fixed maximum
> size of 128-256 bits.

We can always save the last bit to indicate extension space/extension record,
so we're not that strapped for space.

David


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
@ 2023-12-11 23:53                                   ` NeilBrown
  2023-12-12  0:05                                     ` Kent Overstreet
  2023-12-12  0:25                                   ` David Howells
  1 sibling, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-11 23:53 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, 12 Dec 2023, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > 
> > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > bits for actually allocating inodes out of.
> > > > > 
> > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > that limit is already something we're technically capable of: we're
> > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > as fsck gets more optimized and online fsck comes.
> > > > 
> > > > 30 bits would not be enough even today:
> > > > 
> > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > 
> > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > 
> > only 30 bits though.  So it is a long way before you use all 32 bits.
> > How many volumes do you have?
> > 
> > > > 
> > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > 
> > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > a unique identifier; time to start looking at how to extend statx.
> > > 
> > 
> > 64 should be plenty...
> > 
> > If you have 32 bits for free allocation, and 7 bits for sharding across
> > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > not the full 500 billion for 39 bits, but if you actually spread the
> > load over all the shards, then certainly tens of billions.
> > 
> > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > then you can spend 7 on sharding and still have room for 55 of Donald's
> > filesystems to be allocated by each CPU.
> > 
> > And if Donald only needs thousands of volumes, not millions, then he
> > could configure for a whole lot more headroom.
> > 
> > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > the fs-inode number from one end, and bits from the volume number from
> > the other end, then you don't need to pre-configure how the 64 bits are
> > shared.
> > You record inum-bits and volnum bits in the filesystem metadata, and
> > increase either as needed.  Once the sum hits 64, you start returning
> > ENOSPC for new files or new volumes.
> > 
> > There will come a day when 64 bits is not enough for inodes in a single
> > filesystem.  Today is not that day.
> 
> Except filesystems are growing all the time: that leaves almost no room
> for growth and then we're back in the world where users had to guess how
> many inodes they were going to need in their filesystem; and if we put
> this off now we're just kicking the can down the road until when it
> becomes really pressing and urgent to solve.
> 
> No, we need to come up with something better.
> 
> I was chatting a bit with David Howells on IRC about this, and floated
> adding the file handle to statx. It looks like there's enough space
> reserved to make this feasible - probably going with a fixed maximum
> size of 128-256 bits.

Unless there is room for 128 bytes (1024 bits), it cannot be used for
NFSv4.  That would be ... sad.

> 
> Thoughts?
> 

I'm completely in favour of exporting the (full) filehandle through
statx. (If the application asked for the filehandle, it will expect a
larger structure to be returned.  We don't need to use the currently
reserved space).

I'm completely in favour of updating user-space tools to use the
filehandle to check if two handles are for the same file.

I'm not in favour of any filesystem depending on this for correct
functionality today.  As long as the filesystem isn't so large that
inum+volnum simply cannot fit in 64 bits, we should make a reasonable
effort to present them both in 64 bits.  Depending on the filehandle is a
good plan for long term growth, not for basic functionality today.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 23:53                                   ` NeilBrown
@ 2023-12-12  0:05                                     ` Kent Overstreet
  2023-12-12  0:59                                       ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12  0:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > > 
> > > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > > bits for actually allocating inodes out of.
> > > > > > 
> > > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > > that limit is already something we're technically capable of: we're
> > > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > > as fsck gets more optimized and online fsck comes.
> > > > > 
> > > > > 30 bits would not be enough even today:
> > > > > 
> > > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > > 
> > > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > > 
> > > only 30 bits though.  So it is a long way before you use all 32 bits.
> > > How many volumes do you have?
> > > 
> > > > > 
> > > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > > 
> > > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > > a unique identifier; time to start looking at how to extend statx.
> > > > 
> > > 
> > > 64 should be plenty...
> > > 
> > > If you have 32 bits for free allocation, and 7 bits for sharding across
> > > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > > not the full 500 billion for 39 bits, but if you actually spread the
> > > load over all the shards, then certainly tens of billions.
> > > 
> > > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > > then you can spend 7 on sharding and still have room for 55 of Donald's
> > > filesystems to be allocated by each CPU.
> > > 
> > > And if Donald only needs thousands of volumes, not millions, then he
> > > could configure for a whole lot more headroom.
> > > 
> > > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > > the fs-inode number from one end, and bits from the volume number from
> > > the other end, then you don't need to pre-configure how the 64 bits are
> > > shared.
> > > You record inum-bits and volnum bits in the filesystem metadata, and
> > > increase either as needed.  Once the sum hits 64, you start returning
> > > ENOSPC for new files or new volumes.
> > > 
> > > There will come a day when 64 bits is not enough for inodes in a single
> > > filesystem.  Today is not that day.
> > 
> > Except filesystems are growing all the time: that leaves almost no room
> > for growth and then we're back in the world where users had to guess how
> > many inodes they were going to need in their filesystem; and if we put
> > this off now we're just kicking the can down the road until when it
> > becomes really pressing and urgent to solve.
> > 
> > No, we need to come up with something better.
> > 
> > I was chatting a bit with David Howells on IRC about this, and floated
> > adding the file handle to statx. It looks like there's enough space
> > reserved to make this feasible - probably going with a fixed maximum
> > size of 128-256 bits.
> 
> Unless there is room for 128 bytes (1024bits), it cannot be used for
> NFSv4.  That would be ... sad.

NFSv4 specs that for the maximum size? That is pretty hefty...

> > Thoughts?
> > 
> 
> I'm completely in favour of exporting the (full) filehandle through
> statx. (If the application asked for the filehandle, it will expect a
> larger structure to be returned.  We don't need to use the currently
> reserved space).
> 
> I'm completely in favour of updating user-space tools to use the
> filehandle to check if two handles are for the same file.
> 
> I'm not in favour of any filesystem depending on this for correct
> functionality today.  As long as the filesystem isn't so large that
> inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> effort to present them both in 64 bits.  Depending on the filehandle is a
> good plan for long term growth, not for basic functionality today.

My standing policy in these situations is that I'll do the stopgap/hacky
measure... but not before doing actual, real work on the longterm
solution :)

So if we're all in favor of statx as the real long term solution, how
about we see how far we get with that?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
  2023-12-11 23:53                                   ` NeilBrown
@ 2023-12-12  0:25                                   ` David Howells
  1 sibling, 0 replies; 92+ messages in thread
From: David Howells @ 2023-12-12  0:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: dhowells, Kent Overstreet, Donald Buczek, linux-bcachefs,
	Stefan Krueger, linux-fsdevel, jaltman

NeilBrown <neilb@suse.de> wrote:

> I'm not in favour of any filesystem depending on this for correct
> functionality today.  As long as the filesystem isn't so large that
> inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> effort to present them both in 64 bits.

The Auristor version of AFS (which is supported by the in-kernel afs
filesystem) has a file handle (FID) that consists of a 64-bit volume ID (which
is arguably a superblock-level thing), a 96-bit vnode ID (equivalent to the
inode number) and a 32-bit uniquifier.  I don't think the capacity of these
values is fully utilised by the server... yet, but I can't support them
correctly with the UAPI that we have.

Allowing some expansion of stx_ino and/or replacing that with an stx_fid would
be helpful in that regard.  That said, getting userspace to be able to handle
this is I think going to be a large undertaking with lots of auditing
required.

David


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  0:05                                     ` Kent Overstreet
@ 2023-12-12  0:59                                       ` NeilBrown
  2023-12-12  1:10                                         ` Kent Overstreet
                                                           ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-12  0:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, 12 Dec 2023, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > > On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
> > > > > > On 12/8/23 03:49, Kent Overstreet wrote:
> > > > > > 
> > > > > > > We really only need 6 or 7 bits out of the inode number for sharding;
> > > > > > > then 20-32 bits (nobody's going to have a billion snapshots; a million
> > > > > > > is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
> > > > > > > bits for actually allocating inodes out of.
> > > > > > > 
> > > > > > > That'll be enough for the vast, vast majority of users, but exceeding
> > > > > > > that limit is already something we're technically capable of: we're
> > > > > > > currently seeing filesystems well over 100 TB, petabyte range expected
> > > > > > > as fsck gets more optimized and online fsck comes.
> > > > > > 
> > > > > > 30 bits would not be enough even today:
> > > > > > 
> > > > > > buczek@done:~$ df -i /amd/done/C/C8024
> > > > > > Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> > > > > > /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
> > > > > > 
> > > > > > So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> > > > 
> > > > only 30 bits though.  So it is a long way before you use all 32 bits.
> > > > How many volumes do you have?
> > > > 
> > > > > > 
> > > > > > And if the idea to produce unique inode numbers by hashing the filehandle into 64 is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bit is already over 1% [1].
> > > > > 
> > > > > Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
> > > > > a unique identifier; time to start looking at how to extend statx.
> > > > > 
> > > > 
> > > > 64 should be plenty...
> > > > 
> > > > If you have 32 bits for free allocation, and 7 bits for sharding across
> > > > 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> > > > not the full 500 billion for 39 bits, but if you actually spread the
> > > > load over all the shards, then certainly tens of billions.
> > > > 
> > > > If you use 22 bits for volume number and 42 bits for inodes in a volume,
> > > > then you can spend 7 on sharding and still have room for 55 of Donald's
> > > > filesystems to be allocated by each CPU.
> > > > 
> > > > And if Donald only needs thousands of volumes, not millions, then he
> > > > could configure for a whole lot more headroom.
> > > > 
> > > > In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> > > > the fs-inode number from one end, and bits from the volume number from
> > > > the other end, then you don't need to pre-configure how the 64 bits are
> > > > shared.
> > > > You record inum-bits and volnum bits in the filesystem metadata, and
> > > > increase either as needed.  Once the sum hits 64, you start returning
> > > > ENOSPC for new files or new volumes.
> > > > 
> > > > There will come a day when 64 bits is not enough for inodes in a single
> > > > filesystem.  Today is not that day.
> > > 
> > > Except filesystems are growing all the time: that leaves almost no room
> > > for growth and then we're back in the world where users had to guess how
> > > many inodes they were going to need in their filesystem; and if we put
> > > this off now we're just kicking the can down the road until when it
> > > becomes really pressing and urgent to solve.
> > > 
> > > No, we need to come up with something better.
> > > 
> > > I was chatting a bit with David Howells on IRC about this, and floated
> > > adding the file handle to statx. It looks like there's enough space
> > > reserved to make this feasible - probably going with a fixed maximum
> > > size of 128-256 bits.
> > 
> > Unless there is room for 128 bytes (1024bits), it cannot be used for
> > NFSv4.  That would be ... sad.
> 
> NFSv4 specs that for the maximum size? That is pretty hefty...

It is - but it needs room to identify the filesystem and it needs to be
stable across time.  That need is more than a local filesystem needs.

NFSv2 allowed 32 bytes, which is enough for a 16-byte filesystem
UUID, an 8-byte inum and an 8-byte generation number.  But only just.

NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
situation.

NFSv4 doubled it again because .... who knows.  "why not" I guess.
Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
wants. (28 when the export point is not the root of the filesystem).
I suspect this always fits within an NFSv3 handle except when
re-exporting an NFS filesystem.  NFS re-export is an interesting case...


> 
> > > Thoughts?
> > > 
> > 
> > I'm completely in favour of exporting the (full) filehandle through
> > statx. (If the application asked for the filehandle, it will expect a
> > larger structure to be returned.  We don't need to use the currently
> > reserved space).
> > 
> > I'm completely in favour of updating user-space tools to use the
> > filehandle to check if two handles are for the same file.
> > 
> > I'm not in favour of any filesystem depending on this for correct
> > functionality today.  As long as the filesystem isn't so large that
> > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > effort to present them both in 64 bits.  Depending on the filehandle is a
> > good plan for long term growth, not for basic functionality today.
> 
> My standing policy in these situations is that I'll do the stopgap/hacky
> measure... but not before doing actual, real work on the longterm
> solution :)

Eminently sensible.

> 
> So if we're all in favor of statx as the real long term solution, how
> about we see how far we get with that?
> 

I suggest:

 STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
                              same inode number

 
 __u64 stx_vol     Volume identifier.  Two files with the same stx_vol
                   and stx_ino MUST be the same file.  Exact meaning of volumes
                   is filesys-specific
 
 STATX_VOL         Want stx_vol

  __u8 stx_handle_len  Length of stx_handle if present
  __u8 stx_handle[128] Unique stable identifier for this file.  Will
                       NEVER be reused for a different file.
                       This appears AFTER __statx_pad2, beyond
                       the current 'struct statx'.
 STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
                   receiving statx info has at least
                   sizeof(struct statx)+128 bytes.

I think both the handle and the vol can be useful.
NFS can provide stx_handle but not stx_vol.  It is the thing
to use for equality testing, but it is only needed if
STATX_ATTR_INUM_NOT_UNIQUE is set.
stx_vol is useful for "du -x" or maybe "du --one-volume" or similar.
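
Hypothetical userspace usage, if something along these lines were
merged - to be clear, STATX_VOL, STATX_HANDLE, stx_vol and stx_handle
are the proposal above, not current kernel UAPI:

    /* Hypothetical: uses the proposed fields above, NOT current
     * UAPI.  Per the proposal, the handle is returned past the end
     * of the existing struct statx, so the buffer must be bigger. */
    struct statx_with_handle {
            struct statx sx;
            unsigned char handle[128];  /* receives stx_handle */
    };

    struct statx_with_handle buf;

    statx(AT_FDCWD, path, 0,
          STATX_INO | STATX_VOL | STATX_HANDLE, &buf.sx);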


Note that we *could* add stx_vol to NFSv4.2.  It is designed for
incremental extension.  I suspect we wouldn't want to rush into this,
but to wait to see if different volume-capable filesystems have other
details of volumes that are common and can usefully be exported by statx
- or NFS.

NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  0:59                                       ` NeilBrown
@ 2023-12-12  1:10                                         ` Kent Overstreet
  2023-12-12  2:13                                           ` NeilBrown
  2023-12-12  5:53                                         ` Dave Chinner
  2023-12-12  7:03                                         ` David Howells
  2 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12  1:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > NFSv4 specs that for the maximum size? That is pretty hefty...
> 
> It is - but it needs room to identify the filesystem and it needs to be
> stable across time.  That need is more than a local filesystem needs.
> 
> NFSv2 allowed 32 bytes which is enough for a 16 byte filesys uuid, 8
> byte inum and 8byte generation num.  But only just.
> 
> NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
> situation.
> 
> NFSv4 doubled it again because .... who knows.  "why not" I guess.
> Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
> wants. (28 when the export point is not the root of the filesystem).
> I suspect this always fits within an NFSv3 handle except when
> re-exporting an NFS filesystem.  NFS re-export is an interesting case...

Now I'm really curious - i_generation wasn't enough? Are we including
filesystem UUIDs?

I suppose if we want to be able to round trip this stuff we do need to
allocate space for it, even if a local filesystem would never include
it.

> I suggest:
> 
>  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
>                               same inode number
> 
>  
>  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and 
>                    stx_ino MUST be the same.  Exact meaning of volumes
>                    is filesys-specific

NFS reexport that you mentioned previously makes it seem like this
guarantee is impossible to provide in general (so I'd leave it out
entirely, it's just something for people to trip over).

But we definitely want stx_vol in there. Another thing that people ask
for is a way to ask "is this a subvolume root?" - we should make sure
that's clearly specified, or can we just include a bit for it?

>  STATX_VOL         Want stx_vol
> 
>   __u8 stx_handle_len  Length of stx_handle if present
>   __u8 stx_handle[128] Unique stable identifier for this file.  Will
>                        NEVER be reused for a different file.
>                        This appears AFTER __statx_pad2, beyond
>                        the current 'struct statx'.
>  STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
>                    receiving statx info has at least
>                    sizeof(struct statx)+128 bytes.
> 
> I think both the handle and the vol can be useful.
> NFS can provide stx_handle but not stx_vol.  It is the thing
> to use for equality testing, but it is only needed if
> STATX_ATTR_INUM_NOT_UNIQUE is set.
> stx_vol is useful for "du -x" or maybe "du --one-volume" or similar.
> 
> 
> Note that we *could* add stx_vol to NFSv4.2.  It is designed for
> incremental extension.  I suspect we wouldn't want to rush into this,
> but to wait to see if different volume-capable filesystems have other
> details of volumes that are common and can usefully be exported by statx

Sounds reasonable

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  1:10                                         ` Kent Overstreet
@ 2023-12-12  2:13                                           ` NeilBrown
  2023-12-12  2:24                                             ` Kent Overstreet
  2023-12-12  9:08                                             ` Christian Brauner
  0 siblings, 2 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-12  2:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, 12 Dec 2023, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > NFSv4 specs that for the maximum size? That is pretty hefty...
> > 
> > It is - but it needs room to identify the filesystem and it needs to be
> > stable across time.  That need is more than a local filesystem needs.
> > 
> > NFSv2 allowed 32 bytes which is enough for a 16 byte filesys uuid, 8
> > byte inum and 8byte generation num.  But only just.
> > 
> > NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
> > situation.
> > 
> > NFSv4 doubled it again because .... who knows.  "why not" I guess.
> > Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
> > wants. (28 when the export point is not the root of the filesystem).
> > I suspect this always fits within an NFSv3 handle except when
> > re-exporting an NFS filesystem.  NFS re-export is an interesting case...
> 
> Now I'm really curious - i_generation wasn't enough? Are we including
> filesystem UUIDs?

i_generation was invented so that it could be inserted into the NFS
filehandle.

The NFS filehandle is opaque.  It likely contains an inode number, a
generation number, and a filesystem identifier.  But it is not possible
to extract those from the handle.

> 
> I suppose if we want to be able to round trip this stuff we do need to
> allocate space for it, even if a local filesystem would never include
> it.
> 
> > I suggest:
> > 
> >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> >                               same inode number
> > 
> >  
> >  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and 
> >                    stx_ino MUST be the same.  Exact meaning of volumes
> >                    is filesys-specific
> 
> NFS reexport that you mentioned previously makes it seem like this
> guarantee is impossible to provide in general (so I'd leave it out
> entirely, it's just something for people to trip over).

NFS would not set stx_vol and would not return STATX_VOL in stx_mask.
So it would not attempt to provide that guarantee.

Maybe we don't need to explicitly make this guarantee.

> 
> But we definitely want stx_vol in there. Another thing that people ask
> for is a way to ask "is this a subvolume root?" - we should make sure
> that's clearly specified, or can we just include a bit for it?

The standard way to test for a filesystem root - or mount point at least -
is to stat the directory in question and its parent (..) and see if they
have the same st_dev or not.
Applying the same logic to volumes means that a single stx_vol number is
sufficient.
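
For reference, the st_dev test in code - a rough sketch that ignores
the "/" and bind-mount corner cases:

    /* Mount-point test: a directory whose st_dev differs from its
     * parent's is (at least) a mount point. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int is_mount_point(const char *dir)
    {
            struct stat st, parent;
            char buf[4096];

            snprintf(buf, sizeof(buf), "%s/..", dir);
            if (stat(dir, &st) < 0 || stat(buf, &parent) < 0)
                    return -1;
            return st.st_dev != parent.st_dev;
    }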

I'm not strongly against a STATX_ATTR_VOL_ROOT flag, provided everyone
agrees what it means and we cannot imagine any awkward corner-cases
(like a 'root' being different from a 'mount point').

> 
> >  STATX_VOL         Want stx_vol
> > 
> >   __u8 stx_handle_len  Length of stx_handle if present
> >   __u8 stx_handle[128] Unique stable identifier for this file.  Will
> >                        NEVER be reused for a different file.
> >                        This appears AFTER __statx_pad2, beyond
> >                        the current 'struct statx'.
> >  STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
> >                    receiving statx info has at least
> >                    sizeof(struct statx)+128 bytes.
> > 
> > I think both the handle and the vol can be useful.
> > NFS can provide stx_handle but not stx_vol.  It is the thing
> > to use for equality testing, but it is only needed if
> > STATX_ATTR_INUM_NOT_UNIQUE is set.
> > stx_vol is useful for "du -x" or maybe "du --one-volume" or similar.
> > 
> > 
> > Note that we *could* add stx_vol to NFSv4.2.  It is designed for
> > incremental extension.  I suspect we wouldn't want to rush into this,
> > but to wait to see if different volume-capable filesystems have other
> > details of volumes that are common and can usefully be exported by statx
> 
> Sounds reasonable
> 

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  2:13                                           ` NeilBrown
@ 2023-12-12  2:24                                             ` Kent Overstreet
  2023-12-12  9:08                                             ` Christian Brauner
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12  2:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, Dec 12, 2023 at 01:13:07PM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > I suppose if we want to be able to round trip this stuff we do need to
> > allocate space for it, even if a local filesystem would never include
> > it.
> > 
> > > I suggest:
> > > 
> > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > >                               same inode number
> > > 
> > >  
> > >  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and 
> > >                    stx_ino MUST be the same.  Exact meaning of volumes
> > >                    is filesys-specific
> > 
> > NFS reexport that you mentioned previously makes it seem like this
> > guarantee is impossible to provide in general (so I'd leave it out
> > entirely, it's just something for people to trip over).
> 
> NFS would not set stx_vol and would not return STATX_VOL in stx_mask.
> So it would not attempt to provide that guarantee.
> 
> Maybe we don't need to explicitly make this guarantee.

I tend to lean towards pushing people to only use the filehandle for
testing if files are the same, and discouraging using stx_vol for this
purpose.

Since the filehandle will include the generation number, it lets us
permanently fix time-of-use races that stx_vol would still be subject to.

> > But we definitely want stx_vol in there. Another thing that people ask
> > for is a way to ask "is this a subvolume root?" - we should make sure
> > that's clearly specified, or can we just include a bit for it?
> 
> The start way to test for a filesystem root - or mount point at least -
> is to stat the directory in question and its parent (..) and see if the
> have the same st_dev or not.
> Applying the same logic to volumes means that a single stx_vol number is
> sufficient.
> 
> I'm not strongly against a STATX_ATTR_VOL_ROOT flag providing everyone
> agrees what it means that we cannot imagine any awkward corner-cases
> (like a 'root' being different from a 'mount point').

I'd like to do that; it lets us properly define some corner cases (is
the filesystem also a subvolume root if it's got the same volume ID as
the parent dir, on another filesystem? Let's make sure that's a yes).

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  0:59                                       ` NeilBrown
  2023-12-12  1:10                                         ` Kent Overstreet
@ 2023-12-12  5:53                                         ` Dave Chinner
  2023-12-12  6:32                                           ` Amir Goldstein
                                                             ` (2 more replies)
  2023-12-12  7:03                                         ` David Howells
  2 siblings, 3 replies; 92+ messages in thread
From: Dave Chinner @ 2023-12-12  5:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > Thoughts?
> > > > 
> > > 
> > > I'm completely in favour of exporting the (full) filehandle through
> > > statx. (If the application asked for the filehandle, it will expect a
> > > larger structure to be returned.  We don't need to use the currently
> > > reserved space).
> > > 
> > > I'm completely in favour of updating user-space tools to use the
> > > filehandle to check if two handles are for the same file.
> > > 
> > > I'm not in favour of any filesystem depending on this for correct
> > > functionality today.  As long as the filesystem isn't so large that
> > > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > > effort to present them both in 64 bits.  Depending on the filehandle is a
> > > good plan for long term growth, not for basic functionality today.
> > 
> > My standing policy in these situations is that I'll do the stopgap/hacky
> > measure... but not before doing actual, real work on the longterm
> > solution :)
> 
> Eminently sensible.
> 
> > 
> > So if we're all in favor of statx as the real long term solution, how
> > about we see how far we get with that?
> > 
> 
> I suggest:
> 
>  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
>                               same inode number
> 
>  
>  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and 
>                    stx_ino MUST be the same.  Exact meaning of volumes
>                    is filesys-specific
>  
>  STATX_VOL         Want stx_vol
> 
>   __u8 stx_handle_len  Length of stx_handle if present
>   __u8 stx_handle[128] Unique stable identifier for this file.  Will
>                        NEVER be reused for a different file.
>                        This appears AFTER __statx_pad2, beyond
>                        the current 'struct statx'.
>  STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
>                    receiving statx info has at least
>                    sizeof(struct statx)+128 bytes.

Hmmm.

Doesn't anyone else see or hear the elephant trumpeting loudly in
the middle of the room?

I mean, we already have name_to_handle_at() for userspace to get a
unique, opaque, filesystem defined file handle for any given file.
It's the same filehandle that filesystems hand to the nfsd so nfs
clients can uniquely identify the file they are asking the nfsd to
operate on.

The contents of these filehandles are entirely defined by the
filesystem and completely opaque to the user. The only thing that
parses the internal contents of the handle is the filesystem itself.
Therefore, as long as the fs encodes the information it needs into the
handle to determine what subvol/snapshot the inode belongs to when
the handle is passed back to it (e.g. from open_by_handle_at()) then
nothing else needs to care how it is encoded.

So can someone please explain to me why we need to try to re-invent
a generic filehandle concept in statx when we already have a
working and widely supported user API that provides exactly
this functionality?
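
For example, comparing two paths for identity with the existing API
might look like the sketch below (error handling trimmed; note that
handles are only unique within one filesystem, so a same-fs check via
st_dev or the returned mount IDs is still needed):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fetch a handle for path; caller frees.  MAX_HANDLE_SZ is 128. */
    static struct file_handle *get_handle(const char *path, int *mnt_id)
    {
            struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

            if (!fh)
                    return NULL;
            fh->handle_bytes = MAX_HANDLE_SZ;
            if (name_to_handle_at(AT_FDCWD, path, fh, mnt_id, 0) < 0) {
                    free(fh);
                    return NULL;
            }
            return fh;
    }

    /* Same file iff the handles are byte-identical (and the caller
     * has checked both paths are on the same filesystem). */
    static int same_handle(const struct file_handle *a,
                           const struct file_handle *b)
    {
            return a->handle_type == b->handle_type &&
                   a->handle_bytes == b->handle_bytes &&
                   !memcmp(a->f_handle, b->f_handle, a->handle_bytes);
    }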

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  5:53                                         ` Dave Chinner
@ 2023-12-12  6:32                                           ` Amir Goldstein
  2023-12-12  8:56                                             ` Christian Brauner
  2023-12-12  9:10                                             ` David Howells
  2023-12-12  9:10                                           ` file handle in statx Donald Buczek
  2023-12-12 15:21                                           ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
  2 siblings, 2 replies; 92+ messages in thread
From: Amir Goldstein @ 2023-12-12  6:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Kent Overstreet, Donald Buczek, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel, Christian Brauner

On Tue, Dec 12, 2023 at 7:53 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > > Thoughts?
> > > > >
> > > >
> > > > I'm completely in favour of exporting the (full) filehandle through
> > > > statx. (If the application asked for the filehandle, it will expect a
> > > > larger structure to be returned.  We don't need to use the currently
> > > > reserved space).
> > > >
> > > > I'm completely in favour of updating user-space tools to use the
> > > > filehandle to check if two handles are for the same file.
> > > >
> > > > I'm not in favour of any filesystem depending on this for correct
> > > > functionality today.  As long as the filesystem isn't so large that
> > > > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > > > effort to present them both in 64 bits.  Depending on the filehandle is a
> > > > good plan for long term growth, not for basic functionality today.
> > >
> > > My standing policy in these situations is that I'll do the stopgap/hacky
> > > measure... but not before doing actual, real work on the longterm
> > > solution :)
> >
> > Eminently sensible.
> >
> > >
> > > So if we're all in favor of statx as the real long term solution, how
> > > about we see how far we get with that?
> > >
> >
> > I suggest:
> >
> >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> >                               same inode number
> >
> >
> >  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and
> >                    stx_ino MUST be the same.  Exact meaning of volumes
> >                    is filesys-specific
> >
> >  STATX_VOL         Want stx_vol
> >
> >   __u8 stx_handle_len  Length of stx_handle if present
> >   __u8 stx_handle[128] Unique stable identifier for this file.  Will
> >                        NEVER be reused for a different file.
> >                        This appears AFTER __statx_pad2, beyond
> >                        the current 'struct statx'.
> >  STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
> >                    receiving statx info has at least
> >                    sizeof(struct statx)+128 bytes.
>
> Hmmm.
>
> Doesn't anyone else see or hear the elephant trumpeting loudly in
> the middle of the room?
>
> I mean, we already have name_to_handle_at() for userspace to get a
> unique, opaque, filesystem defined file handle for any given file.
> It's the same filehandle that filesystems hand to the nfsd so nfs
> clients can uniquely identify the file they are asking the nfsd to
> operate on.
>
> The contents of these filehandles is entirely defined by the file
> system and completely opaque to the user. The only thing that
> parses the internal contents of the handle is the filesystem itself.
> Therefore, as long as the fs encodes the information it needs into the
> handle to determine what subvol/snapshot the inode belongs to when
> the handle is passed back to it (e.g. from open_by_handle_at()) then
> nothing else needs to care how it is encoded.
>
> So can someone please explain to me why we need to try to re-invent
> a generic filehandle concept in statx when we already have a
> have working and widely supported user API that provides exactly
> this functionality?

Yeh.

Not to mention that since commit 64343119d7b8 ("exportfs: support encoding
non-decodeable file handles by default"), exporting file handles as strong
object identifiers is not limited to filesystems that support NFS export.

All fs now have a default encode_fh() implementation that encodes a file id
of type FILEID_INO64_GEN from { i_ino, i_generation }, and any fs can
define its own encode_fh() operation (e.g. to include a subvol id) even
without implementing the decode fh operations needed for NFS export.
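
To make that concrete, a minimal sketch of a filesystem-private encode_fh()
folding a subvol id into the handle - the fid layout and type value here are
made up and parent encoding is omitted; only the encode_fh() signature and
the *max_len-in-32-bit-words convention match the real exportfs API:

  #include <linux/fs.h>
  #include <linux/exportfs.h>

  #define FILEID_EXAMPLE_SUBVOL 0xf1    /* hypothetical fid type */

  struct example_fid {
          u64 ino;
          u32 gen;
          u32 subvol;
  };

  static int example_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
                               struct inode *parent)
  {
          struct example_fid *fid = (void *)fh;
          int len = sizeof(*fid) / 4;   /* *max_len counts 32-bit words */

          if (*max_len < len) {
                  *max_len = len;
                  return FILEID_INVALID;
          }

          fid->ino    = inode->i_ino;
          fid->gen    = inode->i_generation;
          fid->subvol = 0;      /* fs-specific subvolume id goes here */
          *max_len = len;

          return FILEID_EXAMPLE_SUBVOL;
  }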

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  0:59                                       ` NeilBrown
  2023-12-12  1:10                                         ` Kent Overstreet
  2023-12-12  5:53                                         ` Dave Chinner
@ 2023-12-12  7:03                                         ` David Howells
  2 siblings, 0 replies; 92+ messages in thread
From: David Howells @ 2023-12-12  7:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dhowells, NeilBrown, Kent Overstreet, Donald Buczek,
	linux-bcachefs, Stefan Krueger, linux-fsdevel

Dave Chinner <david@fromorbit.com> wrote:

> I mean, we already have name_to_handle_at() for userspace to get a
> unique, opaque, filesystem defined file handle for any given file.
> It's the same filehandle that filesystems hand to the nfsd so nfs
> clients can uniquely identify the file they are asking the nfsd to
> operate on.

That's along the lines of what I was thinking of when I suggested using a file handle
to Kent.

Question is, though, do we need to also expand stx_ino in some way for apps
that present it to the user?

David


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  6:32                                           ` Amir Goldstein
@ 2023-12-12  8:56                                             ` Christian Brauner
  2023-12-12 15:16                                               ` Kent Overstreet
  2023-12-12  9:10                                             ` David Howells
  1 sibling, 1 reply; 92+ messages in thread
From: Christian Brauner @ 2023-12-12  8:56 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, NeilBrown, Kent Overstreet, Donald Buczek,
	linux-bcachefs, Stefan Krueger, David Howells, linux-fsdevel,
	Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> On Tue, Dec 12, 2023 at 7:53 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > > > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > > > Thoughts?
> > > > > >
> > > > >
> > > > > I'm completely in favour of exporting the (full) filehandle through
> > > > > statx. (If the application asked for the filehandle, it will expect a
> > > > > larger structure to be returned.  We don't need to use the currently
> > > > > reserved space).
> > > > >
> > > > > I'm completely in favour of updating user-space tools to use the
> > > > > filehandle to check if two handles are for the same file.
> > > > >
> > > > > I'm not in favour of any filesystem depending on this for correct
> > > > > functionality today.  As long as the filesystem isn't so large that
> > > > > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > > > > effort to present them both in 64 bits.  Depending on the filehandle is a
> > > > > good plan for long term growth, not for basic functionality today.
> > > >
> > > > My standing policy in these situations is that I'll do the stopgap/hacky
> > > > measure... but not before doing actual, real work on the longterm
> > > > solution :)
> > >
> > > Eminently sensible.
> > >
> > > >
> > > > So if we're all in favor of statx as the real long term solution, how
> > > > about we see how far we get with that?
> > > >
> > >
> > > I suggest:
> > >
> > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > >                               same inode number

This is just ugly with questionable value. A constant reminder of how
broken this is. Exposing the subvolume id also makes this somewhat redundant.

> > >
> > >
> > >  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and
> > >                    stx_ino MUST be the same.  Exact meaning of volumes
> > >                    is filesys-specific

Exposing this makes sense. I've mentioned before that I'd be open to
exporting the subvolume id in statx even though I know some people don't
like that. I still stand by that. Especially now that we have btrfs and
bcachefs, which both have the concept of a subvolume id, it makes sense to
put it into statx().

> > >
> > >  STATX_VOL         Want stx_vol
> > >
> > >   __u8 stx_handle_len  Length of stx_handle if present
> > >   __u8 stx_handle[128] Unique stable identifier for this file.  Will
> > >                        NEVER be reused for a different file.
> > >                        This appears AFTER __statx_pad2, beyond
> > >                        the current 'struct statx'.
> > >  STATX_HANDLE      Want stx_handle_len and stx_handle. Buffer for
> > >                    receiving statx info has at least
> > >                    sizeof(struct statx)+128 bytes.

No, we should not be reinventing another opaque handle. And even if we
were to, it really doesn't belong into statx().

> >
> > Hmmm.
> >
> > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > the middle of the room?
> >
> > I mean, we already have name_to_handle_at() for userspace to get a
> > unique, opaque, filesystem defined file handle for any given file.

I agree.

> > It's the same filehandle that filesystems hand to the nfsd so nfs
> > clients can uniquely identify the file they are asking the nfsd to
> > operate on.
> >
> > The contents of these filehandles is entirely defined by the file
> > system and completely opaque to the user. The only thing that
> > parses the internal contents of the handle is the filesystem itself.
> > Therefore, as long as the fs encodes the information it needs into the
> > handle to determine what subvol/snapshot the inode belongs to when
> > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > nothing else needs to care how it is encoded.
> >
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > working and widely supported user API that provides exactly
> > this functionality?
> 
> Yeh.
> 
> Not to mention that since commit 64343119d7b8 ("exportfs: support encoding
> non-decodeable file handles by default"), exporting file handles as strong
> object identifiers is not limited to filesystems that support NFS export.
> 
> All fs have a default implementation of encode_fh() by encoding a file id
> of type FILEID_INO64_GEN from { i_ino, i_generation } and any fs can
> define its own encode_fh() operation (e.g. to include subvol id) even without
> implementing the decode fh operations needed for NFS export.

I also would like both btrfs and bcachefs to agree.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  2:13                                           ` NeilBrown
  2023-12-12  2:24                                             ` Kent Overstreet
@ 2023-12-12  9:08                                             ` Christian Brauner
  1 sibling, 0 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-12  9:08 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 01:13:07PM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > NFSv4 specs that for the maximum size? That is pretty hefty...
> > > 
> > > It is - but it needs room to identify the filesystem and it needs to be
> > > stable across time.  That need is more than a local filesystem needs.
> > > 
> > > NFSv2 allowed 32 bytes which is enough for a 16-byte filesys uuid, an
> > > 8-byte inum and an 8-byte generation num.  But only just.
> > > 
> > > NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
> > > situation.
> > > 
> > > NFSv4 doubled it again because .... who knows.  "why not" I guess.
> > > Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
> > > wants. (28 when the export point is not the root of the filesystem).
> > > I suspect this always fits within an NFSv3 handle except when
> > > re-exporting an NFS filesystem.  NFS re-export is an interesting case...
> > 
> > Now I'm really curious - i_generation wasn't enough? Are we including
> > filesystem UUIDs?
> 
> i_generation was invented so that it could be inserted into the NFS
> filehandle.
> 
> The NFS filehandle is opaque.  It likely contains an inode number, a
> generation number, and a filesystem identifier.  But it is not possible
> to extract those from the handle.
> 
> > 
> > I suppose if we want to be able to round trip this stuff we do need to
> > allocate space for it, even if a local filesystem would never include
> > it.
> > 
> > > I suggest:
> > > 
> > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > >                               same inode number
> > > 
> > >  
> > >  __u64 stx_vol     Volume identifier.  Two files with same stx_vol and 
> > >                    stx_ino MUST be the same.  Exact meaning of volumes
> > >                    is filesys-specific
> > 
> > NFS reexport that you mentioned previously makes it seem like this
> > guarantee is impossible to provide in general (so I'd leave it out
> > entirely, it's just something for people to trip over).
> 
> NFS would not set stx_vol and would not return STATX_VOL in stx_mask.
> So it would not attempt to provide that guarantee.
> 
> Maybe we don't need to explicitly make this guarantee.
> 
> > 
> > But we definitely want stx_vol in there. Another thing that people ask
> > for is a way to ask "is this a subvolume root?" - we should make sure
> > that's clearly specified, or can we just include a bit for it?
> 
> The standard way to test for a filesystem root - or mount point at least -
> is to stat the directory in question and its parent (..) and see if they
> have the same st_dev or not.

It depends. If you want to figure out whether it's a different
filesystem or a different btrfs subvolume then yes, this generally
works because of changing device ids. But it doesn't work for
bind-mounts as they don't change device numbers. But maybe you and I are
using mount point differently here.
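
For reference, the classic check looks like this - it catches filesystem
boundaries, and subvolume boundaries on filesystems that hand out
per-subvolume anonymous device numbers (e.g. btrfs), but not bind-mounts:

  #include <stdbool.h>
  #include <stdio.h>
  #include <limits.h>
  #include <sys/stat.h>

  /* true if 'dir' and its parent live on different st_dev */
  static bool crosses_dev(const char *dir)
  {
          struct stat st, pst;
          char parent[PATH_MAX];

          if (snprintf(parent, sizeof(parent), "%s/..", dir) >= (int)sizeof(parent))
                  return false;
          if (stat(dir, &st) != 0 || stat(parent, &pst) != 0)
                  return false;
          return st.st_dev != pst.st_dev;
  }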

> Applying the same logic to volumes means that a single stx_vol number is
> sufficient.

Yes, that would generally work.

> 
> I'm not strongly against a STATX_ATTR_VOL_ROOT flag, providing everyone
> agrees what it means and we cannot imagine any awkward corner-cases
> (like a 'root' being different from a 'mount point').

I feel like you might have missed my previous mails where I strongly
argued for the addition of STATX_ATTR_SUBVOLUME_ROOT:

https://lore.kernel.org/linux-btrfs/20231108-herleiten-bezwangen-ffb2821f539e@brauner

The concept of a subvolume root and of a mount point should be kept
separate. Christoph tried mapping subvolumes to vfsmounts, something
that I (and Al) vehemently oppose for various reasons outlined in that
and other long threads.

I still think that we should start with exposing subvolume id first.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12  5:53                                         ` Dave Chinner
  2023-12-12  6:32                                           ` Amir Goldstein
@ 2023-12-12  9:10                                           ` Donald Buczek
  2023-12-12 15:20                                             ` Theodore Ts'o
  2023-12-12 15:21                                           ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
  2 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-12  9:10 UTC (permalink / raw)
  To: Dave Chinner, NeilBrown
  Cc: Kent Overstreet, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On 12/12/23 06:53, Dave Chinner wrote:

> So can someone please explain to me why we need to try to re-invent
> a generic filehandle concept in statx when we already have a
> working and widely supported user API that provides exactly
> this functionality?

name_to_handle_at() is fine, but userspace could profit from being able to retrieve
the filehandle together with the other metadata in a single system call.

One argument regarding stx_vol was that userspace tools which walk the
filesystem tree might want to avoid running blindly into snapshots.

Additionally, if a volume defines the namespace for a set of unique inode numbers,
the tools could continue to use inode numbers just on a per-volume basis.
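
As a sketch, the walker's identity key under that model might look like
this (stx_vol is the proposed field, not merged uapi, so it is stubbed out
below):

  #define _GNU_SOURCE
  #include <sys/stat.h>         /* struct statx */
  #include <linux/types.h>      /* __u64 */

  struct file_key {
          __u64 dev;    /* stx_dev_major/minor packed together */
          __u64 vol;    /* proposed stx_vol; 0 when not reported */
          __u64 ino;
  };

  static void fill_key(const struct statx *stx, struct file_key *k)
  {
          k->dev = ((__u64)stx->stx_dev_major << 32) | stx->stx_dev_minor;
          k->vol = 0;   /* would be stx->stx_vol once STATX_VOL exists */
          k->ino = stx->stx_ino;
  }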

Best

  Donald

> Cheers,
> 
> Dave.

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  6:32                                           ` Amir Goldstein
  2023-12-12  8:56                                             ` Christian Brauner
@ 2023-12-12  9:10                                             ` David Howells
  2023-12-12  9:23                                               ` Christian Brauner
  2023-12-12  9:46                                               ` David Howells
  1 sibling, 2 replies; 92+ messages in thread
From: David Howells @ 2023-12-12  9:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

Christian Brauner <brauner@kernel.org> wrote:

> > > > I suggest:
> > > >
> > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > >                               same inode number
> 
> This is just ugly with questionable value. A constant reminder of how
> broken this is. Exposing the subvolume id also makes this somewhat redundant.

There is an upcoming potential problem where even the 64-bit field I placed in
statx() may be insufficient.  The Auristor AFS server, for example, has a
96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I
just truncate the value to fit and hope that the discarded part will be all
zero, but that's not really a good thing to do - especially when stx_ino is
used programmatically to check for hardlinks.

Would it be better to add an 'stx_ino_2' field and corresponding flag?

David


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:10                                             ` David Howells
@ 2023-12-12  9:23                                               ` Christian Brauner
  2023-12-12  9:28                                                 ` Miklos Szeredi
  2023-12-12  9:46                                               ` David Howells
  1 sibling, 1 reply; 92+ messages in thread
From: Christian Brauner @ 2023-12-12  9:23 UTC (permalink / raw)
  To: David Howells
  Cc: Amir Goldstein, Dave Chinner, NeilBrown, Kent Overstreet,
	Donald Buczek, linux-bcachefs, Stefan Krueger, linux-fsdevel,
	Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 09:10:47AM +0000, David Howells wrote:
> Christian Brauner <brauner@kernel.org> wrote:
> 
> > > > > I suggest:
> > > > >
> > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > >                               same inode number
> > 
> > This is just ugly with questionable value. A constant reminder of how
> > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> 
> There is an upcoming potential problem where even the 64-bit field I placed in
> statx() may be insufficient.  The Auristor AFS server, for example, has a
> 96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I

Is that vnode ID akin to a volume? Because if so you could just
piggy-back on a subvolume id field in statx() and expose it there.

> just truncate the value to fit and hope that the discarded part will be all
> zero, but that's not really a good thing to do - especially when stx_ino is
> used programmatically to check for hardlinks.
> 
> Would it be better to add an 'stx_ino_2' field and corresponding flag?

Would this be meaningfully different from using a file handle?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:23                                               ` Christian Brauner
@ 2023-12-12  9:28                                                 ` Miklos Szeredi
  2023-12-12  9:35                                                   ` Christian Brauner
  0 siblings, 1 reply; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12  9:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Dec 12, 2023 at 09:10:47AM +0000, David Howells wrote:
> > Christian Brauner <brauner@kernel.org> wrote:
> >
> > > > > > I suggest:
> > > > > >
> > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > >                               same inode number
> > >
> > > This is just ugly with questionable value. A constant reminder of how
> > > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> >
> > There is an upcoming potential problem where even the 64-bit field I placed in
> > statx() may be insufficient.  The Auristor AFS server, for example, has a
> > 96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I
>
> Is that vnode ID akin to a volume? Because if so you could just
> piggy-back on a subvolume id field in statx() and expose it there.

And how would exporting a subvolume ID and expecting userspace to take
that into account when checking for hard links be meaningfully
different than expecting userspace to retrieve the file handle and
compare that?

The latter would at least be a generic solution, including stacking fs
inodes, instead of tackling just a specific corner of the problem
space.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:28                                                 ` Miklos Szeredi
@ 2023-12-12  9:35                                                   ` Christian Brauner
  2023-12-12  9:42                                                     ` Miklos Szeredi
  2023-12-12 15:28                                                     ` Kent Overstreet
  0 siblings, 2 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-12  9:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Tue, Dec 12, 2023 at 09:10:47AM +0000, David Howells wrote:
> > > Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > > > > > I suggest:
> > > > > > >
> > > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > > >                               same inode number
> > > >
> > > > This is just ugly with questionable value. A constant reminder of how
> > > > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> > >
> > > There is an upcoming potential problem where even the 64-bit field I placed in
> > > statx() may be insufficient.  The Auristor AFS server, for example, has a
> > > 96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I
> >
> > Is that vnode ID akin to a volume? Because if so you could just
> > piggy-back on a subvolume id field in statx() and expose it there.
> 
> And how would exporting a subvolume ID and expecting userspace to take
> that into account when checking for hard links be meaningfully
> different than expecting userspace to retrieve the file handle and
> compare that?
> 
> The latter would at least be a generic solution, including stacking fs
> inodes, instead of tackling just a specific corner of the problem
> space.

So taking a step back here, please. The original motivation for this
discussion was restricted to handling btrfs - and now bcachefs as well.
Both have a concept of a subvolume so it made sense to go that route.
IOW, it wasn't originally a generic problem or pitched as such.

Would overlayfs be able to utilize an extended inode field as well?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:35                                                   ` Christian Brauner
@ 2023-12-12  9:42                                                     ` Miklos Szeredi
  2023-12-12 13:47                                                       ` Christian Brauner
  2023-12-12 15:28                                                     ` Kent Overstreet
  1 sibling, 1 reply; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12  9:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 10:35, Christian Brauner <brauner@kernel.org> wrote:

> So taking a step back here, please. The original motivation for this
> discussion was restricted to handling btrfs - and now bcachefs as well.
> Both have a concept of a subvolume so it made sense to go that route.
> IOW, it wasn't originally a generic problem or pitched as such.
>
> Would overlayfs be able to utilize an extended inode field as well?

Well, yes, but I don't think that's the right solution.

I think the right solution is to move to using file handles instead of
st_ino; the problem with that is that there's no way the kernel can force
userspace to upgrade.

It might help to have the fh in statx, since that's easier on the
userspace programmer than having to deal with two interfaces (i_ino
won't go away for some time, because of backward compatibility).
OTOH I also don't like the way it would need to be shoehorned into
statx.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:10                                             ` David Howells
  2023-12-12  9:23                                               ` Christian Brauner
@ 2023-12-12  9:46                                               ` David Howells
  1 sibling, 0 replies; 92+ messages in thread
From: David Howells @ 2023-12-12  9:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

Christian Brauner <brauner@kernel.org> wrote:

> > There is an upcoming potential problem where even the 64-bit field I placed
> > in statx() may be insufficient.  The Auristor AFS server, for example, has
> > a 96-bit vnode ID, but I can't properly represent this in stx_ino.
> > Currently, I
> 
> Is that vnode ID akin to a volume? Because if so you could just
> piggy-back on a subvolume id field in statx() and expose it there.

No.  The volume ID is the ID of the volume.  The vnode is the equivalent of an
inode.

> > just truncate the value to fit and hope that the discarded part will be all
> > zero, but that's not really a good thing to do - especially when stx_ino is
> > used programmatically to check for hardlinks.
> > 
> > Would it be better to add an 'stx_ino_2' field and corresponding flag?
> 
> Would this be meaningfully different from using a file handle?

There's also the matter of presenting the "inode number" to the user - "ls -i"
for example.

David


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:42                                                     ` Miklos Szeredi
@ 2023-12-12 13:47                                                       ` Christian Brauner
  2023-12-12 14:06                                                         ` Miklos Szeredi
  0 siblings, 1 reply; 92+ messages in thread
From: Christian Brauner @ 2023-12-12 13:47 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 10:42:58AM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 10:35, Christian Brauner <brauner@kernel.org> wrote:
> 
> > So taking a step back here, please. The original motivation for this
> > discussion was restricted to handling btrfs - and now bcachefs as well.
> > Both have a concept of a subvolume so it made sense to go that route.
> > IOW, it wasn't originally a generic problem or pitched as such.
> >
> > Would overlayfs be able to utilize an extended inode field as well?
> 
> Well, yes, but I don't think that's the right solution.

Exposing the subvolume id in statx() is still fine imho. It's a concept
shared between btrfs and bcachefs and it's pretty useful for interested
userspace that wants to make use of these apis.

> 
> I think the right solution is to move to using file handles instead of
> st_ino; the problem with that is that there's no way the kernel can force
> userspace to upgrade.

That's not our job tbh. I get why this is desirable. What we can do is
advertise better and newer apis but we don't have to make unpleasant
compromises in our interfaces for that.

> 
> It might help to have the fh in statx, since that's easier on the
> userspace programmer than having to deal with two interfaces (i_ino
> won't go away for some time, because of backward compatibility).
> OTOH I also don't like the way it would need to be shoehorned into
> statx.

No, it really doesn't belong into statx().

And besides, the file handle APIs around name_to_handle_at() are already
in wider use than a lot of people think. Not just for the exportfs case:
cgroupfs, for example, uses file handles to provide unique, comparable
identifiers for cgroups.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 13:47                                                       ` Christian Brauner
@ 2023-12-12 14:06                                                         ` Miklos Szeredi
  2023-12-12 15:24                                                           ` Christian Brauner
  0 siblings, 1 reply; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12 14:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 14:48, Christian Brauner <brauner@kernel.org> wrote:

> Exposing the subvolume id in statx() is still fine imho. It's a concept
> shared between btrfs and bcachefs and it's pretty useful for interested
> userspace that wants to make use of these apis.

Exposing subvolume ID should be okay, as long as it's not advertised
as a way to uniquely identify an inode.   Its use should be limited to
finding subvolume boundaries.

> > It might help to have the fh in statx, since that's easier on the
> > userspace programmer than having to deal with two interfaces (i_ino
> > won't go away for some time, because of backward compatibility).
> > OTOH I also don't like the way it would need to be shoehorned into
> > statx.
>
> No, it really doesn't belong into statx().
>
> And besides, the file handle APIs around name_to_handle_at() are already
> in wider use than a lot of people think. Not just for the exportfs case:
> cgroupfs, for example, uses file handles to provide unique, comparable
> identifiers for cgroups.

The issue with name_to_handle_at() is its use of the old, non-unique
mount ID.  Yes, yes, we can get away with

 fd = openat(dfd, path, O_PATH);
 name_to_handle_at(fd, "", ..., AT_EMPTY_PATH);
 statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...);
 close(fd);

But that's *four* syscalls instead of one...
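
Spelled out as a minimal runnable helper (error handling trimmed;
STATX_MNT_ID_UNIQUE assumes the in-flight unique mount id work lands; the
caller allocates fhp with fhp->handle_bytes of payload room):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/stat.h>
  #include <linux/types.h>

  int handle_and_mnt_id(int dfd, const char *path,
                        struct file_handle *fhp, __u64 *mnt_id)
  {
          struct statx stx;
          int mount_id;         /* the old, non-unique mount id */
          int ret = -1;
          int fd = openat(dfd, path, O_PATH);

          if (fd < 0)
                  return -1;
          if (name_to_handle_at(fd, "", fhp, &mount_id, AT_EMPTY_PATH) == 0 &&
              statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &stx) == 0) {
                  *mnt_id = stx.stx_mnt_id;
                  ret = 0;
          }
          close(fd);
          return ret;
  }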

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  8:56                                             ` Christian Brauner
@ 2023-12-12 15:16                                               ` Kent Overstreet
  2023-12-12 15:29                                                 ` Christian Brauner
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 15:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Dave Chinner, NeilBrown, Donald Buczek,
	linux-bcachefs, Stefan Krueger, David Howells, linux-fsdevel,
	Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > >                               same inode number
> 
> This is just ugly with questionable value. A constant reminder of how
> broken this is. Exposing the subvolume id also makes this somewhat redundant.

Oh hell no. We finally get a reasonably productive discussion and I wake
up to you calling things ugly and broken, with no reason or
justification or any response to the justification that's already been
given?

Christian, that's not how we do things. Let's not turn what was a
productive discussion into a trainwreck, OK?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12  9:10                                           ` file handle in statx Donald Buczek
@ 2023-12-12 15:20                                             ` Theodore Ts'o
  2023-12-12 17:15                                               ` Frank Filz
  2023-12-13  7:37                                               ` Donald Buczek
  0 siblings, 2 replies; 92+ messages in thread
From: Theodore Ts'o @ 2023-12-12 15:20 UTC (permalink / raw)
  To: Donald Buczek
  Cc: Dave Chinner, NeilBrown, Kent Overstreet, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> On 12/12/23 06:53, Dave Chinner wrote:
> 
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > working and widely supported user API that provides exactly
> > this functionality?
> 
> name_to_handle_at() is fine, but userspace could profit from being
> able to retrieve the filehandle together with the other metadata in
> a single system call.

Can you say more?  What, specifically, is the application that would
want to do that, and is it really in such a hot path that it would be
a user-visible improvement, let alone something that can actually
be measured?

						- Ted

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  5:53                                         ` Dave Chinner
  2023-12-12  6:32                                           ` Amir Goldstein
  2023-12-12  9:10                                           ` file handle in statx Donald Buczek
@ 2023-12-12 15:21                                           ` Kent Overstreet
  2023-12-12 20:48                                             ` Dave Chinner
  2 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 15:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> Doesn't anyone else see or hear the elephant trumpeting loudly in
> the middle of the room?
> 
> I mean, we already have name_to_handle_at() for userspace to get a
> unique, opaque, filesystem defined file handle for any given file.
> It's the same filehandle that filesystems hand to the nfsd so nfs
> clients can uniquely identify the file they are asking the nfsd to
> operate on.
> 
> The contents of these filehandles is entirely defined by the file
> system and completely opaque to the user. The only thing that
> parses the internal contents of the handle is the filesystem itself.
> Therefore, as long as the fs encodes the information it needs into the
> handle to determine what subvol/snapshot the inode belongs to when
> the handle is passed back to it (e.g. from open_by_handle_at()) then
> nothing else needs to care how it is encoded.
> 
> So can someone please explain to me why we need to try to re-invent
> a generic filehandle concept in statx when we already have a
> working and widely supported user API that provides exactly
> this functionality?

Definitely should be part of the discussion :)

But I think it _does_ need to be in statx; because:
 - we've determined that 64 bit ino_t just isn't a future-proof
   interface, we're having real problems with it today
 - statx is _the_ standard, future proofed interface for getting inode
   attributes
 - therefore, if we want userspace programmers to be using filehandles,
   instead of inode numbers, so their code isn't broken, we need to be
   providing interfaces that guide them in that direction.

Even assuming we can update all the documentation to say "filehandles
are the correct way to test inode uniqueness", you know at least half of
programmers will stick to stx_ino instead of the filehandle if the
filehandle is an extra syscall.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 14:06                                                         ` Miklos Szeredi
@ 2023-12-12 15:24                                                           ` Christian Brauner
  0 siblings, 0 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-12 15:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: David Howells, Amir Goldstein, Dave Chinner, NeilBrown,
	Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 03:06:07PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 14:48, Christian Brauner <brauner@kernel.org> wrote:
> 
> > Exposing the subvolume id in statx() is still fine imho. It's a concept
> > shared between btrfs and bcachefs and it's pretty useful for interested
> > userspace that wants to make use of these apis.
> 
> Exposing subvolume ID should be okay, as long as it's not advertised
> as a way to uniquely identify an inode.   Its use should be limited to
> finding subvolume boundaries.
> 
> > > It might help to have the fh in statx, since that's easier on the
> > > userspace programmer than having to deal with two interfaces (i_ino
> > > won't go away for some time, because of backward compatibility).
> > > OTOH I also don't like the way it would need to be shoehorned into
> > > statx.
> >
> > No, it really doesn't belong into statx().
> >
> > And besides, the file handle APIs around name_to_handle_at() are already
> > in wider use than a lot of people think. Not just for the exportfs case:
> > cgroupfs, for example, uses file handles to provide unique, comparable
> > identifiers for cgroups.
> 
> The issue with name_to_handle_at() is its use of the old, non-unique
> mount ID.  Yes, yes, we can get away with
> 
>  fd = openat(dfd, path, O_PATH);
>  name_to_handle_at(fd, "", ..., AT_EMPTY_PATH);
>  statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...);
>  close(fd);
> 
> But that's *four* syscalls instead of one...

Yeah, but putting this into statx() isn't really nice imho. Once we do
actually land the unique mount id thing it wouldn't be the worst thing
to add name_to_handle_at2().

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12  9:35                                                   ` Christian Brauner
  2023-12-12  9:42                                                     ` Miklos Szeredi
@ 2023-12-12 15:28                                                     ` Kent Overstreet
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 15:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Miklos Szeredi, David Howells, Amir Goldstein, Dave Chinner,
	NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 10:35:40AM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote:
> > On Tue, 12 Dec 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Tue, Dec 12, 2023 at 09:10:47AM +0000, David Howells wrote:
> > > > Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > > > > > I suggest:
> > > > > > > >
> > > > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > > > >                               same inode number
> > > > >
> > > > > This is just ugly with questionable value. A constant reminder of how
> > > > > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> > > >
> > > > There is an upcoming potential problem where even the 64-bit field I placed in
> > > > statx() may be insufficient.  The Auristor AFS server, for example, has a
> > > > 96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I
> > >
> > > Is that vnode ID akin to a volume? Because if so you could just
> > > piggy-back on a subvolume id field in statx() and expose it there.
> > 
> > And how would exporting a subvolume ID and expecting userspace to take
> > that into account when checking for hard links be meaningfully
> > different than expecting userspace to retrieve the file handle and
> > compare that?
> > 
> > The latter would at least be a generic solution, including stacking fs
> > inodes, instead of tackling just a specific corner of the problem
> > space.
> 
> So taking a step back here, please. The original motivation for this
> discussion was restricted to handling btrfs - and now bcachefs as well.
> Both have a concept of a subvolume so it made sense to go that route.
> IOW, it wasn't originally a generic problem or pitched as such.
> 
> Would overlayfs be able to utilize an extended inode field as well?

No, the original motivation was not just btrfs and bcachefs; overlayfs
fundamentally needs to export a bigger identifier than the host
filesystems - pigeonhole principle, if anyone remembers their
combinatorics.

This applies to any filesystem which is under the hood reexporting from
other filesystems; there's a lot of stuff going on in container
filesystems right now, and I expect it'll come up there (you can create
new identifiers if you're exporting file by file, but not if it's
directory trees).

And Neil brought up NFS re-exports; I think there's a compelling
argument to be made that we ought to be able to round trip NFS
filehandles.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:16                                               ` Kent Overstreet
@ 2023-12-12 15:29                                                 ` Christian Brauner
  2023-12-12 15:35                                                   ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: Christian Brauner @ 2023-12-12 15:29 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Amir Goldstein, Dave Chinner, NeilBrown, Donald Buczek,
	linux-bcachefs, Stefan Krueger, David Howells, linux-fsdevel,
	Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > >                               same inode number
> > 
> > This is just ugly with questionable value. A constant reminder of how
> > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> 
> Oh hell no. We finally get a reasonably productive discussion and I wake
> up to you calling things ugly and broken, with no reason or
> justification or any response to the justification that's already been
> given?
> 
> Christian, that's not how we do things. Let's not turn what was a
> productive discussion into a trainwreck, OK?

That's a very aggressive tone for no good reason. And probably not a
good way to convince anyone. I'm also not sure why you're taking a
dislike of this flag as a personal affront.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:29                                                 ` Christian Brauner
@ 2023-12-12 15:35                                                   ` Kent Overstreet
  2023-12-12 15:38                                                     ` Miklos Szeredi
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 15:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Dave Chinner, NeilBrown, Donald Buczek,
	linux-bcachefs, Stefan Krueger, David Howells, linux-fsdevel,
	Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 04:29:09PM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> > > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > >                               same inode number
> > > 
> > > This is just ugly with questionable value. A constant reminder of how
> > > broken this is. Exposing the subvolume id also makes this somewhat redundant.
> > 
> > Oh hell no. We finally get a reasonably productive discussion and I wake
> > up to you calling things ugly and broken, with no reason or
> > justification or any response to the justification that's already been
> > given?
> > 
> > Christian, that's not how we do things. Let's not turn what was a
> > productive discussion into a trainwreck, OK?
> 
> That's a major aggressive tone for no good reason. And probably not a
> good way to convince anyone. I'm also not sure why you're taking a
> dislike of this flag as a personal affront.

No, if you're going to show up with comments like that it is entirely
called for.

Other people have been finding ways to contribute to the technical
discussion; just calling things "ugly and broken" does not.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:35                                                   ` Kent Overstreet
@ 2023-12-12 15:38                                                     ` Miklos Szeredi
  2023-12-12 15:43                                                       ` Kent Overstreet
  2023-12-12 21:46                                                       ` NeilBrown
  0 siblings, 2 replies; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12 15:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:

> Other people have been finding ways to contribute to the technical
> discussion; just calling things "ugly and broken" does not.

Kent, calm down please.  We call things "ugly and broken" all the
time.  That's just an opinion, you are free to argue it, and no need
to take it personally.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:38                                                     ` Miklos Szeredi
@ 2023-12-12 15:43                                                       ` Kent Overstreet
  2023-12-12 15:57                                                         ` Miklos Szeredi
  2023-12-12 21:46                                                       ` NeilBrown
  1 sibling, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 15:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > Other people have been finding ways to contribute to the technical
> > discussion; just calling things "ugly and broken" does not.
> 
> Kent, calm down please.  We call things "ugly and broken" all the
> time.  That's just an opinion, you are free to argue it, and no need
> to take it personally.

It's an entirely subjective judgement that has no place in a discussion
where we're trying to decide things on technical merits.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:43                                                       ` Kent Overstreet
@ 2023-12-12 15:57                                                         ` Miklos Szeredi
  2023-12-12 16:08                                                           ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12 15:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 16:43, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > > Other people have been finding ways to contribute to the technical
> > > discussion; just calling things "ugly and broken" does not.
> >
> > Kent, calm down please.  We call things "ugly and broken" all the
> > time.  That's just an opinion, you are free to argue it, and no need
> > to take it personally.
>
> It's an entirely subjective judgement that has no place in a discussion
> where we're trying to decide things on technical merits.

Software is like architecture.  It must not collapse, but to function
well it also needs to look good.  That's especially relevant to APIs,
and yes, it's a matter of taste and can be very subjective.

On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely
understand Christian's reaction.  But if you think this would be a
useful feature, please state your technical argument.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:57                                                         ` Miklos Szeredi
@ 2023-12-12 16:08                                                           ` Kent Overstreet
  2023-12-12 16:30                                                             ` Miklos Szeredi
  2023-12-13  9:41                                                             ` Christian Brauner
  0 siblings, 2 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 16:08 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 04:57:41PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:43, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > > Other people have been finding ways to contribute to the technical
> > > > discussion; just calling things "ugly and broken" does not.
> > >
> > > Kent, calm down please.  We call things "ugly and broken" all the
> > > time.  That's just an opinion, you are free to argue it, and no need
> > > to take it personally.
> >
> > It's an entirely subjective judgement that has no place in a discussion
> > where we're trying to decide things on technical merits.
> 
> Software is like architecture.  It must not collapse, but to function
> well it also needs to look good.  That's especially relevant to APIs,
> and yes, it's a matter of taste and can be very subjective.

Good taste is a highly important trait - in the course of normal work we
trust our subjective opinions constantly, because those opinions have
been trained by years and years of judgements and we can't reason
through everything.

But when you show up to a discussion that's been going on for a page,
where everyone's been constructively gathering input, and you start
name-calling - and crucially, _without giving any technical justification
for your opinions_ - that's just you being a dick.

> On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely
> understand Christian's reaction.  But if you think this would be a
> useful feature, please state your technical argument.

Well, you could have scanned through the prior thread, or read the
extensive LWN coverage, but in short - inode number uniqueness is
something that's already impossible to guarantee, today.

In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
they _must_ do the new thing if they care about correctness; it provides
a way to tell userspace what guarantees we're able to provide.
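
A sketch of the consumer side - the bit value and both helpers below are
placeholders, nothing here is merged uapi:

  #define _GNU_SOURCE
  #include <sys/stat.h>

  #define STATX_ATTR_INUM_NOT_UNIQUE    0x00100000U     /* placeholder */

  void track_by_file_handle(int fd);                    /* placeholder */
  void track_by_dev_ino(const struct statx *stx);       /* placeholder */

  /* Pay for file handle tracking only when the fs says inode numbers
   * may collide; keep the cheap (dev, ino) key otherwise. */
  static void note_file(int fd, const struct statx *stx)
  {
          if (stx->stx_attributes & STATX_ATTR_INUM_NOT_UNIQUE)
                  track_by_file_handle(fd);
          else
                  track_by_dev_ino(stx);
  }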

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 16:08                                                           ` Kent Overstreet
@ 2023-12-12 16:30                                                             ` Miklos Szeredi
  2023-12-12 16:41                                                               ` Kent Overstreet
  2023-12-12 21:53                                                               ` NeilBrown
  2023-12-13  9:41                                                             ` Christian Brauner
  1 sibling, 2 replies; 92+ messages in thread
From: Miklos Szeredi @ 2023-12-12 16:30 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, 12 Dec 2023 at 17:08, Kent Overstreet <kent.overstreet@linux.dev> wrote:

> In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> they _must_ do the new thing if they care about correctness; it provides
> a way to tell userspace what guarantees we're able to provide.

That flag would not help with improving userspace software.

What would help is if said software started using a unique identifier.
We already seem to have a unique ID in the form of file handles,
though some exotic filesystems might allow more than one fh to refer
to the same inode, so this still needs some looking into.

The big problem is that we can't do a lot about existing software, and
must keep trying to keep st_ino unique for the foreseeable future.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 16:30                                                             ` Miklos Szeredi
@ 2023-12-12 16:41                                                               ` Kent Overstreet
  2023-12-12 21:53                                                               ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 16:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christian Brauner, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Tue, Dec 12, 2023 at 05:30:23PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 17:08, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> > they _must_ do the new thing if they care about correctness; it provides
> > a way to tell userspace what guarantees we're able to provide.
> 
> That flag would not help with improving userspace software.

I disagree. Just having it there and in the documentation will be a good
solid nudge to userspace programmers that this is something they need to
consider.

And: if we size this thing for NFSv4 filehandles, those are big: there's
an overhead associated with using those, since userspace generally has
to track in memory all inode numbers/file handles that it's seen.
STATX_ATTR_INUM_NOT_UNIQUE would allow userspace to avoid that overhead
when it is safe to do so.

(But remember that file handles include inode generation numbers, and
inode numbers do not; that is something we should remember to document,
and explain why it is important).
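
The comparison itself is cheap, by the way - a sketch, with the caveat
raised elsewhere in the thread that a few filesystems can emit more than
one handle for the same inode:

  #define _GNU_SOURCE
  #include <stdbool.h>
  #include <string.h>
  #include <fcntl.h>    /* struct file_handle */

  /* Two handles name the same object only if type, length and payload
   * all match; the payload is opaque, so compare it byte-wise. */
  static bool same_handle(const struct file_handle *a,
                          const struct file_handle *b)
  {
          return a->handle_type == b->handle_type &&
                 a->handle_bytes == b->handle_bytes &&
                 memcmp(a->f_handle, b->f_handle, a->handle_bytes) == 0;
  }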

> What would help is if said software started using a unique identifier.
> We already seem to have a unique ID in the form of file handles,
> though some exotic filesystems might allow more than one fh to refer
> to the same inode, so this still needs some looking into.
> 
> The big problem is that we can't do a lot about existing software, and
> must keep trying to keep st_ino unique for the foreseeable future.

Whatever hacks we try to apply to st_ino are outside the scope of this
discussion. Right now, we need to be figuring out what our future-proof
interface is going to be, so that we don't end up kicking this
can down the road when st_ino will be _really_ space constrained.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: file handle in statx
  2023-12-12 15:20                                             ` Theodore Ts'o
@ 2023-12-12 17:15                                               ` Frank Filz
  2023-12-12 17:44                                                 ` Kent Overstreet
  2023-12-12 20:59                                                 ` Dave Chinner
  2023-12-13  7:37                                               ` Donald Buczek
  1 sibling, 2 replies; 92+ messages in thread
From: Frank Filz @ 2023-12-12 17:15 UTC (permalink / raw)
  To: 'Theodore Ts'o', 'Donald Buczek'
  Cc: 'Dave Chinner', 'NeilBrown',
	'Kent Overstreet',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

> On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > On 12/12/23 06:53, Dave Chinner wrote:
> >
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a
> > > working and widely supported user API that provides exactly this
> > > functionality?
> >
> > name_to_handle_at() is fine, but userspace could profit from being
> > able to retrieve the filehandle together with the other metadata in a
> > single system call.
> 
> Can you say more?  What, specifically is the application that would
> want to do that, and is it really in such a hot path that it would be a
> user-visible improvement, let alone something that can actually be
> measured?

A user space NFS server like Ganesha could benefit from getting attributes
and file handle in a single system call.

Potentially it could also avoid some of the challenges of using
name_to_handle_at, which is a privileged operation.

Frank



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 17:15                                               ` Frank Filz
@ 2023-12-12 17:44                                                 ` Kent Overstreet
  2023-12-12 18:17                                                   ` Amir Goldstein
  2023-12-12 20:59                                                 ` Dave Chinner
  1 sibling, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 17:44 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Theodore Ts'o', 'Donald Buczek',
	'Dave Chinner', 'NeilBrown',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > On 12/12/23 06:53, Dave Chinner wrote:
> > >
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a
> > > > working and widely supported user API that provides exactly this
> > > > functionality?
> > >
> > > name_to_handle_at() is fine, but userspace could profit from being
> > > able to retrieve the filehandle together with the other metadata in a
> > > single system call.
> > 
> > Can you say more?  What, specifically is the application that would
> > want to do that, and is it really in such a hot path that it would be
> > a user-visible improvement, let alone something that can actually be
> > measured?
> 
> A user space NFS server like Ganesha could benefit from getting attributes
> and file handle in a single system call.
> 
> > Potentially it could also avoid some of the challenges of using
> > name_to_handle_at, which is a privileged operation.

which begs the question - why is name_to_handle_at() privileged?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 17:44                                                 ` Kent Overstreet
@ 2023-12-12 18:17                                                   ` Amir Goldstein
  2023-12-12 19:18                                                     ` Frank Filz
  0 siblings, 1 reply; 92+ messages in thread
From: Amir Goldstein @ 2023-12-12 18:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Frank Filz, Theodore Ts'o, Donald Buczek, Dave Chinner,
	NeilBrown, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel

On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > >
> > > > > So can someone please explain to me why we need to try to re-invent
> > > > > a generic filehandle concept in statx when we already have a
> > > > > working and widely supported user API that provides exactly this
> > > > > functionality?
> > > >
> > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > able to retrieve the filehandle together with the other metadata in a
> > > > single system call.
> > >
> > > Can you say more?  What, specifically is the application that would
> > > want to do that, and is it really in such a hot path that it would
> > > be a user-visible improvement, let alone something that can actually
> > > be measured?
> >
> > A user space NFS server like Ganesha could benefit from getting attributes
> > and file handle in a single system call.
> >
> > Potentially it could also avoid some of the challenges of using
> > name_to_handle_at, which is a privileged operation.
>
> which begs the question - why is name_to_handle_at() privileged?
>

AFAICT, it is not privileged.
Only open_by_handle_at() is privileged.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: file handle in statx
  2023-12-12 18:17                                                   ` Amir Goldstein
@ 2023-12-12 19:18                                                     ` Frank Filz
  0 siblings, 0 replies; 92+ messages in thread
From: Frank Filz @ 2023-12-12 19:18 UTC (permalink / raw)
  To: 'Amir Goldstein', 'Kent Overstreet'
  Cc: 'Theodore Ts'o', 'Donald Buczek',
	'Dave Chinner', 'NeilBrown',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

> On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet <kent.overstreet@linux.dev>
> wrote:
> >
> > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > >
> > > > > > So can someone please explain to me why we need to try to
> > > > > > re-invent a generic filehandle concept in statx when we
> > > > > > already have a working and widely supported user API that
> > > > > > provides exactly this functionality?
> > > > >
> > > > > name_to_handle_at() is fine, but userspace could profit from
> > > > > being able to retrieve the filehandle together with the other
> > > > > metadata in a single system call.
> > > >
> > > > Can you say more?  What, specifically is the application that
> > > > would want to do that, and is it really in such a hot path that
> > > > it would be a user-visible improvement, let alone something that
> > > > can actually be measured?
> > >
> > > A user space NFS server like Ganesha could benefit from getting
> > > attributes and file handle in a single system call.
> > >
> > > Potentially it could also avoid some of the challenges of using
> > > name_to_handle_at, which is a privileged operation.
> >
> > which begs the question - why is name_to_handle_at() privileged?
> >
> 
> AFAICT, it is not privileged.
> Only open_by_handle_at() is privileged.

Ah, that makes sense. I'm a consumer of these interfaces, not so much in the kernel these days.

Ganesha depends on open_by_handle_at as well, so requires privilege to do handle stuff with kernel file systems.

In any case, Ganesha could easily benefit from a savings of system calls.

What would also be handy is a read directory with attributes that gave the statx results for each entry.

Frank


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:21                                           ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
@ 2023-12-12 20:48                                             ` Dave Chinner
  2023-12-12 21:23                                               ` Kent Overstreet
  2023-12-12 22:00                                               ` NeilBrown
  0 siblings, 2 replies; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 20:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > the middle of the room?
> > 
> > I mean, we already have name_to_handle_at() for userspace to get a
> > unique, opaque, filesystem defined file handle for any given file.
> > It's the same filehandle that filesystems hand to the nfsd so nfs
> > clients can uniquely identify the file they are asking the nfsd to
> > operate on.
> > 
> > The contents of these filehandles is entirely defined by the file
> > system and completely opaque to the user. The only thing that
> > parses the internal contents of the handle is the filesystem itself.
> > Therefore, as long as the fs encodes the information it needs into the
> > handle to determine what subvol/snapshot the inode belongs to when
> > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > nothing else needs to care how it is encoded.
> > 
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > working and widely supported user API that provides exactly
> > this functionality?
> 
> Definitely should be part of the discussion :)
> 
> But I think it _does_ need to be in statx; because:
>  - we've determined that 64 bit ino_t just isn't a future proof
>    interface, we're having real problems with it today
>  - statx is _the_ standard, future proofed interface for getting inode
>    attributes

No, it most definitely isn't, and statx was never intended as a
dumping ground for anything and everything inode related. e.g. Any
inode attribute that can be modified needs to use a different
interface - one that has a corresponding "set" operation.

>  - therefore, if we want userspace programmers to be using filehandles,
>    instead of inode numbers, so their code isn't broken, we need to be
>    providing interfaces that guide them in that direction.

We already have a filehandle interface they can use for this
purpose. It is already used by some userspace applications for this
purpose.

Any new API we add to statx() will require application
changes, and the vast majority of applications aren't using statx()
directly - they are using stat(), which glibc wraps to statx()
internally. So they are going to need a change of API anyway.

So, fundamentally, there is a change of API for most applications
that need to do thorough inode uniqueness checks regardless of
anything else. They can do this right now - just continue using
stat() as they do right now, and then use name_to_handle_at()
for uniqueness checks.

> Even assuming we can update all the documentation to say "filehandles
> are the correct way to test inode uniqueness", you know at least half of
> programmers will stick to stx_ino instead of the filehandle if the
> filehandle is an extra syscall.

Your argument is "programmers suck so we must design for the
lowest common denominator". That's an -awful- way to design APIs.

Further, this "programmers suck" design comes at a cost to every
statx() call that does not need filehandles. That's the vast
majority of statx() calls that are made on a system. Why should we
slow down statx() for all users when so few applications actually
need uniqueness and they can take the cost of robust uniqueness
tests with an extra syscall entirely themselves?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-11 23:40                                 ` David Howells
@ 2023-12-12 20:59                                   ` Kent Overstreet
  2023-12-12 22:57                                     ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 20:59 UTC (permalink / raw)
  To: David Howells
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger, linux-fsdevel

On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > I was chatting a bit with David Howells on IRC about this, and floated
> > adding the file handle to statx. It looks like there's enough space
> > reserved to make this feasible - probably going with a fixed maximum
> > size of 128-256 bits.
> 
> We can always save the last bit to indicate extension space/extension record,
> so we're not that strapped for space.

So we'll need that if we want to round-trip NFSv4 filehandles; they
won't fit in the existing struct statx (NFSv4 specs 128 bytes, statx
has 96 bytes reserved).

Obvious question (Neil): do/will real-world implementations ever come
close to making use of this, or was this a "future proofing gone wild"
thing?

Say we do decide we want to spec it that large: _can_ we extend struct
statx? I'm wondering if the userspace side was thought through; I'm
sure the glibc people will have something to say.

Kernel side we can definitely extend struct statx, and we know how many
bytes to copy to userspace because we know what fields userspace
requested. The part I'm concerned about is that if we extend userspace's
struct statx, that introduces obvious ABI compatibility issues.

So this would probably force glibc to introduce a new version of struct
statx, if I'm not mistaken.

Or: another option would be to reserve something small and sane in
struct statx (32 bytes max, I'd say), and then set a flag to tell
userspace they need to use name_to_handle_at() if it didn't fit.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 17:15                                               ` Frank Filz
  2023-12-12 17:44                                                 ` Kent Overstreet
@ 2023-12-12 20:59                                                 ` Dave Chinner
  2023-12-12 21:57                                                   ` NeilBrown
  1 sibling, 1 reply; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 20:59 UTC (permalink / raw)
  To: Frank Filz
  Cc: 'Theodore Ts'o', 'Donald Buczek',
	'NeilBrown', 'Kent Overstreet',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > On 12/12/23 06:53, Dave Chinner wrote:
> > >
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a
> > > > working and widely supported user API that provides exactly this
> > > > functionality?
> > >
> > > name_to_handle_at() is fine, but userspace could profit from being
> > > able to retrieve the filehandle together with the other metadata in a
> > > single system call.
> > 
> > Can you say more?  What, specifically is the application that would
> > want to do that, and is it really in such a hot path that it would be
> > a user-visible improvement, let alone something that can actually be
> > measured?
> 
> A user space NFS server like Ganesha could benefit from getting attributes
> and file handle in a single system call.

At the cost of every other application that doesn't need those
attributes. That's not a good trade-off - the cost of a syscall is
tiny compared to the rest of the work that has to be done to create
a stable filehandle for an inode, even if we have a variant of
name_to_handle_at() that takes an open fd rather than a path.

> Potentially it could also avoid some of the challenges of using
> name_to_handle_at, which is a privileged operation.

It isn't. open_by_handle_at() is privileged because it bypasses all
the path based access checking that a normal open() does. Anyone can
get a filehandle for a path from the kernel, but few can actually
use it for anything other than file uniqueness checks....
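
(Illustrating that split - a sketch assuming a 128-byte handle buffer;
the handle fetch needs no privilege, while opening by handle wants
CAP_DAC_READ_SEARCH:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    #define HANDLE_BUF_SZ 128  /* the kernel's MAX_HANDLE_SZ */

    static int demo(const char *path, int mount_fd)
    {
            struct file_handle *fh = malloc(sizeof(*fh) + HANDLE_BUF_SZ);
            int mount_id, fd;

            if (!fh)
                    return -1;
            fh->handle_bytes = HANDLE_BUF_SZ;

            /* works for any user: fine for uniqueness checks */
            if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
                    free(fh);
                    return -1;
            }

            /* fails with EPERM without CAP_DAC_READ_SEARCH, since it
             * bypasses path-based permission checks */
            fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
            free(fh);
            return fd;
    }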

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 20:48                                             ` Dave Chinner
@ 2023-12-12 21:23                                               ` Kent Overstreet
  2023-12-12 22:10                                                 ` Dave Chinner
  2023-12-12 22:00                                               ` NeilBrown
  1 sibling, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 21:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > the middle of the room?
> > > 
> > > I mean, we already have name_to_handle_at() for userspace to get a
> > > unique, opaque, filesystem defined file handle for any given file.
> > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > clients can uniquely identify the file they are asking the nfsd to
> > > operate on.
> > > 
> > > The contents of these filehandles is entirely defined by the file
> > > system and completely opaque to the user. The only thing that
> > > parses the internal contents of the handle is the filesystem itself.
> > > Therefore, as long as the fs encodes the information it needs into the
> > > handle to determine what subvol/snapshot the inode belongs to when
> > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > nothing else needs to care how it is encoded.
> > > 
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a
> > > > working and widely supported user API that provides exactly
> > > this functionality?
> > 
> > Definitely should be part of the discussion :)
> > 
> > But I think it _does_ need to be in statx; because:
> >  - we've determined that 64 bit ino_t just isn't a future proof
> >    interface, we're having real problems with it today
> >  - statx is _the_ standard, future proofed interface for getting inode
> >    attributes
> 
> No, it most definitely isn't, and statx was never intended as a
> dumping ground for anything and everything inode related. e.g. Any
> inode attribute that can be modified needs to use a different
> interface - one that has a corresponding "set" operation.

And here I thought the whole point of statx was to be an extensible way
to request any sort of inode attribute.

> >  - therefore, if we want userspace programmers to be using filehandles,
> > >    instead of inode numbers, so their code isn't broken, we need to be
> >    providing interfaces that guide them in that direction.
> 
> We already have a filehandle interface they can use for this
> purpose. It is already used by some userspace applications for this
> purpose.
> 
> > Any new API we add to statx() will require application
> > changes, and the vast majority of applications aren't using statx()
> > directly - they are using stat(), which glibc wraps to statx()
> > internally. So they are going to need a change of API anyway.
> 
> So, fundamentally, there is a change of API for most applications
> that need to do thorough inode uniqueness checks regardless of
> anything else. They can do this right now - just continue using
> > stat() as they do right now, and then use name_to_handle_at()
> for uniqueness checks.
> 
> > Even assuming we can update all the documentation to say "filehandles
> > are the correct way to test inode uniqueness", you know at least half of
> > programmers will stick to stx_ino instead of the filehandle if the
> > filehandle is an extra syscall.
> 
> Your argument is "programmers suck so we must design for the
> lowest common denominator". That's an -awful- way to design APIs.

No, I'm saying if the old way of doing things no longer works, we ought
to make the new future-proofed way as ergonomic and easy to use as the
old way was - else it won't get used.

At the _very_ least we need to add a flag to statx for "inode number is
unreliable for uniqueness checks".

bcachefs could leave this off until the first snapshot has been taken.

But even with that option, I think we ought to be telling anyone doing
uniqueness checks to use the filehandle, because it includes i_generation.

> Further, this "programmers suck" design comes at a cost to every
> statx() call that does not need filehandles. That's the vast
> majority of statx() calls that are made on a system. Why should we
> slow down statx() for all users when so few applications actually
> need uniqueness and they can take the cost of robust uniqueness
> tests with an extra syscall entirely themselves?

For any local filesystem the filehandle is going to be the inode
generation number, the subvolume ID (if applicable), and the inode
number. That's 16 bytes on bcachefs, and if we make an attempt to
standardize how this works across filesystems we should be able to do it
without adding a new indirect function call to the statx() path. That
sounds pretty negligible to me.
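
(Something like the following 16-byte layout, to be clear - the field
names here are illustrative, not bcachefs's actual encoding:)

    #include <linux/types.h>

    struct local_fh {
            __u64   ino;         /* inode number */
            __u32   subvol;      /* subvolume/snapshot ID; 0 if n/a */
            __u32   generation;  /* inode generation number */
    };                           /* 16 bytes total */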

The syscall overhead isn't going to be negligible when an application
has to do lots of scanning to find the files it's interested in - rsync.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 15:38                                                     ` Miklos Szeredi
  2023-12-12 15:43                                                       ` Kent Overstreet
@ 2023-12-12 21:46                                                       ` NeilBrown
  2023-12-13  9:47                                                         ` Christian Brauner
  1 sibling, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-12 21:46 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kent Overstreet, Christian Brauner, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > Other poeple have been finding ways to contribute to the technical
> > discussion; just calling things "ugly and broken" does not.
> 
> Kent, calm down please.  We call things "ugly and broken" all the
> time.  That's just an opinion, you are free to argue it, and no need
> to take it personally.

But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
is unpleasant to look at and why.  Or what exactly causes poor
functionality.
"ugly" and "broken" are not particularly useful words in a technical
discussion.  I understand people want to use them, but they really need
to be backed up with details.  It is details that matter.

NeilBrown


> 
> Thanks,
> Miklos
> 
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 16:30                                                             ` Miklos Szeredi
  2023-12-12 16:41                                                               ` Kent Overstreet
@ 2023-12-12 21:53                                                               ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-12 21:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kent Overstreet, Christian Brauner, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 17:08, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> > they _must_ do the new thing if they care about correctness; it provides
> > a way to tell userspace what guarantees we're able to provide.
> 
> That flag would not help with improving userspace software.

Are you sure?

Suppose I wanted to export a filesystem using some protocol (maybe one
called "NFSv4"), and suppose this protocol supported the communication
of an attribute called "fileid" which was optional but required to be
fully unique if provided at all.

If I had access to STATX_ATTR_INUM_NOT_UNIQUE, then I could avoid
exporting the fileid when it didn't meet the protocol requirements, but
could export it when it did.

This may not be a strong case for the inclusion of the flag, but it is,
I think, a clear indication that "would not help" is what our fact
checkers would call "over-reach".

NeilBrown

> 
> What would help, if said software started using a unique identifier.
> We already seem to have a unique ID in the form of file handles,
> though some exotic filesystems might allow more than one fh to refer
> to the same inode, so this still needs some looking into.
> 
> The big problem is that we can't do a lot about existing software, and
> must keep trying to keep st_ino unique for the foreseeable future.
> 
> Thanks,
> Miklos
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 20:59                                                 ` Dave Chinner
@ 2023-12-12 21:57                                                   ` NeilBrown
  2023-12-12 22:23                                                     ` Dave Chinner
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-12 21:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Frank Filz, 'Theodore Ts'o', 'Donald Buczek',
	'Kent Overstreet',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > >
> > > > > So can someone please explain to me why we need to try to re-invent
> > > > > a generic filehandle concept in statx when we already have a
> > > > > working and widely supported user API that provides exactly this
> > > > > functionality?
> > > >
> > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > able to retrieve the filehandle together with the other metadata in a
> > > > single system call.
> > > 
> > > Can you say more?  What, specifically is the application that would
> > > want to do that, and is it really in such a hot path that it would
> > > be a user-visible improvement, let alone something that can actually
> > > be measured?
> > 
> > A user space NFS server like Ganesha could benefit from getting attributes
> > and file handle in a single system call.
> 
> At the cost of every other application that doesn't need those
> attributes.

Why do you think there would be a cost?

statx only returns the attributes that it was asked for.  If an
application doesn't ask for STATX_HANDLE, it doesn't get STATX_HANDLE
and (providing filesystems honour the flag) doesn't pay the price for
generating it (which is often trivial, but may not always be).

NeilBrown


>               That's not a good trade-off - the cost of a syscall is
> tiny compared to the rest of the work that has to be done to create
> a stable filehandle for an inode, even if we have a variant of
> name_to_handle_at() that takes an open fd rather than a path.
> 
> > Potentially it could also avoid some of the challenges of using
> > name_to_handle_at, which is a privileged operation.
> 
> It isn't. open_by_handle_at() is privileged because it bypasses all
> the path based access checking that a normal open() does. Anyone can
> get a filehandle for a path from the kernel, but few can actually
> use it for anything other than file uniqueness checks....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 20:48                                             ` Dave Chinner
  2023-12-12 21:23                                               ` Kent Overstreet
@ 2023-12-12 22:00                                               ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-12 22:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > the middle of the room?
> > > 
> > > I mean, we already have name_to_handle_at() for userspace to get a
> > > unique, opaque, filesystem defined file handle for any given file.
> > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > clients can uniquely identify the file they are asking the nfsd to
> > > operate on.
> > > 
> > > The contents of these filehandles is entirely defined by the file
> > > system and completely opaque to the user. The only thing that
> > > parses the internal contents of the handle is the filesystem itself.
> > > Therefore, as long as the fs encodes the information it needs into the
> > > handle to determine what subvol/snapshot the inode belongs to when
> > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > nothing else needs to care how it is encoded.
> > > 
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a
> > > working and widely supported user API that provides exactly
> > > this functionality?
> > 
> > Definitely should be part of the discussion :)
> > 
> > But I think it _does_ need to be in statx; because:
> >  - we've determined that 64 bit ino_t just isn't a future proof
> >    interface, we're having real problems with it today
> >  - statx is _the_ standard, future proofed interface for getting inode
> >    attributes
> 
> No, it most definitely isn't, and statx was never intended as a
> dumping ground for anything and everything inode related. e.g. Any
> inode attribute that can be modified needs to use a different
> interface - one that has a corresponding "set" operation.

stx_size, stx_mtime, stx_atime, stx_mode, stx_owner, stx_group,
stx_nlink, ....

These can all be modified but don't have matched get and set operations.

stx_handle, of course, cannot be modified.

I think I must be completely misunderstanding you, because the way I
read your words, they don't make any sense at all.  Help!

NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 21:23                                               ` Kent Overstreet
@ 2023-12-12 22:10                                                 ` Dave Chinner
  2023-12-12 22:31                                                   ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 22:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Tue, Dec 12, 2023 at 04:23:06PM -0500, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote:
> > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > > the middle of the room?
> > > > 
> > > > I mean, we already have name_to_handle_at() for userspace to get a
> > > > unique, opaque, filesystem defined file handle for any given file.
> > > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > > clients can uniquely identify the file they are asking the nfsd to
> > > > operate on.
> > > > 
> > > > The contents of these filehandles is entirely defined by the file
> > > > system and completely opaque to the user. The only thing that
> > > > parses the internal contents of the handle is the filesystem itself.
> > > > Therefore, as long as the fs encodes the information it needs into the
> > > > handle to determine what subvol/snapshot the inode belongs to when
> > > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > > nothing else needs to care how it is encoded.
> > > > 
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a
> > > > working and widely supported user API that provides exactly
> > > > this functionality?
> > > 
> > > Definitely should be part of the discussion :)
> > > 
> > > But I think it _does_ need to be in statx; because:
> > >  - we've determined that 64 bit ino_t just isn't a future proof
> > >    interface, we're having real problems with it today
> > >  - statx is _the_ standard, future proofed interface for getting inode
> > >    attributes
> > 
> > No, it most definitely isn't, and statx was never intended as a
> > dumping ground for anything and everything inode related. e.g. Any
> > inode attribute that can be modified needs to use a different
> > interface - one that has a corresponding "set" operation.
> 
> And here I thought the whole point of statx was to be an extensible way
> to request any sort of inode attribute.

Within reason. When a single attribute is dynamically sized, can double
the size of the statx structure, and already has an existing syscall to
retrieve it from the filesystem, it makes no sense *at all* to
duplicate that information in statx().

statx() is for new attributes we don't have any other way of exposing to
userspace easily. It is designed for fixed-size attributes and
flags; it is not designed for dynamically sized information to be
embedded in struct statx.

Filehandles are not new, and we already have APIs that expose them
to userspace that handle the dynamic size of filehandles just fine.
This functionality simply does not belong in statx() at all.

> > >  - therefore, if we want userspace programmers to be using filehandles,
> > >    instead of inode numbers, so their code isn't broken, we need to be
> > >    providing interfaces that guide them in that direction.
> > 
> > We already have a filehandle interface they can use for this
> > purpose. It is already used by some userspace applications for this
> > purpose.
> > 
> > Any new API we add to statx() will require application
> > changes, and the vast majority of applications aren't using statx()
> > directly - they are using stat(), which glibc wraps to statx()
> > internally. So they are going to need a change of API anyway.
> > 
> > So, fundamentally, there is a change of API for most applications
> > that need to do thorough inode uniqueness checks regardless of
> > anything else. They can do this right now - just continue using
> > stat() as they do right now, and then use name_to_handle_at()
> > for uniqueness checks.
> > 
> > > Even assuming we can update all the documentation to say "filehandles
> > > are the correct way to test inode uniqueness", you know at least half of
> > > programmers will stick to stx_ino instead of the filehandle if the
> > > filehandle is an extra syscall.
> > 
> > Your argument is "programmers suck so we must design for the
> > lowest common denominator". That's an -awful- way to design APIs.
> 
> No, I'm saying if the old way of doing things no longer works, we ought
> to make the new future-proofed way as ergonomic and easy to use as the
> old way was - else it won't get used.
> 
> At the _very_ least we need to add a flag to statx for "inode number is
> unreliable for uniqueness checks".
> 
> bcachefs could leave this off until the first snapshot has been taken.
> 
> But even with that option, I think we ought to be telling anyone doing
> uniqueness checks to use the filehandle, because it includes i_generation.
> 
> > Further, this "programmers suck" design comes at a cost to every
> > statx() call that does not need filehandles. That's the vast
> > majority of statx() calls that are made on a system. Why should we
> > slow down statx() for all users when so few applications actually
> > need uniqueness and they can take the cost of robust uniqueness
> > tests with an extra syscall entirely themselves?
> 
> For any local filesystem the filehandle is going to be the inode
> generation number, the subvolume ID (if applicable), and the inode
> number. That's 16 bytes on bcachefs, and if we make an attempt to
> standardize how this works across filesystems we should be able to do it
> without adding a new indirect function call to the statx() path. That
> sounds pretty negligible to me.

Filehandle encoding is already standardised.

exportfs_encode_fh() does everything needed already, including
providing the default EXPORT_FH_FID behaviour
(exportfs_encode_ino64_fid()) for filesystems that don't otherwise
support filehandles.

e.g. this will call into bcachefs and run bch2_encode_fh() to
calculate the filehandle for bcachefs.  It will call into btrfs
and run btrfs_encode_fh() to encode the file handle. The
infrastructure for encoding filehandles in a consistent, common
manner across all filesystems in the kernel is already present.
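
(Roughly, the common path looks like this - a sketch, not verbatim
kernel code; note that max_len is counted in 32-bit words, and the
return convention has varied across kernel versions:)

    #include <linux/exportfs.h>

    /* Encode an fs-defined handle for @dentry into @fid. */
    static int encode_fh_generic(struct dentry *dentry, struct fid *fid)
    {
            int max_len = MAX_HANDLE_SZ / 4;  /* dwords, not bytes */
            int type;

            /*
             * Calls the filesystem's export_ops->encode_fh() if it has
             * one (bch2_encode_fh, btrfs_encode_fh, ...); otherwise the
             * EXPORT_FH_FID default, exportfs_encode_ino64_fid(), is
             * used.
             */
            type = exportfs_encode_fh(dentry, fid, &max_len, EXPORT_FH_FID);
            if (type < 0)
                    return type;
            /* @type names the encoding; @max_len is the length used */
            return 0;
    }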

What you are suggesting is that we now duplicate filehandle encoding
into every filesystem's statx() implementation.  That's a bad
trade-off from a maintenance, testing and consistency POV because
now we end up with lots of individual filehandle-encoding
implementations in addition to the generic filehandle
infrastructure that we all have to test and validate.

> The syscall overhead isn't going to be negligible when an application
> has to do lots of scanning to find the files it's interested in - rsync.

Similarly, the extra syscall overhead for calling
name_to_handle_at() will be negligible for applications that need
robust inode uniqueness detection....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 21:57                                                   ` NeilBrown
@ 2023-12-12 22:23                                                     ` Dave Chinner
  2023-12-12 22:36                                                       ` NeilBrown
  2023-12-12 22:39                                                       ` Kent Overstreet
  0 siblings, 2 replies; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 22:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Frank Filz, 'Theodore Ts'o', 'Donald Buczek',
	'Kent Overstreet',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Dave Chinner wrote:
> > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > >
> > > > > > So can someone please explain to me why we need to try to re-invent
> > > > > > a generic filehandle concept in statx when we already have a
> > > > > > working and widely supported user API that provides exactly this
> > > > > > functionality?
> > > > >
> > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > able to retrieve the filehandle together with the other metadata in a
> > > > > single system call.
> > > > 
> > > > Can you say more?  What, specifically is the application that
> > > > would want to do that, and is it really in such a hot path that
> > > > it would be a user-visible improvement, let alone something that
> > > > can actually be measured?
> > > 
> > > A user space NFS server like Ganesha could benefit from getting attributes
> > > and file handle in a single system call.
> > 
> > At the cost of every other application that doesn't need those
> > attributes.
> 
> Why do you think there would be a cost?

It's as much maintenance and testing cost as it is a runtime cost.
We have to test and check this functionality works as advertised,
and we have to maintain that in working order forever more. That's
not free, especially if it is decided that the implementation needs
to be hyper-optimised in each individual filesystem because of
performance cost reasons.

Indeed, even the runtime "do we need to fetch this information"
checks have a measurable cost, especially as statx() is a very hot
kernel path. We've been optimising branches out of things like
setting up kiocbs because when that path is taken millions of times
every second each logic branch that decides if something needs to be
done or not has a direct measurable cost. statx() is a hot path that
can be called millions of times a second.....

And then comes the cost of encoding dynamically sized information in
struct statx - filehandles are not fixed size - and statx is most
definitely not set up or intended for dynamically sized attribute
data. This adds more complexity to statx because it wasn't designed
or intended to handle dynamically sized attributes. Optional
attributes, yes, but not attributes that might vary in size from fs
to fs or even inode type to inode type within a filesystem (e.g. dir
filehandles can, optionally, encode the parent inode in them).

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 22:10                                                 ` Dave Chinner
@ 2023-12-12 22:31                                                   ` NeilBrown
  2023-12-12 23:06                                                     ` Dave Chinner
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-12 22:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, 13 Dec 2023, Dave Chinner wrote:
> 
> What you are suggesting is that we now duplicate filehandle encoding
> into every filesystem's statx() implementation.  That's a bad
> trade-off from a maintenance, testing and consistency POV because
> now we end up with lots of individual filehandle-encoding
> implementations in addition to the generic filehandle
> infrastructure that we all have to test and validate.

Not correct.  We are suggesting an interface, not an implementation.
Here you are proposing a suboptimal implementation, pointing out its
weakness, and suggesting that has consequences for the interface
proposal.  Is that the strawman fallacy?

vfs_getattr_nosec could, after calling i_op->getattr, check if
STATX_HANDLE is set in request_mask but not in ->result_mask.
If so it could call exportfs_encode_fh() and handle the result.

No filesystem would need to be changed.
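
(i.e. something along these lines - a sketch only, where STATX_HANDLE
and the fh_buf/fh_type/fh_bytes fields in struct kstat are the proposed
additions, not current kernel code:)

    /* after the i_op->getattr() call in vfs_getattr_nosec(): */
    if ((request_mask & STATX_HANDLE) &&
        !(stat->result_mask & STATX_HANDLE)) {
            int max_len = sizeof(stat->fh_buf) / 4;    /* in dwords */
            int type = exportfs_encode_fh(path->dentry,
                            (struct fid *)stat->fh_buf,
                            &max_len, EXPORT_FH_FID);

            if (type >= 0) {
                    stat->fh_type  = type;
                    stat->fh_bytes = max_len * 4;
                    stat->result_mask |= STATX_HANDLE;
            }
    }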

NeilBrown

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 22:23                                                     ` Dave Chinner
@ 2023-12-12 22:36                                                       ` NeilBrown
  2023-12-12 22:39                                                       ` Kent Overstreet
  1 sibling, 0 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-12 22:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Frank Filz, 'Theodore Ts'o', 'Donald Buczek',
	'Kent Overstreet',
	linux-bcachefs, 'Stefan Krueger', 'David Howells',
	linux-fsdevel

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > >
> > > > > > > So can someone please explain to me why we need to try to re-invent
> > > > > > > a generic filehandle concept in statx when we already have a
> > > > > > > working and widely supported user API that provides exactly this
> > > > > > > functionality?
> > > > > >
> > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > able to retrieve the filehandle together with the other metadata in a
> > > > > > single system call.
> > > > > 
> > > > > Can you say more?  What, specifically is the application that
> > > > > would want to do that, and is it really in such a hot path that
> > > > > it would be a user-visible improvement, let alone something that
> > > > > can actually be measured?
> > > > 
> > > > A user space NFS server like Ganesha could benefit from getting attributes
> > > > and file handle in a single system call.
> > > 
> > > At the cost of every other application that doesn't need those
> > > attributes.
> > 
> > Why do you think there would be a cost?
> 
> It's as much maintenance and testing cost as it is a runtime cost.

You are moving the goal posts.  Maintenance is not a cost borne by
"every other application".

But if you think it is a concern it is certainly worth mentioning.

> We have to test and check this functionality works as advertised,
> and we have to maintain that in working order forever more. That's
> not free, especially if it is decided that the implementation needs
> to be hyper-optimised in each individual filesystem because of
> performance cost reasons.
> 
> Indeed, even the runtime "do we need to fetch this information"
> checks have a measurable cost, especially as statx() is a very hot
> kernel path. We've been optimising branches out of things like
> setting up kiocbs because when that path is taken millions of times
> every second each logic branch that decides if something needs to be
> done or not has a direct measurable cost. statx() is a hot path that
> can be called millions of times a second.....
> 
> And then comes the cost of encoding dynamically sized information in
> struct statx - filehandles are not fixed size - and statx is most
> definitely not set up or intended for dynamically sized attribute
> data. This adds more complexity to statx because it wasn't designed
> or intended to handle dynamically sized attributes. Optional
> attributes, yes, but not attributes that might vary in size from fs
> to fs or even inode type to inode type within a filesystem (e.g. dir
> filehandles can, optionally, encode the parent inode in them).

Filehandles are fixed-size.  They are one integer in the range 0-128
plus 128 bytes.  We know/promise that when the integer is less than 128,
the trailing bytes of the 128 will all be zero.
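
(i.e. a fixed framing along these lines - hypothetical field names:)

    #include <linux/types.h>

    struct statx_fh {
            __u32   handle_bytes;    /* the integer: 0..128 bytes used */
            __u8    f_handle[128];   /* zero-padded past handle_bytes */
    };                               /* fixed size, no dynamic framing */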

NeilBrown


> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 22:23                                                     ` Dave Chinner
  2023-12-12 22:36                                                       ` NeilBrown
@ 2023-12-12 22:39                                                       ` Kent Overstreet
  2023-12-12 23:44                                                         ` Dave Chinner
  1 sibling, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 22:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Frank Filz, 'Theodore Ts'o',
	'Donald Buczek', linux-bcachefs, 'Stefan Krueger',
	'David Howells',
	linux-fsdevel

On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > >
> > > > > > > So can someone please explain to me why we need to try to re-invent
> > > > > > > a generic filehandle concept in statx when we already have a
> > > > > > > working and widely supported user API that provides exactly this
> > > > > > > functionality?
> > > > > >
> > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > able to retrieve the filehandle together with the other metadata in a
> > > > > > single system call.
> > > > > 
> > > > > Can you say more?  What, specifically is the application that
> > > > > would want to do that, and is it really in such a hot path that
> > > > > it would be a user-visible improvement, let alone something that
> > > > > can actually be measured?
> > > > 
> > > > A user space NFS server like Ganesha could benefit from getting attributes
> > > > and file handle in a single system call.
> > > 
> > > At the cost of every other application that doesn't need those
> > > attributes.
> > 
> > Why do you think there would be a cost?
> 
> It's as much maintenance and testing cost as it is a runtime cost.
> We have to test and check this functionality works as advertised,
> and we have to maintain that in working order forever more. That's
> not free, especially if it is decided that the implementation needs
> to be hyper-optimised in each individual filesystem because of
> performance cost reasons.
> 
> Indeed, even the runtime "do we need to fetch this information"
> checks have a measurable cost, especially as statx() is a very hot
> kernel path. We've been optimising branches out of things like
> setting up kiocbs because when that path is taken millions of times
> every second each logic branch that decides if something needs to be
> done or not has a direct measurable cost. statx() is a hot path that
> can be called millions of times a second.....

Like Neil mentioned, we won't even be fetching the fh if it wasn't
explicitly requested - and like I mentioned, we can avoid the
.encode_fh() call for local filesystems with a bit of work at the VFS
layer.

OTOH, when you're running rsync in incremental mode and detecting
hardlinks, your point that "statx can be called millions of times per
second" would apply just as much to the additional name_to_handle_at()
call - we'd be nearly doubling the syscall overhead for scanning files
that don't need to be sent.

> And then comes the cost of encoding dynamically sized information in
> struct statx - filehandles are not fixed size - and statx is most
> definitely not set up or intended for dynamically sized attribute
> data. This adds more complexity to statx because it wasn't designed
> or intended to handle dynamically sized attributes. Optional
> attributes, yes, but not attributes that might vary in size from fs
> to fs or even inode type to inode type within a fileystem (e.g. dir
> filehandles can, optionally, encode the parent inode in them).

Since it looks like expanding statx is not going to be quite as easy as
hoped, I proposed elsewhere in the thread that we reserve a smaller
fixed size in statx (32 bytes) and set a flag if it won't fit,
indicating that userspace needs to fall back to name_to_handle_at().

Stuffing a _dynamically_ sized attribute into statx would indeed be
painful - I believe we were always talking about a fixed-size buffer in
statx; the discussion's been over how big it needs to be...
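
(Concretely, the fallback proposal is something like this - all names
hypothetical:)

    #include <linux/types.h>

    struct statx {
            /* ... existing fields ... */
            __u8    stx_handle[32];  /* the fh, when it fits */
    };

    /* plus a new stx_attributes bit (position TBD) meaning "handle
     * exceeded 32 bytes; use name_to_handle_at() instead" */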

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 20:59                                   ` Kent Overstreet
@ 2023-12-12 22:57                                     ` NeilBrown
  2023-12-12 23:43                                       ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-12 22:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: David Howells, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel

On Wed, 13 Dec 2023, Kent Overstreet wrote:
> On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > > I was chatting a bit with David Howells on IRC about this, and floated
> > > adding the file handle to statx. It looks like there's enough space
> > > reserved to make this feasible - probably going with a fixed maximum
> > > size of 128-256 bits.
> > 
> > We can always save the last bit to indicate extension space/extension record,
> > so we're not that strapped for space.
> 
> So we'll need that if we want to round-trip NFSv4 filehandles; they
> won't fit in the existing struct statx (NFSv4 specs 128 bytes, statx
> has 96 bytes reserved).
> 
> Obvious question (Neil): do/will real-world implementations ever come
> close to making use of this, or was this a "future proofing gone wild"
> thing?

I have no useful data.  I have seen lots of filehandles but I don't pay
much attention to their length.  Certainly some are longer than 32 bytes.

> 
> Say we do decide we want to spec it that large: _can_ we extend struct
> statx? I'm wondering if the userspace side was thought through, I'm
> sure glibc people will have something to say.

The man page says:

     Therefore, do not simply set mask to UINT_MAX (all bits set), as
     one or more bits may, in the future, be used to specify an
     extension to the buffer.

I suspect the glibc people read that.

NeilBrown




> 
> Kernel side we can definitely extend struct statx, and we know how many
> bytes to copy to userspace because we know what fields userspace
> requested. The part I'm concerned about is that if we extend userspace's
> struct statx, that introduces obvious ABI compatibility issues.
> 
> So this would probably force glibc to introduce a new version of struct
> statx, if I'm not mistaken.
> 
> Or: another option would be to reserve something small and sane in
> struct statx (32 bytes max, I'd say), and then set a flag to tell
> userspace they need to use name_to_handle_at() if it didn't fit.
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 22:31                                                   ` NeilBrown
@ 2023-12-12 23:06                                                     ` Dave Chinner
  2023-12-12 23:42                                                       ` Kent Overstreet
  2023-12-13  0:03                                                       ` NeilBrown
  0 siblings, 2 replies; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 23:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Dave Chinner wrote:
> > 
> > What you are suggesting is that we now duplicate filehandle encoding
> > into every filesystem's statx() implementation.  That's a bad
> > trade-off from a maintenance, testing and consistency POV because
> > now we end up with lots of individual, filehandle encoding
> > implementations in addition to the generic filehandle
> > infrastructure that we all have to test and validate.
> 
> Not correct.  We are suggesting an interface, not an implementation.
> Here you are proposing a suboptimal implementation, pointing out its
> weakness, and suggesting that has consequences for the interface
> proposal.  Is that the strawman fallacy?

No, you simply haven't followed deep enough into the rabbit hole to
understand Kent was suggesting potential implementation details to
address hot path performance concerns with filehandle encoding.

> vfs_getattr_nosec could, after calling i_op->getattr, check if
> STATX_HANDLE is set in request_mask but not in ->result_mask.
> If so it could call exportfs_encode_fh() and handle the result.
>
> No filesystem would need to be changed.

Well, yes, it's pretty damn obvious that is exactly what I've been
advocating for here - if we are going to put filehandles in statx(),
then it must use the same infrastructure as name_to_handle_at().
i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
filehandle.

The important discussion detail you've missed about
exportfs_encode_fh() is that it *requires* adding a new indirect
call (via export_ops->encode_fh) in the statx path to encode the
filehandle, and that's exactly what Kent was suggesting we can code
the implementation to avoid.

Avoiding an indirect function call is an implementation detail, not
an interface design requirement.

And the only way to avoid adding new indirect calls to encoding
filesystem specific filehandles is to implement the encoding in the
existing individual filesystem i_op->getattr methods. i.e. duplicate
the filehandle encoding in the statx path rather than use
exportfs_encode_fh().....
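
For concreteness, the generic VFS-level fallback sketched in the quote
above might look roughly like this (hypothetical: STATX_HANDLE and the
kstat handle fields below do not exist today; exportfs_encode_fh() does):

    static void statx_maybe_encode_fh(const struct path *path,
                                      struct kstat *stat, u32 request_mask)
    {
            int max_len = MAX_HANDLE_SZ / 4;        /* in 4-byte words */
            int type;

            /* Only if userspace asked and the fs didn't fill it in: */
            if (!(request_mask & STATX_HANDLE) ||
                (stat->result_mask & STATX_HANDLE))
                    return;

            type = exportfs_encode_fh(path->dentry,
                                      (struct fid *)stat->handle,
                                      &max_len, EXPORT_FH_FID);
            if (type >= 0) {
                    stat->handle_type = type;
                    stat->result_mask |= STATX_HANDLE;
            }
    }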

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 23:06                                                     ` Dave Chinner
@ 2023-12-12 23:42                                                       ` Kent Overstreet
  2023-12-13  0:03                                                       ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 23:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, Dec 13, 2023 at 10:06:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > 
> > > What you are suggesting is that we now duplicate filehandle encoding
> > > into every filesystem's statx() implementation.  That's a bad
> > > trade-off from a maintenance, testing and consistency POV because
> > > now we end up with lots of individual, filehandle encoding
> > > implementations in addition to the generic filehandle
> > > infrastructure that we all have to test and validate.
> > 
> > Not correct.  We are suggesting an interface, not an implementation.
> > Here you are proposing a suboptimal implementation, pointing out its
> > weakness, and suggesting that has consequences for the interface
> > proposal.  Is that the strawman fallacy?
> 
> No, you simply haven't followed deep enough into the rabbit hole to
> understand Kent was suggesting potential implementation details to
> address hot path performance concerns with filehandle encoding.
> 
> > vfs_getattr_nosec could, after calling i_op->getattr, check if
> > STATX_HANDLE is set in request_mask but not in ->result_mask.
> > If so it could call exportfs_encode_fh() and handle the result.
> >
> > No filesystem need to be changed.
> 
> Well, yes, it's pretty damn obvious that is exactly what I've been
> advocating for here - if we are going to put filehandles in statx(),
> then it must use the same infrastructure as name_to_handle_at().
> i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
> filehandle.
> 
> The important discussion detail you've missed about
> exportfs_encode_fh() is that it *requires* adding a new indirect
> call (via export_ops->encode_fh) in the statx path to encode the
> filehandle, and that's exactly what Kent was suggesting we can code
> the implementation to avoid.
> 
> Avoiding an indirect function call is an implementation detail, not
> an interface design requirement.
> 
> And the only way to avoid adding new indirect calls to encoding
> filesystem specific filehandles is to implement the encoding in the
> existing individual filesystem i_op->getattr methods. i.e. duplicate
> the filehandle encoding in the statx path rather than use
> exportfs_encode_fh().....

I was thinking along the lines of coming up with a common fh type for
local filesystems (why exactly do we need 15?) and adding a volume ID to
the VFS inode so this could live entirely in VFS code for most
filesystems, but that's an option too.

Might be the best one, since btrfs and bcachefs actually do want a
different fh type (btrfs: 64 bit subvol, 64 bit ino, bcachefs: 64 bit
ino, 32 bit subvol, 32 bit generation), and we don't want to generate a
bigger fh than necessary for when it's being consumed by a stacking
filesystem that has to generate a new fh by concatenating something.
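
For illustration, the two layouts would look something like this (field
names here are made up; the real btrfs layout is struct btrfs_fid in
fs/btrfs/export.h):

    #include <linux/types.h>

    /* btrfs-style: 64-bit ino, 64-bit subvol */
    struct fh_btrfs_style {
            __u64 objectid;                 /* inode number */
            __u64 root_objectid;            /* subvolume */
    };

    /* bcachefs-style: 64-bit ino, 32-bit subvol, 32-bit generation */
    struct fh_bcachefs_style {
            __u64 inum;
            __u32 subvol;
            __u32 generation;
    };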

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 22:57                                     ` NeilBrown
@ 2023-12-12 23:43                                       ` Kent Overstreet
  2023-12-13  0:02                                         ` NeilBrown
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-12 23:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel

On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
> > > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > 
> > > > I was chatting a bit with David Howells on IRC about this, and floated
> > > > adding the file handle to statx. It looks like there's enough space
> > > > reserved to make this feasible - probably going with a fixed maximum
> > > > size of 128-256 bits.
> > > 
> > > We can always save the last bit to indicate extension space/extension record,
> > > so we're not that strapped for space.
> > 
> > So we'll need that if we want to round trip NFSv4 filehandles, they
> > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > bytes reserved).
> > 
> > Obvious question (Neil): do/will real world implementations ever come
> > close to making use of this, or was this a "future proofing gone wild"
> > thing?
> 
> I have no useful data.  I have seen lots of filehandles but I don't pay
> much attention to their length.  Certainly some are longer than 32 bytes.
> 
> > 
> > Say we do decide we want to spec it that large: _can_ we extend struct
> > statx? I'm wondering if the userspace side was thought through, I'm
> > sure glibc people will have something to say.
> 
> The man page says:
> 
>      Therefore, do not simply set mask to UINT_MAX (all bits set), as
>      one or more bits may, in the future, be used to specify an
>      extension to the buffer.
> 
> I suspect the glibc people read that.

The trouble is that C has no notion of which types are safe to pass
across a dynamic library boundary, so if we increase the size of struct
statx and someone's doing that, things will break in nasty ways.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 22:39                                                       ` Kent Overstreet
@ 2023-12-12 23:44                                                         ` Dave Chinner
  2023-12-13  0:00                                                           ` Kent Overstreet
  0 siblings, 1 reply; 92+ messages in thread
From: Dave Chinner @ 2023-12-12 23:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, Frank Filz, 'Theodore Ts'o',
	'Donald Buczek', linux-bcachefs, 'Stefan Krueger',
	'David Howells',
	linux-fsdevel

On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote:
> > On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > > >
> > > > > > > > So can someone please explain to me why we need to try to re-invent
> > > > > > > > a generic filehandle concept in statx when we already have a have
> > > > > > > > working and widely supported user API that provides exactly this
> > > > > > > > functionality?
> > > > > > >
> > > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > > able to retrieve the filehandle together with the other metadata in a
> > > > > > > single system call.
> > > > > > 
> > > > > > Can you say more?  What, specifically is the application that would want
> > > > > to do
> > > > > > that, and is it really in such a hot path that it would be a user-visible
> > > > > > improvement, let alone something that can actually be measured?
> > > > > 
> > > > > A user space NFS server like Ganesha could benefit from getting attributes
> > > > > and file handle in a single system call.
> > > > 
> > > > At the cost of every other application that doesn't need those
> > > > attributes.
> > > 
> > > Why do you think there would be a cost?
> > 
> > It's as much maintenance and testing cost as it is a runtime cost.
> > We have to test and check this functionality works as advertised,
> > and we have to maintain that in working order forever more. That's
> > not free, especially if it is decided that the implementation needs
> > to be hyper-optimised in each individual filesystem because of
> > performance cost reasons.
> > 
> > Indeed, even the runtime "do we need to fetch this information"
> > checks have a measurable cost, especially as statx() is a very hot
> > kernel path. We've been optimising branches out of things like
> > setting up kiocbs because when that path is taken millions of times
> > every second each logic branch that decides if something needs to be
> > done or not has a direct measurable cost. statx() is a hot path that
> > can be called millions of times a second.....
> 
> Like Neil mentioned we won't even be fetching the fh if it wasn't
> explicitly requested - and like I mentioned, we can avoid the
> .encode_fh() call for local filesystems with a bit of work at the VFS
> layer.
> 
> OTOH, when you're running rsync in incremental mode, and detecting
> hardlinks, your point that "statx can be called millions of times per
> second" would apply just as much to the additional name_to_handle_at()
> call - we'd be nearly doubling their overhead for scanning files that
> don't need to be sent.

Hardlinked files are indicated by st_nlink > 1, not by requiring
userspace to store every st_ino/dev it sees and having to compare
the st_ino/dev of every newly stat()d inode against that ino/dev
cache.

We only need ino/dev/filehandles for hardlink path disambiguation.

IOWs, this use case does not need name_to_handle_at() for millions
of inodes - it is just needed on the regular file inodes that have
st_nlink > 1.

Hence even for workloads like rsync with hardlink detection, we
don't need filehandles for every inode being stat()d.  And that's
ignoring the fact that, outside of certain niche use cases,
hardlinks are rare.

I'm really struggling to see what filehandles in statx() actually
optimises in any meaningful manner....
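
To make that concrete: a scanner following this pattern pays for handle
lookups only on the rare st_nlink > 1 files (a minimal userspace sketch,
not rsync's actual code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    static void scan_entry(int dirfd, const char *name)
    {
            struct stat st;

            if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW))
                    return;

            if (S_ISREG(st.st_mode) && st.st_nlink > 1) {
                    /* Possible hardlink: only now fetch a filehandle to
                     * disambiguate multiple paths to the same inode. */
                    struct file_handle *fh =
                            malloc(sizeof(*fh) + MAX_HANDLE_SZ);
                    int mount_id;

                    if (!fh)
                            return;
                    fh->handle_bytes = MAX_HANDLE_SZ;
                    if (!name_to_handle_at(dirfd, name, fh, &mount_id, 0))
                            printf("%s: %u-byte handle\n",
                                   name, fh->handle_bytes);
                    free(fh);
            }
            /* st_nlink == 1: st_ino/st_dev suffice, no second syscall. */
    }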

> > And then comes the cost of encoding dynamically sized information in
> > struct statx - filehandles are not fixed size - and statx is most
> > definitely not set up or intended for dynamically sized attribute
> > data. This adds more complexity to statx because it wasn't designed
> > or intended to handle dynamically sized attributes. Optional
> > attributes, yes, but not attributes that might vary in size from fs
> > to fs or even inode type to inode type within a fileystem (e.g. dir
> > filehandles can, optionally, encode the parent inode in them).
> 
> Since it looks like expanding statx is not going to be quite as easy as
> hoped, I proposed elsewhere in the thread that we reserve a smaller
> fixed size in statx (32 bytes) and set a flag if it won't fit,
> indicating that userspace needs to fall back to name_to_handle_at().

struct btrfs_fid is 40 bytes in size. Sure, that's not all used for
name_to_handle_at(), but we already have in-kernel filehandles that
can optionally be configured to be bigger than 32 bytes...

> Stuffing a _dynamically_ sized attribute into statx would indeed be
> painful - I believe were always talking about a fixed size buffer in
> statx, the discussion's been over how big it needs to be...

The contents of the buffer is still dynamically sized, so there's
still a length attribute that needs to be emitted to userspace with
the buffer.

And then what happens with the next attribute that someone wants
statx() to expose that can be dynamically sized? Are we really
planning to allow the struct statx to be expanded indefinitely
with largely unused static data arrays?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 23:44                                                         ` Dave Chinner
@ 2023-12-13  0:00                                                           ` Kent Overstreet
  0 siblings, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-13  0:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Frank Filz, 'Theodore Ts'o',
	'Donald Buczek', linux-bcachefs, 'Stefan Krueger',
	'David Howells',
	linux-fsdevel

On Wed, Dec 13, 2023 at 10:44:07AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote:
> > Like Neil mentioned we won't even be fetching the fh if it wasn't
> > explicitly requested - and like I mentioned, we can avoid the
> > .encode_fh() call for local filesystems with a bit of work at the VFS
> > layer.
> > 
> > OTOH, when you're running rsync in incremental mode, and detecting
> > hardlinks, your point that "statx can be called millions of times per
> > second" would apply just as much to the additional name_to_handle_at()
> > call - we'd be nearly doubling their overhead for scanning files that
> > don't need to be sent.
> 
> Hardlinked files are indicated by st_nlink > 1, not by requiring
> userspace to store every st_ino/dev it sees and having to compare
> the st_ino/dev of every newly stat()d inode against that ino/dev
> cache.
> 
> We only need ino/dev/filehandles for hardlink path disambiguation.
> 
> IOWs, this use case does not need name_to_handle_at() for millions
> of inodes - it is just needed on the regular file inodes that have
> st_nlink > 1.

Ok yeah, that's a really good point. Perhaps name_to_handle_at() is
sufficient, then.

If so, maybe we can just add STATX_ATTR_INUM_NOT_UNIQUE and STATX_VOL
now, and leave STATX_HANDLE until someone discovers an application where
it actually does matter.
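
Roughly what userspace would do with those, with made-up placeholder
values for both constants (neither is merged ABI):

    #define _GNU_SOURCE
    #include <sys/stat.h>

    #define STATX_VOL                       0x00200000U     /* made up */
    #define STATX_ATTR_INUM_NOT_UNIQUE      (1ULL << 40)    /* made up */

    static int st_ino_is_unique(const struct statx *stx)
    {
            /* If the bit is set, st_ino must be qualified with the
             * volume ID (or a filehandle) before comparing files. */
            return !(stx->stx_attributes & STATX_ATTR_INUM_NOT_UNIQUE);
    }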

> > > And then comes the cost of encoding dynamically sized information in
> > > struct statx - filehandles are not fixed size - and statx is most
> > > definitely not set up or intended for dynamically sized attribute
> > > data. This adds more complexity to statx because it wasn't designed
> > > or intended to handle dynamically sized attributes. Optional
> > > attributes, yes, but not attributes that might vary in size from fs
> > > to fs or even inode type to inode type within a fileystem (e.g. dir
> > > filehandles can, optionally, encode the parent inode in them).
> > 
> > Since it looks like expanding statx is not going to be quite as easy as
> > hoped, I proposed elsewhere in the thread that we reserve a smaller
> > fixed size in statx (32 bytes) and set a flag if it won't fit,
> > indicating that userspace needs to fall back to name_to_handle_at().
> 
> struct btrfs_fid is 40 bytes in size. Sure, that's not all used for
> name_to_handle_at(), but we already have in-kernel filehandles that
> can optionally be configured to be bigger than 32 bytes...

The hell is all that for!? They never reuse inode numbers, why are there
generation numbers in there? And do they not have inode -> dirent
backrefs?

> > Stuffing a _dynamically_ sized attribute into statx would indeed be
> > painful - I believe were always talking about a fixed size buffer in
> > statx, the discussion's been over how big it needs to be...
> 
> The contents of the buffer is still dynamically sized, so there's
> still a length attribute that needs to be emitted to userspace with
> the buffer.

Correct

> And then what happens with the next attribute that someone wants
> statx() to expose that can be dynamically sized? Are we really
> planning to allow the struct statx to be expanded indefinitely
> with largely unused static data arrays?

Well, struct stat/statx is not a long lived object that anyone would
ever keep a lot of around; it's a short lived object that just needs to
be efficient to access and ABI stable, so yes, if this comes up again
that's what we should do.

The alternative would be adding fields with an [ offset, length ] scheme
and treating the statx buffer as a bump allocator, but simple and fast
to access beats space efficiency here...

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 23:43                                       ` Kent Overstreet
@ 2023-12-13  0:02                                         ` NeilBrown
  2023-12-13  0:14                                           ` Kent Overstreet
  2023-12-13 22:45                                           ` Andreas Dilger
  0 siblings, 2 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-13  0:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: David Howells, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel

On Wed, 13 Dec 2023, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > > On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
> > > > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > 
> > > > > I was chatting a bit with David Howells on IRC about this, and floated
> > > > > adding the file handle to statx. It looks like there's enough space
> > > > > reserved to make this feasible - probably going with a fixed maximum
> > > > > size of 128-256 bits.
> > > > 
> > > > We can always save the last bit to indicate extension space/extension record,
> > > > so we're not that strapped for space.
> > > 
> > > So we'll need that if we want to round trip NFSv4 filehandles, they
> > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > > bytes reserved).
> > > 
> > > Obvious question (Neil): do/will real world implementations ever come
> > > close to making use of this, or was this a "future proofing gone wild"
> > > thing?
> > 
> > I have no useful data.  I have seen lots of filehandles but I don't pay
> > much attention to their length.  Certainly some are longer than 32 bytes.
> > 
> > > 
> > > Say we do decide we want to spec it that large: _can_ we extend struct
> > > statx? I'm wondering if the userspace side was thought through, I'm
> > > sure glibc people will have something to say.
> > 
> > The man page says:
> > 
> >      Therefore, do not simply set mask to UINT_MAX (all bits set), as
> >      one or more bits may, in the future, be used to specify an
> >      extension to the buffer.
> > 
> > I suspect the glibc people read that.
> 
> The trouble is that C has no notion of which types are safe to pass
> across a dynamic library boundary, so if we increase the size of struct
> statx and someone's doing that, things will break in nasty ways.
> 

Maybe we don't increase the size of struct statx.
Maybe we declare

   struct statx2 {
     struct statx base;
     __u8 stx_handle[128];
   };

and pass that when we request STATX_HANDLE.

Or maybe there is a better way.  I'm sure the glibc developers have faced
this sort of problem before.

NeilBrown


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 23:06                                                     ` Dave Chinner
  2023-12-12 23:42                                                       ` Kent Overstreet
@ 2023-12-13  0:03                                                       ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: NeilBrown @ 2023-12-13  0:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Donald Buczek, linux-bcachefs, Stefan Krueger,
	David Howells, linux-fsdevel

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > 
> > > What you are suggesting is that we now duplicate filehandle encoding
> > > into every filesystem's statx() implementation.  That's a bad
> > > trade-off from a maintenance, testing and consistency POV because
> > > now we end up with lots of individual, filehandle encoding
> > > implementations in addition to the generic filehandle
> > > infrastructure that we all have to test and validate.
> > 
> > Not correct.  We are suggesting an interface, not an implementation.
> > Here you are proposing a suboptimal implementation, pointing out its
> > weakness, and suggesting the has consequences for the interface
> > proposal.  Is that the strawman fallacy?
> 
> No, you simply haven't followed deep enough into the rabbit hole to
> understand Kent was suggesting potential implementation details to
> address hot path performance concerns with filehandle encoding.

Yes, I missed that.  Sorry.

NeilBrown


> 
> > vfs_getattr_nosec could, after calling i_op->getattr, check if
> > STATX_HANDLE is set in request_mask but not in ->result_mask.
> > If so it could call exportfs_encode_fh() and handle the result.
> >
> > No filesystem need to be changed.
> 
> Well, yes, it's pretty damn obvious that is exactly what I've been
> advocating for here - if we are going to put filehandles in statx(),
> then it must use the same infrastructure as name_to_handle_at().
> i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
> filehandle.
> 
> The important discussion detail you've missed about
> exportfs_encode_fh() is that it *requires* adding a new indirect
> call (via export_ops->encode_fh) in the statx path to encode the
> filehandle, and that's exactly what Kent was suggesting we can code
> the implementation to avoid.
> 
> Avoiding an indirect function call is an implementation detail, not
> an interface design requirement.
> 
> And the only way to avoid adding new indirect calls to encoding
> filesystem specific filehandles is to implement the encoding in the
> existing individual filesystem i_op->getattr methods. i.e. duplicate
> the filehandle encoding in the statx path rather than use
> exportfs_encode_fh().....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-13  0:02                                         ` NeilBrown
@ 2023-12-13  0:14                                           ` Kent Overstreet
  2023-12-13 22:45                                           ` Andreas Dilger
  1 sibling, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-13  0:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, Donald Buczek, linux-bcachefs, Stefan Krueger,
	linux-fsdevel

On Wed, Dec 13, 2023 at 11:02:29AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> > > On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > > > On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
> > > > > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > > 
> > > > > > I was chatting a bit with David Howells on IRC about this, and floated
> > > > > > adding the file handle to statx. It looks like there's enough space
> > > > > > reserved to make this feasible - probably going with a fixed maximum
> > > > > > size of 128-256 bits.
> > > > > 
> > > > > We can always save the last bit to indicate extension space/extension record,
> > > > > so we're not that strapped for space.
> > > > 
> > > > So we'll need that if we want to round trip NFSv4 filehandles, they
> > > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > > > bytes reserved).
> > > > 
> > > > Obvious question (Neil): do/will real world implementations ever come
> > > > close to making use of this, or was this a "future proofing gone wild"
> > > > thing?
> > > 
> > > I have no useful data.  I have seen lots of filehandles but I don't pay
> > > much attention to their length.  Certainly some are longer than 32 bytes.
> > > 
> > > > 
> > > > Say we do decide we want to spec it that large: _can_ we extend struct
> > > > statx? I'm wondering if the userspace side was thought through, I'm
> > > > sure glibc people will have something to say.
> > > 
> > > The man page says:
> > > 
> > >      Therefore, do not simply set mask to UINT_MAX (all bits set), as
> > >      one or more bits may, in the future, be used to specify an
> > >      extension to the buffer.
> > > 
> > > I suspect the glibc people read that.
> > 
> > The trouble is that C has no notion of which types are safe to pass
> > across a dynamic library boundary, so if we increase the size of struct
> > statx and someone's doing that, things will break in nasty ways.
> > 
> 
> Maybe we don't increase the size of struct statx.
> Maybe we declare
> 
>    struct statx2 {
>      struct statx base;
>      __u8 stx_handle[128];
>    };
> 
> and pass that when we request STATX_HANDLE.

yeah, I think that's what would be required.

David was originally proposing having userspace just pass in a size_t
for the buffer size, and I really think that would've been better - if
this thing can ever change size, we need to make that explicit in the
API.
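
Something along those lines, i.e. a hedged sketch of a size-explicit
variant (statx2 here is hypothetical, not a real syscall):

    int statx2(int dirfd, const char *pathname, int flags,
               unsigned int mask, struct statx *buf, size_t bufsize);

The kernel would copy at most bufsize bytes and flag (via the result
mask or a size field) when the caller's buffer was too small, so the
struct could grow without scribbling past old callers' buffers.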

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-12 15:20                                             ` Theodore Ts'o
  2023-12-12 17:15                                               ` Frank Filz
@ 2023-12-13  7:37                                               ` Donald Buczek
  2023-12-13 12:28                                                 ` Kent Overstreet
  1 sibling, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-13  7:37 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, NeilBrown, Kent Overstreet, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel

On 12/12/23 16:20, Theodore Ts'o wrote:
> On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
>> On 12/12/23 06:53, Dave Chinner wrote:
>>
>>> So can someone please explain to me why we need to try to re-invent
>>> a generic filehandle concept in statx when we already have a
>>> have working and widely supported user API that provides exactly
>>> this functionality?
>>
>> name_to_handle_at() is fine, but userspace could profit from being
>> able to retrieve the filehandle together with the other metadata in
>> a single system call.
> 
> Can you say more?  What, specifically is the application that would
> want to do that, and is it really in such a hot path that it would be
> a user-visible improvement, let alone something that can actually
> be measured?

Probably not for the specific applications I mentioned (backup, mirror,
accounting). These are intended to run continuously, slowly and unnoticed
in the background, so they are memory and i/o throttled via cgroups anyway
and one is even using sleep after so-and-so many stat calls to reduce
its impact.

If they could tell a directory from a snapshot, I would probably stop them
from walking into snapshots. And if not, the snapshot id is all that is
needed to tell a clone in a snapshot from a hardlink. So these don't really
need the filehandle.

In the thread it was assumed that there are other (unspecified)
applications which need the filehandle and currently use name_to_handle_at().

I thought it was self-evident that a single syscall to retrieve all
information atomically is better than a set of syscalls. Each additional
syscall has overhead and you need to be concerned with the data changing
between the calls.

A userspace NFS server, as an example of an application where visible
performance is more relevant, was already mentioned by someone else.

Best
  Donald


> 
> 						- Ted
-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 16:08                                                           ` Kent Overstreet
  2023-12-12 16:30                                                             ` Miklos Szeredi
@ 2023-12-13  9:41                                                             ` Christian Brauner
  1 sibling, 0 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-13  9:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Miklos Szeredi, Amir Goldstein, Dave Chinner, NeilBrown,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

> But when you show up to a discussion that's been going on for a page,

On the bcachefs mailing list without fsdevel or anyone else Cced.

> where everything's been constructively gathering input, and you start
> namecalling - and crucially, _without giving any technical justification

I didn't namecall at all. I just didn't like the flag and called it
"ugly" and explicitly said that I didn't see the value it brings. And
I'm not the only one.

I've been pretty supportive of the other parts of this. So I truly don't
understand why you're turning this into a personal thing.

> for your opinions_ - that's just you being a dick.

Calling me a dick even if just implied is clearly crossing a line.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-12 21:46                                                       ` NeilBrown
@ 2023-12-13  9:47                                                         ` Christian Brauner
  2023-12-13 10:04                                                           ` Christian Brauner
  2023-12-14 22:47                                                           ` NeilBrown
  0 siblings, 2 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-13  9:47 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Kent Overstreet, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Wed, Dec 13, 2023 at 08:46:54AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > > Other poeple have been finding ways to contribute to the technical
> > > discussion; just calling things "ugly and broken" does not.
> > 
> > Kent, calm down please.  We call things "ugly and broken" all the
> > time.  That's just an opinion, you are free to argue it, and no need
> > to take it personally.
> 
> But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
is unpleasant to look at and why.  Or what exactly causes poor
functionality.

I said it's "ugly" and I doubted its value. I didn't call it "broken".
And I've been supportive of the other parts. Yet everyone seems fine
with having this spiral out of control to the point where I'm being
called a dick.

You had a private discussion on the bcachefs mailing list and it seems
you expected to show up here with a complete interface that we just all
pick up and merge even though this is a multi-year longstanding
argument.

I've been supportive of both the subvol addition to statx and the
STATX_* flag to indicate a subvolume root. Yet somehow you're all
extremely focussed on me disliking this flag.

> "ugly" and "broken" are not particularly useful words in a technical
> discussion.  I understand people want to use them, but they really need
> to be backed up with details.  It is details that matter.

I did say that I don't see the value. And it's perfectly ok for you to
reiterate why it provides value. Your whole discussion has been private
on some other mailing list without the relevant maintainers Cced.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-13  9:47                                                         ` Christian Brauner
@ 2023-12-13 10:04                                                           ` Christian Brauner
  2023-12-14 22:47                                                           ` NeilBrown
  1 sibling, 0 replies; 92+ messages in thread
From: Christian Brauner @ 2023-12-13 10:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Kent Overstreet, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Wed, Dec 13, 2023 at 10:48:01AM +0100, Christian Brauner wrote:
> On Wed, Dec 13, 2023 at 08:46:54AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > 
> > > > Other poeple have been finding ways to contribute to the technical
> > > > discussion; just calling things "ugly and broken" does not.
> > > 
> > > Kent, calm down please.  We call things "ugly and broken" all the
> > > time.  That's just an opinion, you are free to argue it, and no need
> > > to take it personally.
> > 
> > But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
> > is unpleasant to look at and why.  Or what exactly causes poor
> > functionality.
> 
> I said it's "ugly" and I doubted its value. I didn't call it "broken".

I see where you took that from. To be clear, what I meant by broken is
the device number switching that btrfs has been doing, which has caused
so much pain already and is at least partially responsible for this
endlessly long discussion. I didn't mean "broken" as in the flag is
broken. I acknowledge that I failed to make that clearer.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-13  7:37                                               ` Donald Buczek
@ 2023-12-13 12:28                                                 ` Kent Overstreet
  2023-12-13 13:48                                                   ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: Kent Overstreet @ 2023-12-13 12:28 UTC (permalink / raw)
  To: Donald Buczek
  Cc: Theodore Ts'o, Dave Chinner, NeilBrown, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel

On Wed, Dec 13, 2023 at 08:37:57AM +0100, Donald Buczek wrote:
> Probably not for the specific applications I mentioned (backup, mirror,
> accounting). These are intended to run continuously, slowly and unnoticed
> in the background, so they are memory and i/o throttled via cgroups anyway
> and one is even using sleep after so-and-so many stat calls to reduce
> its impact.
> 
> If they could tell a directory from a snapshot, I would probably stop them
> from walking into snapshots. And if not, the snapshot id is all that is
> needed to tell a clone in a snapshot from a hardlink. So these don't really
> need the filehandle.

Perhaps we should allocate a bit for differentiating a snapshot from a
non-snapshot subvolume?

> In the thread it was assumed that there are other (unspecified)
> applications which need the filehandle and currently use name_to_handle_at().
> 
> I though it was self-evident that a single syscall to retrieve all
> information atomically is better than a set of syscalls. Each additional
> syscall has overhead and you need to be concerned with the data changing
> between the calls.

All other things being equal, yeah it would be. But things are never
equal :)

Expanding struct statx is not going to be as easy as hoped, so we need
to be a bit careful how we use the remaining space, and since, as Dave
pointed out, the filehandle isn't needed for checking uniqueness unless
nlink > 1, it's not really a hotpath in any application I can think of.

(If anyone does know of an application where it might matter, now's the
time to bring it up!)

> Userspace nfs server as an example of an application, where visible
> performance is more relevant, was already mentioned by someone else.

I'd love to hear confirmation from someone more intimately familiar with
NFS, but AFAIK it shouldn't matter there; the filehandle exists to
support resuming IO or other operations to a file (because the server
can go away and come back). If all the client did was a stat, there's no
need for a filehandle - that's not needed until a file is opened.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: How to cope with subvolumes and snapshots on muti-user systems?
  2023-12-11 22:43                               ` NeilBrown
  2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
  2023-12-11 23:40                                 ` David Howells
@ 2023-12-13 12:43                                 ` Donald Buczek
  2 siblings, 0 replies; 92+ messages in thread
From: Donald Buczek @ 2023-12-13 12:43 UTC (permalink / raw)
  To: NeilBrown, Kent Overstreet; +Cc: linux-bcachefs, Stefan Krueger

On 12/11/23 23:43, NeilBrown wrote:
> On Sat, 09 Dec 2023, Kent Overstreet wrote:
>> On Fri, Dec 08, 2023 at 12:34:28PM +0100, Donald Buczek wrote:
>>> On 12/8/23 03:49, Kent Overstreet wrote:
>>>
>>>> We really only need 6 or 7 bits out of the inode number for sharding;
>>>> then 20-32 bits (nobody's going to have a billion snapshots; a million
>>>> is a more reasonable upper bound) for the subvolume ID leaves 30 to 40
>>>> bits for actually allocating inodes out of.
>>>>
>>>> That'll be enough for the vast, vast majority of users, but exceeding
>>>> that limit is already something we're technically capable of: we're
>>>> currently seeing filesystems well over 100 TB, petabyte range expected
>>>> as fsck gets more optimized and online fsck comes.
>>>
>>> 30 bits would not be enough even today:
>>>
>>> buczek@done:~$ df -i /amd/done/C/C8024
>>> Filesystem         Inodes     IUsed      IFree IUse% Mounted on
>>> /dev/md0       2187890304 618857441 1569032863   29% /amd/done/C/C8024
>>>
>>> So that's 32 bit on a random production system ( 618857441 == 0x24e303e1 ).
> 
> only 30 bits though.  So it is a long way before you use all 32 bits.
> How many volumes do you have?

Sorry, I didn't answer that, and maybe the question is obsolete. Technically, you are
correct, of course, though the "long way" is only a factor of 4. I just wanted to show that
live filesystems are on the order of 2^32 inodes.

I'm not sure I understand the question about how many volumes or why it matters. Sure, if
we'd combine the filesystems into one, we would be way over 2^32 files.

Subvolumes: none.

Filesystems: three 100TB filesystems like this on that particular server and an additional
three 100TB filesystems on another server for the same application. A 300TB
filesystem on another server for a similar application (but far fewer inodes currently).

We currently have about 112 filesystems >= 100TB online, if I counted correctly, and
many, many more smaller ones (30TB, 40TB, 50TB).

Best
  Donald


>>> And if the idea to produce unique inode numbers by hashing the filehandle into 64 bits is followed, collisions definitely need to be addressed. With 618857441 objects, the probability of a hash collision with 64 bits is already over 1% [1].
>>
>> Oof, thanks for the data point. Yeah, 64 bits is clearly not enough for
>> a unique identifier; time to start looking at how to extend statx.
>>
> 
> 64 should be plenty...
> 
> If you have 32 bits for free allocation, and 7 bits for sharding across
> 128 CPUs, then you can allocate many more than 4 billion inodes.  Maybe
> not the full 500 billion for 39 bits, but if you actually spread the
> load over all the shards, then certainly tens of billions.
> 
> If you use 22 bits for volume number and 42 bits for inodes in a volume,
> then you can spend 7 on sharding and still have room for 55 of Donald's
> filesystems to be allocated by each CPU.
> 
> And if Donald only needs thousands of volumes, not millions, then he
> could configure for a whole lot more headroom.
> 
> In fact, if you use the 64 bits of vfs_inode number by filling in bits from
> the fs-inode number from one end, and bits from the volume number from
> the other end, then you don't need to pre-configure how the 64 bits are
> shared.
> You record inum-bits and volnum bits in the filesystem metadata, and
> increase either as needed.  Once the sum hits 64, you start returning
> ENOSPC for new files or new volumes.
> 
> There will come a day when 64 bits is not enough for inodes in a single
> filesystem.  Today is not that day.
> 
> NeilBrown
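
To make the two-ended scheme quoted above concrete, a minimal sketch
(hypothetical helpers, not bcachefs code):

    #include <stdint.h>

    struct inum_space {
            unsigned inum_bits;     /* grows up from bit 0 */
            unsigned volnum_bits;   /* grows down from bit 63 */
    };

    /* Pack volume and per-volume inode numbers into one 64-bit ino. */
    static uint64_t make_vfs_ino(const struct inum_space *s,
                                 uint64_t volnum, uint64_t inum)
    {
            uint64_t v = s->volnum_bits ?
                    volnum << (64 - s->volnum_bits) : 0;
            return v | inum;
    }

    /* Either field may widen until the two ranges would meet;
     * after that, new files or new volumes get ENOSPC. */
    static int can_grow(const struct inum_space *s)
    {
            return s->inum_bits + s->volnum_bits < 64;
    }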

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-13 12:28                                                 ` Kent Overstreet
@ 2023-12-13 13:48                                                   ` Donald Buczek
  2023-12-19  7:41                                                     ` Donald Buczek
  0 siblings, 1 reply; 92+ messages in thread
From: Donald Buczek @ 2023-12-13 13:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Theodore Ts'o, Dave Chinner, NeilBrown, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel

On 12/13/23 13:28, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 08:37:57AM +0100, Donald Buczek wrote:
>> Probably not for the specific applications I mentioned (backup, mirror,
>> accounting). These are intended to run continuously, slowly and unnoticed
>> in the background, so they are memory and i/o throttled via cgroups anyway
>> and one is even using sleep after so-and-so many stat calls to reduce
>> its impact.
>>
>> If they could tell a directory from a snapshot, I would probably stop them
>> from walking into snapshots. And if not, the snapshot id is all that is
>> needed to tell a clone in a snapshot from a hardlink. So these don't really
>> need the filehandle.
> 
> Perhaps we should allocate a bit for differentiating a snapshot from a
>> non-snapshot subvolume?
Are there non-snapshot subvolumes?

From debugfs bcachefs/../btrees, I've got the impression that every
volume starts with a (single) snapshot.

new filesystem:

subvolumes
==========
u64s 10 type subvolume 0:1:0 len 0 ver 0: root 4096 snapshot id 4294967295 parent 0

snapshots
=========
u64s 10 type snapshot 0:4294967295:0 len 0 ver 0: is_subvol 1 deleted 0 parent          0 children          0          0 subvol 1 tree 1 depth 0 skiplist 0 0 0

`bcachefs subvolume create /mnt/v`

subvolumes
==========
u64s 10 type subvolume 0:1:0 len 0 ver 0: root 4096 snapshot id 4294967295 parent 0
u64s 10 type subvolume 0:2:0 len 0 ver 0: root 1207959552 snapshot id 4294967294 parent 0

snapshots
=========
u64s 10 type snapshot 0:4294967294:0 len 0 ver 0: is_subvol 1 deleted 0 parent          0 children          0          0 subvol 2 tree 2 depth 0 skiplist 0 0 0
u64s 10 type snapshot 0:4294967295:0 len 0 ver 0: is_subvol 1 deleted 0 parent          0 children          0          0 subvol 1 tree 1 depth 0 skiplist 0 0 0

`bcachefs subvolume snapshot /mnt/v /mnt/s`

subvolumes
==========
u64s 10 type subvolume 0:1:0 len 0 ver 0: root 4096 snapshot id 4294967295 parent 0
u64s 10 type subvolume 0:2:0 len 0 ver 0: root 1207959552 snapshot id 4294967292 parent 0
u64s 10 type subvolume 0:3:0 len 0 ver 0: root 1207959552 snapshot id 4294967293 parent 2

snapshot
========
u64s 10 type snapshot 0:4294967292:0 len 0 ver 0: is_subvol 1 deleted 0 parent 4294967294 children          0          0 subvol 2 tree 2 depth 1 skiplist 4294967294 4294967294 4294967294
u64s 10 type snapshot 0:4294967293:0 len 0 ver 0: is_subvol 1 deleted 0 parent 4294967294 children          0          0 subvol 3 tree 2 depth 1 skiplist 4294967294 4294967294 4294967294
u64s 10 type snapshot 0:4294967294:0 len 0 ver 0: is_subvol 0 deleted 0 parent          0 children 4294967293 4294967292 subvol 0 tree 2 depth 0 skiplist 0 0 0
u64s 10 type snapshot 0:4294967295:0 len 0 ver 0: is_subvol 1 deleted 0 parent          0 children          0          0 subvol 1 tree 1 depth 0 skiplist 0 0 0

Now reading and interpreting the filehandles:

/mnt/.     type  177 : 00 10 00 00 00 00 00 00 01 00 00 00 00 00 00 00 : inode 0000000000001000 subvolume 00000001 generation 00000000
/mnt/v     type  177 : 00 00 00 48 00 00 00 00 02 00 00 00 00 00 00 00 : inode 0000000048000000 subvolume 00000002 generation 00000000
/mnt/s     type  177 : 00 00 00 48 00 00 00 00 03 00 00 00 00 00 00 00 : inode 0000000048000000 subvolume 00000003 generation 00000000
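
For reference, the raw bytes above decode per the layout Kent gave
earlier (64-bit inode, 32-bit subvolume, 32-bit generation, all
little-endian); a minimal sketch:

    #include <stdint.h>
    #include <string.h>

    struct bch_fh { uint64_t inum; uint32_t subvol; uint32_t gen; };

    static struct bch_fh decode_fh(const unsigned char raw[16])
    {
            struct bch_fh fh;

            memcpy(&fh.inum,   raw,      8);    /* e.g. 0x48000000 */
            memcpy(&fh.subvol, raw + 8,  4);    /* e.g. 2 or 3 */
            memcpy(&fh.gen,    raw + 12, 4);    /* 0 above */
            return fh;
    }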


So is there really a type difference between the objects created by
`bcachefs subvolume create` and `bcachefs subvolume snapshot`? It appears
that they both point to a volume which points to a snapshot in the snapshot
tree.

Best

  Donald


>> In the thread it was assumed that there are other (unspecified)
>> applications which need the filehandle and currently use name_to_handle_at().
>>
>> I thought it was self-evident that a single syscall to retrieve all
>> information atomically is better than a set of syscalls. Each additional
>> syscall has overhead and you need to be concerned with the data changing
>> between the calls.
> 
> All other things being equal, yeah it would be. But things are never
> equal :)
> 
> Expanding struct statx is not going to be as easy as hoped, so we need
> to be a bit careful how we use the remaining space, and since, as Dave
> pointed out, the filehandle isn't needed for checking uniqueness unless
> nlink > 1, it's not really a hotpath in any application I can think of.
> 
> (If anyone does know of an application where it might matter, now's the
> time to bring it up!)
> 
>> A userspace NFS server, as an example of an application where visible
>> performance is more relevant, was already mentioned by someone else.
> 
> I'd love to hear confirmation from someone more intimately familiar with
> NFS, but AFAIK it shouldn't matter there; the filehandle exists to
> support resuming IO or other operations to a file (because the server
> can go away and come back). If all the client did was a stat, there's no
> need for a filehandle - that's not needed until a file is opened.

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-13  0:02                                         ` NeilBrown
  2023-12-13  0:14                                           ` Kent Overstreet
@ 2023-12-13 22:45                                           ` Andreas Dilger
  2023-12-13 23:24                                             ` Kent Overstreet
  1 sibling, 1 reply; 92+ messages in thread
From: Andreas Dilger @ 2023-12-13 22:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Kent Overstreet, David Howells, Donald Buczek, linux-bcachefs,
	Stefan Krueger, linux-fsdevel, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 5066 bytes --]

On Dec 12, 2023, at 5:02 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Wed, 13 Dec 2023, Kent Overstreet wrote:
>> On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
>>> On Wed, 13 Dec 2023, Kent Overstreet wrote:
>>>> On Mon, Dec 11, 2023 at 11:40:16PM +0000, David Howells wrote:
>>>>> Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>>>> 
>>>>>> I was chatting a bit with David Howells on IRC about this, and floated
>>>>>> adding the file handle to statx. It looks like there's enough space
>>>>>> reserved to make this feasible - probably going with a fixed maximum
>>>>>> size of 128-256 bits.
>>>>> 
>>>>> We can always save the last bit to indicate extension space/extension record,
>>>>> so we're not that strapped for space.
>>>> 
>>>> So we'll need that if we want to round trip NFSv4 filehandles, they
>>>> won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
>>>> bytes reserved).
>>>> 
>>>> Obvious question (Neil): do/will real world implementations ever come
>>>> close to making use of this, or was this a "future proofing gone wild"
>>>> thing?
>>> 
>>> I have no useful data.  I have seen lots of filehandles but I don't pay
>>> much attention to their length.  Certainly some are longer than 32 bytes.
>>> 
>>>> 
>>>> Say we do decide we want to spec it that large: _can_ we extend struct
>>>> statx? I'm wondering if the userspace side was thought through, I'm
>>>> sure glibc people will have something to say.
>>> 
>>> The man page says:
>>> 
>>>     Therefore, do not simply set mask to UINT_MAX (all bits set), as
>>>     one or more bits may, in the future, be used to specify an
>>>     extension to the buffer.
>>> 
>>> I suspect the glibc people read that.
>> 
>> The trouble is that C has no notion of which types are safe to pass
>> across a dynamic library boundary, so if we increase the size of struct
>> statx and someone's doing that, things will break in nasty ways.
>> 
> 
> Maybe we don't increase the size of struct statx.
> Maybe we declare
> 
>   struct statx2 {
>     struct statx base;
>     __u8 stx_handle[128];
>   };
> 
> and pass that when we request STATX_HANDLE.

This would be extremely fragile, in the sense that "struct statx2" breaks
if "struct statx" adds any fields.


Not getting into the question of whether the FH _should_ be added to statx
or not, but I wanted to chime in on the discussion about statx being able
to add new fields.

The answer is definitely yes.  Callers are expected to set STATX_* flags
for any fields that they are interested in, so if the caller were to set
STATX_FILE_HANDLE in the request, then it is expected they understand this
flag and have allocated a large enough struct statx to hold stx_file_handle
in the reply.

Apps that don't need/want stx_file_handle should not set STATX_FILE_HANDLE
and then the kernel won't spend time to fill this in, the same as other
fields that may have overhead, like stx_size.  That should avoid concerns
from Dave that this adds overhead to the hot path.  Most apps won't need or
want this field, but apps that *do* want it would be better served getting
it in a single syscall instead of two (and avoiding potential TOCTOU races).


It should be possible for userspace and the kernel to increase the size of
struct statx independently, and not have any issues.  If userspace requests
a field via STATX_* flags that the kernel doesn't understand, then it will
be masked out by the kernel, and any extended fields in the struct will not
be referenced.  Likewise, if the kernel understands more fields than what
userspace requests, it shouldn't spend time to fill in those fields, since
userspace will ignore them anyway, so it is just wasted cycles.
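
As a userspace sketch of that contract (STATX_FILE_HANDLE is a made-up
placeholder here, not merged ABI):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>

    #define STATX_FILE_HANDLE 0x00800000U   /* made-up value */

    static int stat_with_handle(const char *path, struct statx *stx)
    {
            if (statx(AT_FDCWD, path, 0,
                      STATX_BASIC_STATS | STATX_FILE_HANDLE, stx))
                    return -1;

            if (!(stx->stx_mask & STATX_FILE_HANDLE)) {
                    /* Older kernel masked the unknown bit out:
                     * fall back to name_to_handle_at(). */
            }
            return 0;
    }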



Not going into whether statx _should_ handle variable-sized fields, but
it _could_ do so, though it would make struct handling a bit more complex.

Adding "__u32 stx_file_handle_size" and "__u32 stx_file_handle_offset"
(relative to the start of the struct) to encode where the variable sized
handle data would be packed at the end after fixed-size fields (aligned
on a __u64 boundary to avoid issues if there are multiple such fields).
Userspace would indicate the maximum size of file handle it expects via
(stx_file_handle_offset + stx_file_handle_size).  The kernel would be at
liberty to pack "struct file_handle" at that offset, or after the end of
whatever fields are actually used, and userspace would use this to access
the handle.

Not pretty, but probably better than reserving a huge fixed size field
that is not needed for most cases (is NFS intending a server on every
sub-atomic particle in the universe?).  This would allow apps to access
fields after stx_file_handle_size/offset normally, and only compute a
simple offset to access the variable sized blob as needed.  The presence
of STATX_FILE_HANDLE would indicate if the size/offset fields were valid
upon return (since an old kernel would not zero the fields).
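
In code, the scheme sketched above would look roughly like this
(illustrative only; field names from the paragraph above, not an
agreed ABI):

    #include <linux/types.h>

    /* Hypothetical fixed-size fields appended to struct statx: */
    struct stx_fh_loc {
            __u32 stx_file_handle_size;     /* bytes of valid handle data */
            __u32 stx_file_handle_offset;   /* from start of the buffer,
                                             * __u64-aligned */
    };

    /* Userspace then locates the variable-sized handle like this: */
    static const void *stx_handle_data(const void *buf,
                                       const struct stx_fh_loc *loc)
    {
            return (const char *)buf + loc->stx_file_handle_offset;
    }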

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-13 22:45                                           ` Andreas Dilger
@ 2023-12-13 23:24                                             ` Kent Overstreet
  0 siblings, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-13 23:24 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: NeilBrown, David Howells, Donald Buczek, linux-bcachefs,
	Stefan Krueger, linux-fsdevel, Dave Chinner

On Wed, Dec 13, 2023 at 03:45:41PM -0700, Andreas Dilger wrote:
> It should be possible for userspace and the kernel to increase the size of
> struct statx independently, and not have any issues.  If userspace requests
> a field via STATX_* flags that the kernel doesn't understand, then it will
> be masked out by the kernel, and any extended fields in the struct will not
> be referenced.  Likewise, if the kernel understands more fields than what
> userspace requests, it shouldn't spend time to fill in those fields, since
> userspace will ignore them anyway, so it is just wasted cycles.

It's not the kernel <-> userspace boundary that's the problem, it's
userspace <-> userspace if you're changing the size of struct statx and
passing it across a dynamic lib boundary.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-13  9:47                                                         ` Christian Brauner
  2023-12-13 10:04                                                           ` Christian Brauner
@ 2023-12-14 22:47                                                           ` NeilBrown
  2023-12-15  0:36                                                             ` Kent Overstreet
  1 sibling, 1 reply; 92+ messages in thread
From: NeilBrown @ 2023-12-14 22:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Miklos Szeredi, Kent Overstreet, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Wed, 13 Dec 2023, Christian Brauner wrote:
> On Wed, Dec 13, 2023 at 08:46:54AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > 
> > > > Other poeple have been finding ways to contribute to the technical
> > > > discussion; just calling things "ugly and broken" does not.
> > > 
> > > Kent, calm down please.  We call things "ugly and broken" all the
> > > time.  That's just an opinion, you are free to argue it, and no need
> > > to take it personally.
> > 
> > But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
> > is unpleasant to look at and why.  Or what exactly causes poor
> > functionality.
> 
> > I said it's "ugly" and I doubted its value. I didn't call it "broken".
> And I've been supportive of the other parts. Yet everyone seems fine
> with having this spiral out of control to the point where I'm being
> called a dick.
> 
> > You had a private discussion on the bcachefs mailing list and it seems
> you expected to show up here with a complete interface that we just all
> pick up and merge even though this is a multi-year longstanding
> argument.

I thought I was still having that private discussion on the bcachefs
mailing list.  I didn't realise that fsdevel had been added.

NeilBrown


> 
> I've been supportive of both the subvol addition to statx and the
> STATX_* flag to indicate a subvolume root. Yet somehow you're all
> extremely focussed on me disliking this flag.
> 
> > "ugly" and "broken" are not particularly useful words in a technical
> > discussion.  I understand people want to use them, but they really need
> > to be backed up with details.  It is details that matter.
> 
> I did say that I don't see the value. And it's perfectly ok for you to
> reiterate why it provides value. Your whole discussion has been private
> on some other mailing list without the relevant maintainers Cced.
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
  2023-12-14 22:47                                                           ` NeilBrown
@ 2023-12-15  0:36                                                             ` Kent Overstreet
  0 siblings, 0 replies; 92+ messages in thread
From: Kent Overstreet @ 2023-12-15  0:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Miklos Szeredi, Amir Goldstein, Dave Chinner,
	Donald Buczek, linux-bcachefs, Stefan Krueger, David Howells,
	linux-fsdevel, Josef Bacik, linux-btrfs

On Fri, Dec 15, 2023 at 09:47:47AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Christian Brauner wrote:
> > On Wed, Dec 13, 2023 at 08:46:54AM +1100, NeilBrown wrote:
> > > On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> > > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > 
> > > > > Other people have been finding ways to contribute to the technical
> > > > > discussion; just calling things "ugly and broken" does not.
> > > > 
> > > > Kent, calm down please.  We call things "ugly and broken" all the
> > > > time.  That's just an opinion, you are free to argue it, and no need
> > > > to take it personally.
> > > 
> > > But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
> > > is unpleasant to look at and why.  Or what exactly causes poor
> > > functionality.
> > 
> > I said it's "ugly" and I doubted its value. I didn't call it "broken".
> > And I've been supportive of the other parts. Yet everyone seems fine
> > with having this spiral out of control to the point where I'm being
> > called a dick.
> > 
> > You had a private discussion on the bcachefs mailing list and it seems
> > you expected to show up here with a complete interface that we just all
> > pick up and merge even though this is a multi-year longstanding
> > argument.
> 
> I thought I was still having that private discussion on the bcachefs
> mailing list.  I didn't realise that fsdevel had been added.

I only thought that the discussion had gotten to the point where we had
something concrete enough to start to circulate more widely.

I wonder if some introspection could go into why this has been a
multi-year longstanding argument.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: file handle in statx
  2023-12-13 13:48                                                   ` Donald Buczek
@ 2023-12-19  7:41                                                     ` Donald Buczek
  0 siblings, 0 replies; 92+ messages in thread
From: Donald Buczek @ 2023-12-19  7:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Theodore Ts'o, Dave Chinner, NeilBrown, linux-bcachefs,
	Stefan Krueger, David Howells, linux-fsdevel

Dear Kent,

On 12/13/23 14:48, Donald Buczek wrote:
> On 12/13/23 13:28, Kent Overstreet wrote:
>> On Wed, Dec 13, 2023 at 08:37:57AM +0100, Donald Buczek wrote:
>>> Probably not for the specific applications I mentioned (backup, mirror,
>>> accounting). These are intended to run continuously, slowly and unnoticed
>>> in the background, so they are memory and i/o throttled via cgroups anyway
>>> and one is even using sleep after so-and-so many stat calls to reduce
>>> its impact.
>>>
>>> If they could tell a directory from a snapshot, I would probably stop them
>>> from walking into snapshots. And if not, the snapshot id is all that is
>>> needed to tell a clone in a snapshot from a hardlink. So these don't really
>>> need the filehandle.
>>
>> Perhaps we should allocate a bit for differentiating a snapshot from a
>> non-snapshot subvolume?

> Are there non-snapshot subvolumes?
> 
> From debugfs bcachefs/../btrees, I've got the impression that every
> volume starts with a (single) snapshot.
> [...]
> So is there really a type difference between the objects created by
> `bcachefs subvolume create` and `bcachefs subvolume snapshot`? It appears
> that they both point to a volume which points to a snapshot in the snapshot
> tree.


On second thought: even if my guesses were true, it would make sense to give
userspace the information. I'd probably code my backup tool to walk into volume
(or singleton snapshot) directories and copy the data just as it would with
conventional directories. There is no risk of seeing the same file multiple
times. Only the hardlink logic would need to respect these volume borders and
not treat entries in different volumes as hardlink candidates.
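
That border check would be tiny. A minimal sketch, assuming the
stx_subvol field and STATX_SUBVOL mask discussed in this thread (the
names and the mask value come from the proposal, so treat them as
assumptions, and they need headers that define them):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdbool.h>
    #include <sys/stat.h>

    #ifndef STATX_SUBVOL
    #define STATX_SUBVOL 0x00008000U  /* assumed value, from the proposal */
    #endif

    /* Two entries are hardlink candidates only if both the subvolume id
     * and the inode number match; the same inode number seen under a
     * different subvolume is a snapshot clone, not a hardlink. */
    static bool hardlink_candidates(int dirfd, const char *a, const char *b)
    {
        struct statx sa, sb;

        if (statx(dirfd, a, AT_SYMLINK_NOFOLLOW,
                  STATX_INO | STATX_SUBVOL, &sa) ||
            statx(dirfd, b, AT_SYMLINK_NOFOLLOW,
                  STATX_INO | STATX_SUBVOL, &sb))
            return false;

        return sa.stx_subvol == sb.stx_subvol && sa.stx_ino == sb.stx_ino;
    }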

Only for subvolumes which potentially duplicate data would we need
special coding to avoid copying the same data to the backup volume over
and over. Although we might already have a similar problem with
reflink copies.

Best
  Donald


> 
> Best
> 
>   Donald
> 
> 
>>> In the thread it was assumed that there are other (unspecified)
>>> applications which need the filehandle and currently use name_to_handle_at().
>>>
>>> I thought it was self-evident that a single syscall to retrieve all
>>> information atomically is better than a set of syscalls. Each additional
>>> syscall has overhead and you need to be concerned with the data changing
>>> between the calls.
>>
>> All other things being equal, yeah it would be. But things are never
>> equal :)
>>
>> Expanding struct statx is not going to be as easy as hoped, so we need
>> to be a bit careful how we use the remaining space; and since, as Dave
>> pointed out, the filehandle isn't needed for checking uniqueness unless
>> nlink > 1, it's not really a hotpath in any application I can think of.
>>
>> (If anyone does know of an application where it might matter, now's the
>> time to bring it up!)
>>
>>> A userspace NFS server, as an example of an application where visible
>>> performance is more relevant, was already mentioned by someone else.
>>
>> I'd love to hear confirmation from someone more intimately familiar with
>> NFS, but AFAIK it shouldn't matter there; the filehandle exists to
>> support resuming IO or other operations to a file (because the server
>> can go away and come back). If all the client did was a stat, there's no
>> need for a filehandle - that's not needed until a file is opened.
> 
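
For reference, the route userspace has today for the filehandle is
name_to_handle_at(); a minimal sketch (glibc, _GNU_SOURCE, error
handling mostly trimmed) of the extra per-entry syscall that the quoted
discussion weighs against a new statx field:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        struct file_handle *fh;
        int mount_id;

        if (argc < 2)
            return 1;

        /* MAX_HANDLE_SZ is the kernel's upper bound on handle size. */
        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        fh->handle_bytes = MAX_HANDLE_SZ;

        if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0)) {
            perror("name_to_handle_at");
            return 1;
        }
        printf("handle type %d, %u bytes\n",
               fh->handle_type, fh->handle_bytes);
        free(fh);
        return 0;
    }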

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2023-12-19  7:42 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-28  7:49 How to cope with subvolumes and snapshots on muti-user systems? Donald Buczek
2023-11-29 21:43 ` Kent Overstreet
2023-11-30  7:35   ` Donald Buczek
2023-11-30  7:39     ` Kent Overstreet
2023-11-30 20:37       ` NeilBrown
2023-12-04 10:47         ` Donald Buczek
2023-12-04 22:45           ` NeilBrown
2023-12-05 21:35             ` Donald Buczek
2023-12-05 22:01               ` NeilBrown
2023-12-07 11:53                 ` Donald Buczek
2023-12-08  1:16                   ` NeilBrown
2023-12-08  1:37                     ` Kent Overstreet
2023-12-08  2:13                       ` NeilBrown
2023-12-08  2:49                         ` Kent Overstreet
2023-12-08 11:34                           ` Donald Buczek
2023-12-08 20:02                             ` Kent Overstreet
2023-12-11 22:43                               ` NeilBrown
2023-12-11 23:32                                 ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
2023-12-11 23:53                                   ` NeilBrown
2023-12-12  0:05                                     ` Kent Overstreet
2023-12-12  0:59                                       ` NeilBrown
2023-12-12  1:10                                         ` Kent Overstreet
2023-12-12  2:13                                           ` NeilBrown
2023-12-12  2:24                                             ` Kent Overstreet
2023-12-12  9:08                                             ` Christian Brauner
2023-12-12  5:53                                         ` Dave Chinner
2023-12-12  6:32                                           ` Amir Goldstein
2023-12-12  8:56                                             ` Christian Brauner
2023-12-12 15:16                                               ` Kent Overstreet
2023-12-12 15:29                                                 ` Christian Brauner
2023-12-12 15:35                                                   ` Kent Overstreet
2023-12-12 15:38                                                     ` Miklos Szeredi
2023-12-12 15:43                                                       ` Kent Overstreet
2023-12-12 15:57                                                         ` Miklos Szeredi
2023-12-12 16:08                                                           ` Kent Overstreet
2023-12-12 16:30                                                             ` Miklos Szeredi
2023-12-12 16:41                                                               ` Kent Overstreet
2023-12-12 21:53                                                               ` NeilBrown
2023-12-13  9:41                                                             ` Christian Brauner
2023-12-12 21:46                                                       ` NeilBrown
2023-12-13  9:47                                                         ` Christian Brauner
2023-12-13 10:04                                                           ` Christian Brauner
2023-12-14 22:47                                                           ` NeilBrown
2023-12-15  0:36                                                             ` Kent Overstreet
2023-12-12  9:10                                             ` David Howells
2023-12-12  9:23                                               ` Christian Brauner
2023-12-12  9:28                                                 ` Miklos Szeredi
2023-12-12  9:35                                                   ` Christian Brauner
2023-12-12  9:42                                                     ` Miklos Szeredi
2023-12-12 13:47                                                       ` Christian Brauner
2023-12-12 14:06                                                         ` Miklos Szeredi
2023-12-12 15:24                                                           ` Christian Brauner
2023-12-12 15:28                                                     ` Kent Overstreet
2023-12-12  9:46                                               ` David Howells
2023-12-12  9:10                                           ` file handle in statx Donald Buczek
2023-12-12 15:20                                             ` Theodore Ts'o
2023-12-12 17:15                                               ` Frank Filz
2023-12-12 17:44                                                 ` Kent Overstreet
2023-12-12 18:17                                                   ` Amir Goldstein
2023-12-12 19:18                                                     ` Frank Filz
2023-12-12 20:59                                                 ` Dave Chinner
2023-12-12 21:57                                                   ` NeilBrown
2023-12-12 22:23                                                     ` Dave Chinner
2023-12-12 22:36                                                       ` NeilBrown
2023-12-12 22:39                                                       ` Kent Overstreet
2023-12-12 23:44                                                         ` Dave Chinner
2023-12-13  0:00                                                           ` Kent Overstreet
2023-12-13  7:37                                               ` Donald Buczek
2023-12-13 12:28                                                 ` Kent Overstreet
2023-12-13 13:48                                                   ` Donald Buczek
2023-12-19  7:41                                                     ` Donald Buczek
2023-12-12 15:21                                           ` file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?) Kent Overstreet
2023-12-12 20:48                                             ` Dave Chinner
2023-12-12 21:23                                               ` Kent Overstreet
2023-12-12 22:10                                                 ` Dave Chinner
2023-12-12 22:31                                                   ` NeilBrown
2023-12-12 23:06                                                     ` Dave Chinner
2023-12-12 23:42                                                       ` Kent Overstreet
2023-12-13  0:03                                                       ` NeilBrown
2023-12-12 22:00                                               ` NeilBrown
2023-12-12  7:03                                         ` David Howells
2023-12-12  0:25                                   ` David Howells
2023-12-11 23:40                                 ` David Howells
2023-12-12 20:59                                   ` Kent Overstreet
2023-12-12 22:57                                     ` NeilBrown
2023-12-12 23:43                                       ` Kent Overstreet
2023-12-13  0:02                                         ` NeilBrown
2023-12-13  0:14                                           ` Kent Overstreet
2023-12-13 22:45                                           ` Andreas Dilger
2023-12-13 23:24                                             ` Kent Overstreet
2023-12-13 12:43                                 ` How to cope with subvolumes and snapshots on muti-user systems? Donald Buczek
2023-11-30 20:36   ` NeilBrown
