Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
@ 2021-08-02  9:11 Forza
  2021-08-02 21:50 ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Forza @ 2021-08-02  9:11 UTC (permalink / raw)
  To: Amir Goldstein, NeilBrown
  Cc: Al Viro, Miklos Szeredi, Christoph Hellwig, Josef Bacik,
	J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba,
	linux-fsdevel, Linux NFS list, Btrfs BTRFS



---- From: Amir Goldstein <amir73il@gmail.com> -- Sent: 2021-08-02 - 09:54 ----

> On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, 02 Aug 2021, Al Viro wrote:
>> > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
>> >
>> > > It think we need to bite-the-bullet and decide that 64bits is not
>> > > enough, and in fact no number of bits will ever be enough.  overlayfs
>> > > makes this clear.
>> >
>> > Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
>> > early...
>> >
>> > > So I think we need to strongly encourage user-space to start using
>> > > name_to_handle_at() whenever there is a need to test if two things are
>> > > the same.
>> >
>> > ... and forgetting the inconvenient facts, such as that two different
>> > fhandles may correspond to the same object.
>>
>> Can they?  They certainly can if the "connectable" flag is passed.
>> name_to_handle_at() cannot set that flag.
>> nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
>> perfect.  However it is the best that can be done over NFS.
>>
>> Or is there some other situation where two different filehandles can be
>> reported for the same inode?
>>
>> Do you have a better suggestion?
>>
> 
> Neil,
> 
> I think the plan of "changing the world" is not very realistic.
> Sure, *some* tools can be changed, but all of them?
> 
> I went back to read your initial cover letter to understand the
> problem and what I mostly found there was that the view of
> /proc/x/mountinfo was hiding information that is important for
> some tools to understand what is going on with btrfs subvols.
> 
> Well I am not a UNIX history expert, but I suppose that
> /proc/PID/mountinfo was created because /proc/mounts and
> /proc/PID/mounts no longer provided tool with all the information
> about Linux mounts.
> 
> Maybe it's time for a new interface to query the more advanced
> sb/mount topology? fsinfo() maybe? With mount2 compatible API for
> traversing mounts that is not limited to reporting all entries inside
> a single page. I suppose we could go for some hierarchical view
> under /proc/PID/mounttree. I don't know - new API is hard.
> 
> In any case, instead of changing st_dev and st_ino or changing the
> world to work with file handles, why not add inode generation (and
> maybe subvol id) to statx().
> filesystem that care enough will provide this information and tools that
> care enough will use it.
> 
> Thanks,
> Amir.

I think it would be better and easier if nfs provided clients with virtual inodes and kept an internal mapping to actual filesystem inodes. Samba does this with the mount.cifs -o noserverino option, and as far as I know it works pretty well. 

This  could be made either as an export option (/mnt/foo *(noserverino) or like in the Samba case, a mount option. 

This way existing tools will continue to work and we don't have to reinvent various Linux subsystems. Because it's an option, users that don't use btrfs or other filesystems with snapshots, can simply skip it. 

Thanks, 
Forza 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  9:11 A Third perspective on BTRFS nfsd subvol dev/inode number issues Forza
@ 2021-08-02 21:50 ` NeilBrown
  0 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2021-08-02 21:50 UTC (permalink / raw)
  To: Forza
  Cc: Amir Goldstein, Al Viro, Miklos Szeredi, Christoph Hellwig,
	Josef Bacik, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Mon, 02 Aug 2021, Forza wrote:
> 
> ---- From: Amir Goldstein <amir73il@gmail.com> -- Sent: 2021-08-02 - 09:54 ----
> 
> > On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
> >>
> >> On Mon, 02 Aug 2021, Al Viro wrote:
> >> > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> >> >
> >> > > It think we need to bite-the-bullet and decide that 64bits is not
> >> > > enough, and in fact no number of bits will ever be enough.  overlayfs
> >> > > makes this clear.
> >> >
> >> > Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
> >> > early...
> >> >
> >> > > So I think we need to strongly encourage user-space to start using
> >> > > name_to_handle_at() whenever there is a need to test if two things are
> >> > > the same.
> >> >
> >> > ... and forgetting the inconvenient facts, such as that two different
> >> > fhandles may correspond to the same object.
> >>
> >> Can they?  They certainly can if the "connectable" flag is passed.
> >> name_to_handle_at() cannot set that flag.
> >> nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
> >> perfect.  However it is the best that can be done over NFS.
> >>
> >> Or is there some other situation where two different filehandles can be
> >> reported for the same inode?
> >>
> >> Do you have a better suggestion?
> >>
> > 
> > Neil,
> > 
> > I think the plan of "changing the world" is not very realistic.
> > Sure, *some* tools can be changed, but all of them?
> > 
> > I went back to read your initial cover letter to understand the
> > problem and what I mostly found there was that the view of
> > /proc/x/mountinfo was hiding information that is important for
> > some tools to understand what is going on with btrfs subvols.
> > 
> > Well I am not a UNIX history expert, but I suppose that
> > /proc/PID/mountinfo was created because /proc/mounts and
> > /proc/PID/mounts no longer provided tool with all the information
> > about Linux mounts.
> > 
> > Maybe it's time for a new interface to query the more advanced
> > sb/mount topology? fsinfo() maybe? With mount2 compatible API for
> > traversing mounts that is not limited to reporting all entries inside
> > a single page. I suppose we could go for some hierarchical view
> > under /proc/PID/mounttree. I don't know - new API is hard.
> > 
> > In any case, instead of changing st_dev and st_ino or changing the
> > world to work with file handles, why not add inode generation (and
> > maybe subvol id) to statx().
> > filesystem that care enough will provide this information and tools that
> > care enough will use it.
> > 
> > Thanks,
> > Amir.
> 
> I think it would be better and easier if nfs provided clients with
> virtual inodes and kept an internal mapping to actual filesystem
> inodes.  Samba does this with the mount.cifs -o noserverino option,
> and as far as I know it works pretty well. 

This approach does have it's place, but it is far from perfect.
POSIX expects an inode number to be unique and stable for the lifetime
of the file.  Different applications have different expectations ranging
from those which don't care at all to those which really need the full
lifetime and full uniqueness (like an indexing tool).

Dynamically provided inode numbers cannot provide the full guarantee.
If implemented on the NFS client, it could ensure two inodes that are in
cache have different inode numbers, and it could ensure inode numbers
are not reused for a very long time, but it could not ensure they remain
stable for the lifetime of the file.  To do that last, it would need
stable storage to keep a copy of all the metadata of the whole
filesystem.

Implementing it on the NFS server would provide fewer guarantees.  The
server has limited insight into the client-side cache, so inode numbers
might change while a file was in cache on the client.

NeilBrown



> 
> This  could be made either as an export option (/mnt/foo *(noserverino) or like in the Samba case, a mount option. 
> 
> This way existing tools will continue to work and we don't have to reinvent various Linux subsystems. Because it's an option, users that don't use btrfs or other filesystems with snapshots, can simply skip it. 
> 
> Thanks, 
> Forza 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 13:53                         ` Josef Bacik
@ 2021-08-03 22:29                           ` Qu Wenruo
  0 siblings, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2021-08-03 22:29 UTC (permalink / raw)
  To: Josef Bacik, Amir Goldstein, NeilBrown
  Cc: Al Viro, Miklos Szeredi, Christoph Hellwig, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS



On 2021/8/2 下午9:53, Josef Bacik wrote:
> On 8/2/21 3:54 AM, Amir Goldstein wrote:
>> On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
>>>
>>> On Mon, 02 Aug 2021, Al Viro wrote:
>>>> On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
>>>>
>>>>> It think we need to bite-the-bullet and decide that 64bits is not
>>>>> enough, and in fact no number of bits will ever be enough.  overlayfs
>>>>> makes this clear.
>>>>
>>>> Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
>>>> early...
>>>>
>>>>> So I think we need to strongly encourage user-space to start using
>>>>> name_to_handle_at() whenever there is a need to test if two things are
>>>>> the same.
>>>>
>>>> ... and forgetting the inconvenient facts, such as that two different
>>>> fhandles may correspond to the same object.
>>>
>>> Can they?  They certainly can if the "connectable" flag is passed.
>>> name_to_handle_at() cannot set that flag.
>>> nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
>>> perfect.  However it is the best that can be done over NFS.
>>>
>>> Or is there some other situation where two different filehandles can be
>>> reported for the same inode?
>>>
>>> Do you have a better suggestion?
>>>
>>
>> Neil,
>>
>> I think the plan of "changing the world" is not very realistic.
>> Sure, *some* tools can be changed, but all of them?
>>
>> I went back to read your initial cover letter to understand the
>> problem and what I mostly found there was that the view of
>> /proc/x/mountinfo was hiding information that is important for
>> some tools to understand what is going on with btrfs subvols.
>>
>> Well I am not a UNIX history expert, but I suppose that
>> /proc/PID/mountinfo was created because /proc/mounts and
>> /proc/PID/mounts no longer provided tool with all the information
>> about Linux mounts.
>>
>> Maybe it's time for a new interface to query the more advanced
>> sb/mount topology? fsinfo() maybe? With mount2 compatible API for
>> traversing mounts that is not limited to reporting all entries inside
>> a single page. I suppose we could go for some hierarchical view
>> under /proc/PID/mounttree. I don't know - new API is hard.
>>
>> In any case, instead of changing st_dev and st_ino or changing the
>> world to work with file handles, why not add inode generation (and
>> maybe subvol id) to statx().
>> filesystem that care enough will provide this information and tools that
>> care enough will use it.
>>
>
> Can y'all wait till I'm back from vacation, goddamn ;)
>
> This is what I'm aiming for, I spent some time looking at how many
> places we string parse /proc/<whatever>/mounts and my head hurts.
>
> Btrfs already has a reasonable solution for this, we have UUID's for
> everything.  UUID's aren't a strictly btrfs thing either, all the file
> systems have some sort of UUID identifier, hell its built into blkid.  I
> would love if we could do a better job about letting applications query
> information about where they are.  And we could expose this with the
> relatively common UUID format.  You ask what fs you're in, you get the
> FS UUID, and then if you're on Btrfs you get the specific subvolume UUID
> you're in.  That way you could do more fancy things like know if you've
> wandered into a new file system completely or just a different subvolume.

I'm completely on the side of using proper UUID.

But suddenly I find a problem for this, at least we still need something
like st_dev for real volume based snapshot.

One of the problem for real volume based snapshot is, the snapshoted
volume is completely the same filesystem, every binary is the same,
including UUID.

That means, the only way to distinguish such volumes is by st_dev.

For such pure UUID base solution, it's in fact unable to distinguish
them using just UUID.
Unless we have some device UUID to replace the old st_dev.

Thanks,
Qu

>
> We have to keep the st_ino/st_dev thing for backwards compatibility, but
> make it easier to get more info out of the file system.
>
> We could in theory expose just the subvolid also, since that's a nice
> simple u64, but it limits our ability to do new fancy shit in the
> future.  It's not a bad solution, but like I said I think we need to
> take a step back and figure out what problem we're specifically trying
> to solve, and work from there.  Starting from automounts and working our
> way back is not going very well.  Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 22:36                             ` NeilBrown
@ 2021-08-03  0:15                               ` J. Bruce Fields
  0 siblings, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2021-08-03  0:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Tue, Aug 03, 2021 at 08:36:44AM +1000, NeilBrown wrote:
> On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> > On Tue, Aug 03, 2021 at 07:59:30AM +1000, NeilBrown wrote:
> > > On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> > > > On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> > > > > On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > > > > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > > > > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > > > > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > > > > > will provide numbers that are unique until you get very large subvols,
> > > > > > > and very many subvols.
> > > > > > 
> > > > > > If you snapshot a filesystem, I'd expect, at least by default, that
> > > > > > inodes in the snapshot to stay the same as in the snapshotted
> > > > > > filesystem.
> > > > > 
> > > > > As I said: we need to challenge and revise user-space (and meat-space)
> > > > > expectations. 
> > > > 
> > > > The example that came to mind is people that export a snapshot, then
> > > > replace it with an updated snapshot, and expect that to be transparent
> > > > to clients.
> > > > 
> > > > Our client will error out with ESTALE if it notices an inode number
> > > > changed out from under it.
> > > 
> > > Will it?
> > 
> > See fs/nfs/inode.c:nfs_check_inode_attributes():
> > 
> > 	if (nfsi->fileid != fattr->fileid) {
> >                 /* Is this perhaps the mounted-on fileid? */
> >                 if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
> >                     nfsi->fileid == fattr->mounted_on_fileid)
> >                         return 0;
> >                 return -ESTALE;
> >         }
> 
> That code fires if the fileid (inode number) reported for a particular
> filehandle changes.  I'm saying that won't happen.
> 
> If you reflink (aka snaphot) a btrfs subtree (aka "subvol"), then the
> new sub-tree will ALREADY have different filehandles than the original
> subvol.

Whoops, you're right, sorry for the noise....

--b.

> Whether it has the same inode numbers or different ones is
> irrelevant to NFS.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 22:14                           ` J. Bruce Fields
@ 2021-08-02 22:36                             ` NeilBrown
  2021-08-03  0:15                               ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2021-08-02 22:36 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> On Tue, Aug 03, 2021 at 07:59:30AM +1000, NeilBrown wrote:
> > On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> > > On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> > > > On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > > > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > > > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > > > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > > > > will provide numbers that are unique until you get very large subvols,
> > > > > > and very many subvols.
> > > > > 
> > > > > If you snapshot a filesystem, I'd expect, at least by default, that
> > > > > inodes in the snapshot to stay the same as in the snapshotted
> > > > > filesystem.
> > > > 
> > > > As I said: we need to challenge and revise user-space (and meat-space)
> > > > expectations. 
> > > 
> > > The example that came to mind is people that export a snapshot, then
> > > replace it with an updated snapshot, and expect that to be transparent
> > > to clients.
> > > 
> > > Our client will error out with ESTALE if it notices an inode number
> > > changed out from under it.
> > 
> > Will it?
> 
> See fs/nfs/inode.c:nfs_check_inode_attributes():
> 
> 	if (nfsi->fileid != fattr->fileid) {
>                 /* Is this perhaps the mounted-on fileid? */
>                 if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
>                     nfsi->fileid == fattr->mounted_on_fileid)
>                         return 0;
>                 return -ESTALE;
>         }

That code fires if the fileid (inode number) reported for a particular
filehandle changes.  I'm saying that won't happen.

If you reflink (aka snaphot) a btrfs subtree (aka "subvol"), then the
new sub-tree will ALREADY have different filehandles than the original
subvol.  Whether it has the same inode numbers or different ones is
irrelevant to NFS.

(on reflection, I didn't say that as clearly as I could have done last time)

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 21:59                         ` NeilBrown
@ 2021-08-02 22:14                           ` J. Bruce Fields
  2021-08-02 22:36                             ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2021-08-02 22:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Tue, Aug 03, 2021 at 07:59:30AM +1000, NeilBrown wrote:
> On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> > On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> > > On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > > > will provide numbers that are unique until you get very large subvols,
> > > > > and very many subvols.
> > > > 
> > > > If you snapshot a filesystem, I'd expect, at least by default, that
> > > > inodes in the snapshot to stay the same as in the snapshotted
> > > > filesystem.
> > > 
> > > As I said: we need to challenge and revise user-space (and meat-space)
> > > expectations. 
> > 
> > The example that came to mind is people that export a snapshot, then
> > replace it with an updated snapshot, and expect that to be transparent
> > to clients.
> > 
> > Our client will error out with ESTALE if it notices an inode number
> > changed out from under it.
> 
> Will it?

See fs/nfs/inode.c:nfs_check_inode_attributes():

	if (nfsi->fileid != fattr->fileid) {
                /* Is this perhaps the mounted-on fileid? */
                if ((fattr->valid & NFS_ATTR_FATTR_MOUNTED_ON_FILEID) &&
                    nfsi->fileid == fattr->mounted_on_fileid)
                        return 0;
                return -ESTALE;
        }

--b.

> If the inode number changed, then the filehandle would change.
> Unless the filesystem were exported with subtreecheck, the old filehandle
> would continue to work (unless the old snapshot was deleted).  File-name
> lookups from the root would find new files...
> 
> "replace with an updated snapshot" is no different from "replace with an
> updated directory tree".  If you delete the old tree, then
> currently-open files will break.  If you don't you get a reasonably
> clean transition.
> 
> > 
> > I don't know if there are other such cases.  It seems like surprising
> > behavior to me, though.
> 
> If you refuse to risk breaking anything, then you cannot make progress.
> Providing people can choose when things break, and have advanced
> warning, they often cope remarkable well.
> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > --b.
> > 
> > > In btrfs, you DO NOT snapshot a FILESYSTEM.  Rather, you effectively
> > > create a 'reflink' for a subtree (only works on subtrees that have been
> > > correctly created with the poorly named "btrfs subvolume" command).
> > > 
> > > As with any reflink, the original has the same inode number that it did
> > > before, the new version has a different inode number (though in current
> > > BTRFS, half of the inode number is hidden from user-space, so it looks
> > > like the inode number hasn't changed).
> > 
> > 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 21:50                       ` J. Bruce Fields
@ 2021-08-02 21:59                         ` NeilBrown
  2021-08-02 22:14                           ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2021-08-02 21:59 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Tue, 03 Aug 2021, J. Bruce Fields wrote:
> On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> > On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > > will provide numbers that are unique until you get very large subvols,
> > > > and very many subvols.
> > > 
> > > If you snapshot a filesystem, I'd expect, at least by default, that
> > > inodes in the snapshot to stay the same as in the snapshotted
> > > filesystem.
> > 
> > As I said: we need to challenge and revise user-space (and meat-space)
> > expectations. 
> 
> The example that came to mind is people that export a snapshot, then
> replace it with an updated snapshot, and expect that to be transparent
> to clients.
> 
> Our client will error out with ESTALE if it notices an inode number
> changed out from under it.

Will it?  If the inode number changed, then the filehandle would change.
Unless the filesystem were exported with subtreecheck, the old filehandle
would continue to work (unless the old snapshot was deleted).  File-name
lookups from the root would find new files...

"replace with an updated snapshot" is no different from "replace with an
updated directory tree".  If you delete the old tree, then
currently-open files will break.  If you don't you get a reasonably
clean transition.

> 
> I don't know if there are other such cases.  It seems like surprising
> behavior to me, though.

If you refuse to risk breaking anything, then you cannot make progress.
Providing people can choose when things break, and have advanced
warning, they often cope remarkable well.

Thanks,
NeilBrown


> 
> --b.
> 
> > In btrfs, you DO NOT snapshot a FILESYSTEM.  Rather, you effectively
> > create a 'reflink' for a subtree (only works on subtrees that have been
> > correctly created with the poorly named "btrfs subvolume" command).
> > 
> > As with any reflink, the original has the same inode number that it did
> > before, the new version has a different inode number (though in current
> > BTRFS, half of the inode number is hidden from user-space, so it looks
> > like the inode number hasn't changed).
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 21:10                     ` NeilBrown
@ 2021-08-02 21:50                       ` J. Bruce Fields
  2021-08-02 21:59                         ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2021-08-02 21:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Tue, Aug 03, 2021 at 07:10:44AM +1000, NeilBrown wrote:
> On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > > will provide numbers that are unique until you get very large subvols,
> > > and very many subvols.
> > 
> > If you snapshot a filesystem, I'd expect, at least by default, that
> > inodes in the snapshot to stay the same as in the snapshotted
> > filesystem.
> 
> As I said: we need to challenge and revise user-space (and meat-space)
> expectations. 

The example that came to mind is people that export a snapshot, then
replace it with an updated snapshot, and expect that to be transparent
to clients.

Our client will error out with ESTALE if it notices an inode number
changed out from under it.

I don't know if there are other such cases.  It seems like surprising
behavior to me, though.

--b.

> In btrfs, you DO NOT snapshot a FILESYSTEM.  Rather, you effectively
> create a 'reflink' for a subtree (only works on subtrees that have been
> correctly created with the poorly named "btrfs subvolume" command).
> 
> As with any reflink, the original has the same inode number that it did
> before, the new version has a different inode number (though in current
> BTRFS, half of the inode number is hidden from user-space, so it looks
> like the inode number hasn't changed).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  7:15                   ` Martin Steigerwald
@ 2021-08-02 21:40                     ` NeilBrown
  0 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2021-08-02 21:40 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba,
	linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Mon, 02 Aug 2021, Martin Steigerwald wrote:
> Hi Neil!
> 
> Wow, this is a bit overwhelming for me. However, I got a very specific 
> question for userspace developers in order to probably provide valuable 
> input to the KDE Baloo desktop search developers:
> 
> NeilBrown - 02.08.21, 06:18:29 CEST:
> > The "obvious" choice for a replacement is the file handle provided by
> > name_to_handle_at() (falling back to st_ino if name_to_handle_at isn't
> > supported by the filesystem).  This returns an extensible opaque
> > byte-array.  It is *already* more reliable than st_ino.  Comparing
> > st_ino is only a reliable way to check if two files are the same if
> > you have both of them open.  If you don't, then one of the files
> > might have been deleted and the inode number reused for the other.  A
> > filehandle contains a generation number which protects against this.
> > 
> > So I think we need to strongly encourage user-space to start using
> > name_to_handle_at() whenever there is a need to test if two things are
> > the same.
> 
> How could that work for Baloo's use case to see whether a file it 
> encounters is already in its database or whether it is a new file.
> 
> Would Baloo compare the whole file handle or just certain fields or make a 
> hash of the filehandle or what ever? Could you, in pseudo code or 
> something, describe the approach you'd suggest. I'd then share it on:

Yes, the whole filehandle.

 struct file_handle {
        unsigned int handle_bytes; /* Size of f_handle [in, out] */
        int           handle_type;    /* Handle type [out] */
        unsigned char f_handle[0]; /* File identifier (sized by
                                     caller) [out] */
 };

i.e.  compare handle_type, handle_bytes, and handle_bytes worth of
f_handle.
This file_handle is local to the filesytem.  Two different filesystems
can use the same filehandle for different files.  So the identity of the
filesystem need to be combined with the file_handle.

> 
> Bug 438434 - Baloo appears to be indexing twice the number of files than 
> are actually in my home directory
> 
> https://bugs.kde.org/438434

This bug wouldn't be address by using the filehandle.  Using a
filehandle allows you to compare two files within a single filesystem.
This bug is about comparing two filesystems either side of a reboot, to
see if they are the same.

As has already been mentioned in that bug, statfs().f_fsid is the best
solution (unless comparing the mount point is satisfactory).

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  7:54                       ` Amir Goldstein
  2021-08-02 13:53                         ` Josef Bacik
  2021-08-02 14:47                         ` Frank Filz
@ 2021-08-02 21:24                         ` NeilBrown
  2 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2021-08-02 21:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Al Viro, Miklos Szeredi, Christoph Hellwig, Josef Bacik,
	J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba,
	linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Mon, 02 Aug 2021, Amir Goldstein wrote:
> On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
> >
> > On Mon, 02 Aug 2021, Al Viro wrote:
> > > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > >
> > > > It think we need to bite-the-bullet and decide that 64bits is not
> > > > enough, and in fact no number of bits will ever be enough.  overlayfs
> > > > makes this clear.
> > >
> > > Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
> > > early...
> > >
> > > > So I think we need to strongly encourage user-space to start using
> > > > name_to_handle_at() whenever there is a need to test if two things are
> > > > the same.
> > >
> > > ... and forgetting the inconvenient facts, such as that two different
> > > fhandles may correspond to the same object.
> >
> > Can they?  They certainly can if the "connectable" flag is passed.
> > name_to_handle_at() cannot set that flag.
> > nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
> > perfect.  However it is the best that can be done over NFS.
> >
> > Or is there some other situation where two different filehandles can be
> > reported for the same inode?
> >
> > Do you have a better suggestion?
> >
> 
> Neil,
> 
> I think the plan of "changing the world" is not very realistic.

I disagree.  It has happened before, it will happen again.  The only
difference about my proposal is that I'm suggesting the change be
proactive rather than reactive.

> Sure, *some* tools can be changed, but all of them?

We only need to change the tools that notice there is a problem.  So it
is important to minimize the effect on existing tools, even when we
cannot reduce it to zero.  We then fix things that are likely to see a
problem, or that actually do.  And we clearly document the behaviour and
how to deal with it, for code that we cannot directly affect.

Remember: there is ALREADY breakage that has been fixed.  btrfs does
*not* behave like a "normal" filesystem.  Nor does NFS.  Multiple tools
have been adjusted to work with them.  Let's not pretend that will never
happen again, but instead use the dynamic to drive evolution in the way
we choose.

> 
> I went back to read your initial cover letter to understand the
> problem and what I mostly found there was that the view of
> /proc/x/mountinfo was hiding information that is important for
> some tools to understand what is going on with btrfs subvols.

That was where I started, but not where I ended.  There are *lots* of
places that currently report inconsistent information for btrfs subvols.

> 
> Well I am not a UNIX history expert, but I suppose that
> /proc/PID/mountinfo was created because /proc/mounts and
> /proc/PID/mounts no longer provided tool with all the information
> about Linux mounts.
> 
> Maybe it's time for a new interface to query the more advanced
> sb/mount topology? fsinfo() maybe? With mount2 compatible API for
> traversing mounts that is not limited to reporting all entries inside
> a single page. I suppose we could go for some hierarchical view
> under /proc/PID/mounttree. I don't know - new API is hard.

Yes, exactly - but not just for mounts.  Yes, we need new APIs (Because
the old ones have been broken in various ways).  That is exactly what
I'm proposing.  But "fixing" mountinfo turns out to be little more than
rearranging deck-chairs on the Titanic.

> 
> In any case, instead of changing st_dev and st_ino or changing the
> world to work with file handles, why not add inode generation (and
> maybe subvol id) to statx().

The enormous benefit of filehandles is that they are supported by
kernels running today.  As others have commented, they also work over
NFS.
But I would be quite happy to see more information made available
through statx - providing the meaning of that information was clearly
specified - both what can be assumed about it and what cannot.

Thanks,
NeilBrown

> filesystem that care enough will provide this information and tools that
> care enough will use it.
> 
> Thanks,
> Amir.
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 12:39                   ` J. Bruce Fields
  2021-08-02 20:32                     ` Patrick Goetz
@ 2021-08-02 21:10                     ` NeilBrown
  2021-08-02 21:50                       ` J. Bruce Fields
  1 sibling, 1 reply; 21+ messages in thread
From: NeilBrown @ 2021-08-02 21:10 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Mon, 02 Aug 2021, J. Bruce Fields wrote:
> On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> > For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> > the inode should become (file.objectid ^ swab64(root.objectid)).  This
> > will provide numbers that are unique until you get very large subvols,
> > and very many subvols.
> 
> If you snapshot a filesystem, I'd expect, at least by default, that
> inodes in the snapshot to stay the same as in the snapshotted
> filesystem.

As I said: we need to challenge and revise user-space (and meat-space)
expectations. 

In btrfs, you DO NOT snapshot a FILESYSTEM.  Rather, you effectively
create a 'reflink' for a subtree (only works on subtrees that have been
correctly created with the poorly named "btrfs subvolume" command).

As with any reflink, the original has the same inode number that it did
before, the new version has a different inode number (though in current
BTRFS, half of the inode number is hidden from user-space, so it looks
like the inode number hasn't changed).

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 20:32                     ` Patrick Goetz
@ 2021-08-02 20:41                       ` J. Bruce Fields
  0 siblings, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2021-08-02 20:41 UTC (permalink / raw)
  To: Patrick Goetz
  Cc: NeilBrown, Miklos Szeredi, Al Viro, Christoph Hellwig,
	Josef Bacik, Chuck Lever, Chris Mason, David Sterba,
	linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Mon, Aug 02, 2021 at 03:32:45PM -0500, Patrick Goetz wrote:
> On 8/2/21 7:39 AM, J. Bruce Fields wrote:
> >On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> >>For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> >>the inode should become (file.objectid ^ swab64(root.objectid)).  This
> >>will provide numbers that are unique until you get very large subvols,
> >>and very many subvols.
> >
> >If you snapshot a filesystem, I'd expect, at least by default, that
> >inodes in the snapshot to stay the same as in the snapshotted
> >filesystem.
> 
> For copy on right systems like ZFS, how could it be otherwise?

I'm reacting to Neil's suggesting above, which (as I understand it)
would result in different inode numbers.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02 12:39                   ` J. Bruce Fields
@ 2021-08-02 20:32                     ` Patrick Goetz
  2021-08-02 20:41                       ` J. Bruce Fields
  2021-08-02 21:10                     ` NeilBrown
  1 sibling, 1 reply; 21+ messages in thread
From: Patrick Goetz @ 2021-08-02 20:32 UTC (permalink / raw)
  To: J. Bruce Fields, NeilBrown
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS



On 8/2/21 7:39 AM, J. Bruce Fields wrote:
> On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
>> For btrfs, the "location" is root.objectid ++ file.objectid.  I think
>> the inode should become (file.objectid ^ swab64(root.objectid)).  This
>> will provide numbers that are unique until you get very large subvols,
>> and very many subvols.
> 
> If you snapshot a filesystem, I'd expect, at least by default, that
> inodes in the snapshot to stay the same as in the snapshotted
> filesystem.
> 
> --b.
> 

For copy on right systems like ZFS, how could it be otherwise?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  7:54                       ` Amir Goldstein
  2021-08-02 13:53                         ` Josef Bacik
@ 2021-08-02 14:47                         ` Frank Filz
  2021-08-02 21:24                         ` NeilBrown
  2 siblings, 0 replies; 21+ messages in thread
From: Frank Filz @ 2021-08-02 14:47 UTC (permalink / raw)
  To: 'Amir Goldstein', 'NeilBrown'
  Cc: 'Al Viro', 'Miklos Szeredi',
	'Christoph Hellwig', 'Josef Bacik',
	'J. Bruce Fields', 'Chuck Lever',
	'Chris Mason', 'David Sterba',
	'linux-fsdevel', 'Linux NFS list',
	'Btrfs BTRFS'

> In any case, instead of changing st_dev and st_ino or changing the world to
> work with file handles, why not add inode generation (and maybe subvol id) to
> statx().
> filesystem that care enough will provide this information and tools that care
> enough will use it.

And how is NFS (especially V2 and V3 - V4.2 at least can add attributes) going to provide these values for statx if applications are going to start depending on them, and especially, will this work for those applications that need to distinguish inodes that are working on an NFS exported btrfs filesystem?

Frank


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  7:54                       ` Amir Goldstein
@ 2021-08-02 13:53                         ` Josef Bacik
  2021-08-03 22:29                           ` Qu Wenruo
  2021-08-02 14:47                         ` Frank Filz
  2021-08-02 21:24                         ` NeilBrown
  2 siblings, 1 reply; 21+ messages in thread
From: Josef Bacik @ 2021-08-02 13:53 UTC (permalink / raw)
  To: Amir Goldstein, NeilBrown
  Cc: Al Viro, Miklos Szeredi, Christoph Hellwig, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On 8/2/21 3:54 AM, Amir Goldstein wrote:
> On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, 02 Aug 2021, Al Viro wrote:
>>> On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
>>>
>>>> It think we need to bite-the-bullet and decide that 64bits is not
>>>> enough, and in fact no number of bits will ever be enough.  overlayfs
>>>> makes this clear.
>>>
>>> Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
>>> early...
>>>
>>>> So I think we need to strongly encourage user-space to start using
>>>> name_to_handle_at() whenever there is a need to test if two things are
>>>> the same.
>>>
>>> ... and forgetting the inconvenient facts, such as that two different
>>> fhandles may correspond to the same object.
>>
>> Can they?  They certainly can if the "connectable" flag is passed.
>> name_to_handle_at() cannot set that flag.
>> nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
>> perfect.  However it is the best that can be done over NFS.
>>
>> Or is there some other situation where two different filehandles can be
>> reported for the same inode?
>>
>> Do you have a better suggestion?
>>
> 
> Neil,
> 
> I think the plan of "changing the world" is not very realistic.
> Sure, *some* tools can be changed, but all of them?
> 
> I went back to read your initial cover letter to understand the
> problem and what I mostly found there was that the view of
> /proc/x/mountinfo was hiding information that is important for
> some tools to understand what is going on with btrfs subvols.
> 
> Well I am not a UNIX history expert, but I suppose that
> /proc/PID/mountinfo was created because /proc/mounts and
> /proc/PID/mounts no longer provided tool with all the information
> about Linux mounts.
> 
> Maybe it's time for a new interface to query the more advanced
> sb/mount topology? fsinfo() maybe? With mount2 compatible API for
> traversing mounts that is not limited to reporting all entries inside
> a single page. I suppose we could go for some hierarchical view
> under /proc/PID/mounttree. I don't know - new API is hard.
> 
> In any case, instead of changing st_dev and st_ino or changing the
> world to work with file handles, why not add inode generation (and
> maybe subvol id) to statx().
> filesystem that care enough will provide this information and tools that
> care enough will use it.
> 

Can y'all wait till I'm back from vacation, goddamn ;)

This is what I'm aiming for, I spent some time looking at how many 
places we string parse /proc/<whatever>/mounts and my head hurts.

Btrfs already has a reasonable solution for this, we have UUID's for 
everything.  UUID's aren't a strictly btrfs thing either, all the file 
systems have some sort of UUID identifier, hell its built into blkid.  I 
would love if we could do a better job about letting applications query 
information about where they are.  And we could expose this with the 
relatively common UUID format.  You ask what fs you're in, you get the 
FS UUID, and then if you're on Btrfs you get the specific subvolume UUID 
you're in.  That way you could do more fancy things like know if you've 
wandered into a new file system completely or just a different subvolume.

We have to keep the st_ino/st_dev thing for backwards compatibility, but 
make it easier to get more info out of the file system.

We could in theory expose just the subvolid also, since that's a nice 
simple u64, but it limits our ability to do new fancy shit in the 
future.  It's not a bad solution, but like I said I think we need to 
take a step back and figure out what problem we're specifically trying 
to solve, and work from there.  Starting from automounts and working our 
way back is not going very well.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
  2021-08-02  5:25                   ` Al Viro
  2021-08-02  7:15                   ` Martin Steigerwald
@ 2021-08-02 12:39                   ` J. Bruce Fields
  2021-08-02 20:32                     ` Patrick Goetz
  2021-08-02 21:10                     ` NeilBrown
  2 siblings, 2 replies; 21+ messages in thread
From: J. Bruce Fields @ 2021-08-02 12:39 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Al Viro, Christoph Hellwig, Josef Bacik,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> For btrfs, the "location" is root.objectid ++ file.objectid.  I think
> the inode should become (file.objectid ^ swab64(root.objectid)).  This
> will provide numbers that are unique until you get very large subvols,
> and very many subvols.

If you snapshot a filesystem, I'd expect, at least by default, that
inodes in the snapshot to stay the same as in the snapshotted
filesystem.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  5:40                     ` NeilBrown
@ 2021-08-02  7:54                       ` Amir Goldstein
  2021-08-02 13:53                         ` Josef Bacik
                                           ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Amir Goldstein @ 2021-08-02  7:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Al Viro, Miklos Szeredi, Christoph Hellwig, Josef Bacik,
	J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba,
	linux-fsdevel, Linux NFS list, Btrfs BTRFS

On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 02 Aug 2021, Al Viro wrote:
> > On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> >
> > > It think we need to bite-the-bullet and decide that 64bits is not
> > > enough, and in fact no number of bits will ever be enough.  overlayfs
> > > makes this clear.
> >
> > Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
> > early...
> >
> > > So I think we need to strongly encourage user-space to start using
> > > name_to_handle_at() whenever there is a need to test if two things are
> > > the same.
> >
> > ... and forgetting the inconvenient facts, such as that two different
> > fhandles may correspond to the same object.
>
> Can they?  They certainly can if the "connectable" flag is passed.
> name_to_handle_at() cannot set that flag.
> nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
> perfect.  However it is the best that can be done over NFS.
>
> Or is there some other situation where two different filehandles can be
> reported for the same inode?
>
> Do you have a better suggestion?
>

Neil,

I think the plan of "changing the world" is not very realistic.
Sure, *some* tools can be changed, but all of them?

I went back to read your initial cover letter to understand the
problem and what I mostly found there was that the view of
/proc/x/mountinfo was hiding information that is important for
some tools to understand what is going on with btrfs subvols.

Well I am not a UNIX history expert, but I suppose that
/proc/PID/mountinfo was created because /proc/mounts and
/proc/PID/mounts no longer provided tool with all the information
about Linux mounts.

Maybe it's time for a new interface to query the more advanced
sb/mount topology? fsinfo() maybe? With mount2 compatible API for
traversing mounts that is not limited to reporting all entries inside
a single page. I suppose we could go for some hierarchical view
under /proc/PID/mounttree. I don't know - new API is hard.

In any case, instead of changing st_dev and st_ino or changing the
world to work with file handles, why not add inode generation (and
maybe subvol id) to statx().
filesystem that care enough will provide this information and tools that
care enough will use it.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
  2021-08-02  5:25                   ` Al Viro
@ 2021-08-02  7:15                   ` Martin Steigerwald
  2021-08-02 21:40                     ` NeilBrown
  2021-08-02 12:39                   ` J. Bruce Fields
  2 siblings, 1 reply; 21+ messages in thread
From: Martin Steigerwald @ 2021-08-02  7:15 UTC (permalink / raw)
  To: Miklos Szeredi, NeilBrown
  Cc: Al Viro, Christoph Hellwig, Josef Bacik, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

Hi Neil!

Wow, this is a bit overwhelming for me. However, I got a very specific 
question for userspace developers in order to probably provide valuable 
input to the KDE Baloo desktop search developers:

NeilBrown - 02.08.21, 06:18:29 CEST:
> The "obvious" choice for a replacement is the file handle provided by
> name_to_handle_at() (falling back to st_ino if name_to_handle_at isn't
> supported by the filesystem).  This returns an extensible opaque
> byte-array.  It is *already* more reliable than st_ino.  Comparing
> st_ino is only a reliable way to check if two files are the same if
> you have both of them open.  If you don't, then one of the files
> might have been deleted and the inode number reused for the other.  A
> filehandle contains a generation number which protects against this.
> 
> So I think we need to strongly encourage user-space to start using
> name_to_handle_at() whenever there is a need to test if two things are
> the same.

How could that work for Baloo's use case to see whether a file it 
encounters is already in its database or whether it is a new file.

Would Baloo compare the whole file handle or just certain fields or make a 
hash of the filehandle or what ever? Could you, in pseudo code or 
something, describe the approach you'd suggest. I'd then share it on:

Bug 438434 - Baloo appears to be indexing twice the number of files than 
are actually in my home directory

https://bugs.kde.org/438434

Best,
-- 
Martin



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  5:25                   ` Al Viro
@ 2021-08-02  5:40                     ` NeilBrown
  2021-08-02  7:54                       ` Amir Goldstein
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2021-08-02  5:40 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Christoph Hellwig, Josef Bacik, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Mon, 02 Aug 2021, Al Viro wrote:
> On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:
> 
> > It think we need to bite-the-bullet and decide that 64bits is not
> > enough, and in fact no number of bits will ever be enough.  overlayfs
> > makes this clear.
> 
> Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
> early...
> 
> > So I think we need to strongly encourage user-space to start using
> > name_to_handle_at() whenever there is a need to test if two things are
> > the same.
> 
> ... and forgetting the inconvenient facts, such as that two different
> fhandles may correspond to the same object.

Can they?  They certainly can if the "connectable" flag is passed.
name_to_handle_at() cannot set that flag.
nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
perfect.  However it is the best that can be done over NFS.

Or is there some other situation where two different filehandles can be
reported for the same inode?

Do you have a better suggestion?

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
@ 2021-08-02  5:25                   ` Al Viro
  2021-08-02  5:40                     ` NeilBrown
  2021-08-02  7:15                   ` Martin Steigerwald
  2021-08-02 12:39                   ` J. Bruce Fields
  2 siblings, 1 reply; 21+ messages in thread
From: Al Viro @ 2021-08-02  5:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: Miklos Szeredi, Christoph Hellwig, Josef Bacik, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:

> It think we need to bite-the-bullet and decide that 64bits is not
> enough, and in fact no number of bits will ever be enough.  overlayfs
> makes this clear.

Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
early...

> So I think we need to strongly encourage user-space to start using
> name_to_handle_at() whenever there is a need to test if two things are
> the same.

... and forgetting the inconvenient facts, such as that two different
fhandles may correspond to the same object.

> I accept that I'm proposing some BIG changes here, and they might break
> things.  But btrfs is already broken in various ways.  I think we need a
> goal to work towards which will eventually remove all breakage and still
> have room for expansion.  I think that must include:
> 
> - providing as-unique-as-practical inode numbers across the whole
>   filesystem, and deprecating the internal use of different device
>   numbers.  Make it possible to mount without them ASAP, and aim to
>   make that the default eventually.
> - working with user-space tool/library developers to use
>   name_to_handle_at() to identify inodes, only using st_ino
>   as a fall-back
> - adding filehandles to various /proc etc files as needed, either
>   duplicating lines or duplicating files.  And helping application which
>   use these files to migrate (I would *NOT* change the dev numbers in
>   the current file to report the internal btrfs dev numbers the way that
>   SUSE does.  I would prefer that current breakage could be used to
>   motivate developers towards depending instead on fhandles).
> - exporting subtree (aka subvol) id to user-space, possibly paralleling
>   proj_id in some way, and extending various tools to understand
>   subtrees
> 
> Who's with me??

Cf. "Poe Law"...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* A Third perspective on BTRFS nfsd subvol dev/inode number issues.
  2021-07-30  7:59               ` Miklos Szeredi
@ 2021-08-02  4:18                 ` NeilBrown
  2021-08-02  5:25                   ` Al Viro
                                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: NeilBrown @ 2021-08-02  4:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Al Viro, Christoph Hellwig, Josef Bacik, J. Bruce Fields,
	Chuck Lever, Chris Mason, David Sterba, linux-fsdevel,
	Linux NFS list, Btrfs BTRFS

On Fri, 30 Jul 2021, Miklos Szeredi wrote:
> On Fri, 30 Jul 2021 at 09:34, NeilBrown <neilb@suse.de> wrote:
> 
> > But I'm curious about your reference to "some sort of subvolume
> > structure that the VFS knows about".  Do you have any references, or can
> > you suggest a search term I could try?
> 
> Found this:
> https://lore.kernel.org/linux-fsdevel/20180508180436.716-1-mfasheh@suse.de/
> 

Excellent, thanks.  Very useful.

OK.  Time for a third perspective.

With its current framing the problem is unsolvable.  So it needs to be
reframed.

By "current framing", I mean that we are trying to get btrfs to behave
in a way that meets current user-space expectations.  Specially, the
expectation that each object in any filesystem can be uniquely
identified by a 64bit inode number.  btrfs provides functionality which
needs more than 64bits.  So it simple does not fit.  btrfs currently
fudges with device numbers to hide the problem.  This is at best an
incomplete solution, and is already being shown to be insufficient.

Therefore we need to change user-space expectations.  This has been done
before multiple times - often by breaking things and leaving it up to
user-space to fix it.  My favourite example is that NFSv2 broke the
creation of lock files with O_CREAT|O_EXCL.  USER-space starting using
hard-links to achieve the same result.  When NFSv3 added reliable
O_CREAT|O_EXCL support, it hardly mattered.... but I digress.

It think we need to bite-the-bullet and decide that 64bits is not
enough, and in fact no number of bits will ever be enough.  overlayfs
makes this clear.  overlayfs merges multiple filesystems, and so needs
strictly more bits to uniquely identify all inodes than any of the
filesystems use.  Currently it over-loads the high bits and hopes the
filesystem doesn't use them.

The "obvious" choice for a replacement is the file handle provided by
name_to_handle_at() (falling back to st_ino if name_to_handle_at isn't
supported by the filesystem).  This returns an extensible opaque
byte-array.  It is *already* more reliable than st_ino.  Comparing
st_ino is only a reliable way to check if two files are the same if you
have both of them open.  If you don't, then one of the files might have
been deleted and the inode number reused for the other.  A filehandle
contains a generation number which protects against this.

So I think we need to strongly encourage user-space to start using
name_to_handle_at() whenever there is a need to test if two things are
the same.

This frees us to be a little less precise about assuring st_ino is
always unique, but only a little.  We still want to minimize conflicts
and avoid them in common situations.

A filehandle typically has some bytes used to locate the inode -
"location" - and some to validate it - "generation".  In general, st_ino
must now be seen as a hash of the "location".  It could be a generic hash
(xxhash? jhash?) or it could be a careful xor of the bits.

For btrfs, the "location" is root.objectid ++ file.objectid.  I think
the inode should become (file.objectid ^ swab64(root.objectid)).  This
will provide numbers that are unique until you get very large subvols,
and very many subvols.  It also ensures that two inodes in the same
subvol will be guaranteed to have different st_ino.

This will quickly cause problems for overlayfs as it means that if btrfs
is used with overlayfs, the top few bits won't be zero.  Possibly btrfs
could be polite and shift the swab64(root.objectid) down 8 bits to make
room.  Possible overlayfs should handle this case (top N-bits not all
zero), and switch to a generic hash of the inode number (or preferably
the filehandle) to (64-N bits).

If we convince user-space to use filehandles to compare objects, the NFS
problems I initially was trying to address go away.  Even before that,
if btrfs switches to a hashed (i.e. xor) inode number, then the problems
also go away.  but they aren't the only problems here.

Accessing the fhandle isn't always possible.  For example reading
/proc/locks reports major:minor:inode-number for each file (This is the
major:minor from the superblock, so btrfs internal dev numbers aren't
used).  The filehandle is simply not available.  I think the only way
to address this is to create a new file. "/proc/locks2" :-)
Similarly the "lock:" lines in /proc/$PID/fdinfo/$FD need to be duplicated
as "lock2:" lines with filehandle instead of inode number.  Ditto for
'inotify:' lines and possibly others.

Similarly /proc/$PID/maps contains the inode number with no fhandle.
The situation isn't so bad there as there is a path name, and you can
even call name_to_handle_at("/proc/$PID/map_files/$RANGE") to get the
fhandle.  It might be better to provide a new file though.

Next we come to the use of different device numbers in the one btrfs
filesystem.  I'm of the opinion that this was an unfortunately choice
that we need to deprecate.  Tools that use fhandle won't need it to
differentiate inodes, but there is more to the story than just that
need.

As has been mentioned, people depend on "du -x" and "find -mount" (aka
"-xdev") to stay within a "subvol".  We need to provide a clean
alternate before discouraging that usage.

xfs, ext4, fuse, and f2fs each (can) maintain a "project id" for each
inode, which effectively groups inodes into a tree.  This is used for
project quotas.  At one level this is conceptually very similar to the
btrfs subtree.root.objectid.  It is different in that it is only 32 bits
(:-[) and is mapped between user name-spaces like uids and gids.  It is
similar in that it identifies a group of inodes that are accounted
together and are (generally) contiguous in a tree.

If we encouraged "du" to have a "--proj" option (-j) which stays within
a project, and gave a similar option to find, that could be broadly
useful.  Then if btrfs provided the subvol objectid as fsx_projid
(available in FS_IOC_FSGETXATTR ioctl), then "du --proj" on btrfs would
stay in a subvol.  Longer term it might make sense to add a 64bit
project-id to statx.  I don't think it would make sense for btrfs to
have a 'project' concept that is different from the "subvolume".

It would be cool if "df" could have a "--proj" (or similar) flag so that
it would report the usage of a "subtree" (given a path).  Unfortunately
there isn't really an interface for this.  Going through the quota
system might be nice, I don't think it would work.

Another thought about btrfs device numbers is that, providing inode
numbers are (nearly) unique, we don't really need more than 2.  A btrfs
filesystem could allocate 2 anon device numbers.  One would be assigned
to the root, and each subvolume would get whichever device number its
parent doesn't have.  This would stop "du -x" and "find -mount" and
similar from crossing into subvols.  There could be a mount option to
select between "1", "2", and "many" device numbers for a filesystem.

- I note that cephfs place games with st_dev too....  I wonder if we can
  learn anything from that. 
- audit uses sb->s_dev without asking the filesystem.  So it won't
  handle  btrfs correctly.  I wonder if there is room for it to use
  file handles.

I accept that I'm proposing some BIG changes here, and they might break
things.  But btrfs is already broken in various ways.  I think we need a
goal to work towards which will eventually remove all breakage and still
have room for expansion.  I think that must include:

- providing as-unique-as-practical inode numbers across the whole
  filesystem, and deprecating the internal use of different device
  numbers.  Make it possible to mount without them ASAP, and aim to
  make that the default eventually.
- working with user-space tool/library developers to use
  name_to_handle_at() to identify inodes, only using st_ino
  as a fall-back
- adding filehandles to various /proc etc files as needed, either
  duplicating lines or duplicating files.  And helping application which
  use these files to migrate (I would *NOT* change the dev numbers in
  the current file to report the internal btrfs dev numbers the way that
  SUSE does.  I would prefer that current breakage could be used to
  motivate developers towards depending instead on fhandles).
- exporting subtree (aka subvol) id to user-space, possibly paralleling
  proj_id in some way, and extending various tools to understand
  subtrees

Who's with me??

NeilBrown

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-08-03 22:30 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-02  9:11 A Third perspective on BTRFS nfsd subvol dev/inode number issues Forza
2021-08-02 21:50 ` NeilBrown
  -- strict thread matches above, loose matches on Subject: below --
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-07-27 22:37 ` [PATCH 01/11] VFS: show correct dev num in mountinfo NeilBrown
2021-07-30  0:25   ` Al Viro
2021-07-30  5:28     ` NeilBrown
2021-07-30  5:54       ` Miklos Szeredi
2021-07-30  6:13         ` NeilBrown
2021-07-30  7:18           ` Miklos Szeredi
2021-07-30  7:33             ` NeilBrown
2021-07-30  7:59               ` Miklos Szeredi
2021-08-02  4:18                 ` A Third perspective on BTRFS nfsd subvol dev/inode number issues NeilBrown
2021-08-02  5:25                   ` Al Viro
2021-08-02  5:40                     ` NeilBrown
2021-08-02  7:54                       ` Amir Goldstein
2021-08-02 13:53                         ` Josef Bacik
2021-08-03 22:29                           ` Qu Wenruo
2021-08-02 14:47                         ` Frank Filz
2021-08-02 21:24                         ` NeilBrown
2021-08-02  7:15                   ` Martin Steigerwald
2021-08-02 21:40                     ` NeilBrown
2021-08-02 12:39                   ` J. Bruce Fields
2021-08-02 20:32                     ` Patrick Goetz
2021-08-02 20:41                       ` J. Bruce Fields
2021-08-02 21:10                     ` NeilBrown
2021-08-02 21:50                       ` J. Bruce Fields
2021-08-02 21:59                         ` NeilBrown
2021-08-02 22:14                           ` J. Bruce Fields
2021-08-02 22:36                             ` NeilBrown
2021-08-03  0:15                               ` J. Bruce Fields

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).