* nfs subvolume access?
@ 2021-03-10  7:46 ` Ulli Horlacher
  2021-03-10  7:59   ` Hugo Mills
                     ` (3 more replies)
  0 siblings, 4 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-10  7:46 UTC (permalink / raw)
  To: linux-btrfs

When I try to access a btrfs filesystem via nfs, I get the error:

root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.
1
root@tsmsrvi:~# 



On tsmsrvj I have in /etc/exports:

/data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash)

This is a btrfs subvolume with snapshots:

root@tsmsrvj:~# btrfs subvolume list /data
ID 257 gen 35 top level 5 path fex
ID 270 gen 36 top level 257 path fex/spool
ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test

root@tsmsrvj:~# find /data/fex | wc -l
489887
root@tsmsrvj:~# 

What must I add to /etc/exports to enable subvolume access for the nfs
client?

tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
btrfs-progs v5.4.1 

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20210310074620.GA2158@tik.uni-stuttgart.de>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: nfs subvolume access?
  2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
@ 2021-03-10  7:59   ` Hugo Mills
  2021-03-10  8:09     ` Ulli Horlacher
  2021-03-10  8:17   ` Ulli Horlacher
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 94+ messages in thread
From: Hugo Mills @ 2021-03-10  7:59 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Mar 10, 2021 at 08:46:20AM +0100, Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.
> 1
> root@tsmsrvi:~# 
> 
> 
> 
> On tsmsrvj I have in /etc/exports:
> 
> /data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash)
> 
> This is a btrfs subvolume with snapshots:
> 
> root@tsmsrvj:~# btrfs subvolume list /data
> ID 257 gen 35 top level 5 path fex
> ID 270 gen 36 top level 257 path fex/spool
> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> 
> root@tsmsrvj:~# find /data/fex | wc -l
> 489887
> root@tsmsrvj:~# 
> 
> What must I add to /etc/exports to enable subvolume access for the nfs
> client?
> 
> tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
> btrfs-progs v5.4.1 

   I can't remember if this is why, but I've had to put a distinct
fsid field in each separate subvolume being exported:

/srv/nfs/home     -rw,async,fsid=0x1730,no_subtree_check,no_root_squash

   It doesn't matter what value you use, as long as each one's
different.

   Hugo.

-- 
Hugo Mills             | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: nfs subvolume access?
  2021-03-10  7:59   ` Hugo Mills
@ 2021-03-10  8:09     ` Ulli Horlacher
  2021-03-10  9:35       ` Graham Cobb
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-10  8:09 UTC (permalink / raw)
  To: linux-btrfs

On Wed 2021-03-10 (07:59), Hugo Mills wrote:

> > On tsmsrvj I have in /etc/exports:
> > 
> > /data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash)
> > 
> > This is a btrfs subvolume with snapshots:
> > 
> > root@tsmsrvj:~# btrfs subvolume list /data
> > ID 257 gen 35 top level 5 path fex
> > ID 270 gen 36 top level 257 path fex/spool
> > ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> > ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> > ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> > ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> > 
> > root@tsmsrvj:~# find /data/fex | wc -l
> > 489887

>    I can't remember if this is why, but I've had to put a distinct
> fsid field in each separate subvolume being exported:
> 
> /srv/nfs/home     -rw,async,fsid=0x1730,no_subtree_check,no_root_squash

I must export EACH subvolume?!
The snapshots are generated automatically (via cron)!
I cannot add them to /etc/exports


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20210310075957.GG22502@savella.carfax.org.uk>


* Re: nfs subvolume access?
  2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
  2021-03-10  7:59   ` Hugo Mills
@ 2021-03-10  8:17   ` Ulli Horlacher
  2021-03-11  7:46   ` Ulli Horlacher
       [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
  3 siblings, 0 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-10  8:17 UTC (permalink / raw)
  To: linux-btrfs

On Wed 2021-03-10 (08:46), Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.
> 1

> tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
> btrfs-progs v5.4.1 

On Ubuntu 18.04 this setup works without errors:

root@mutter:/backup/rsync# grep tandem /etc/exports 
/backup/rsync/tandem            176.9.135.138(rw,async,no_subtree_check,no_root_squash)

root@mutter:/backup/rsync# btrfs subvolume list /backup/rsync | grep tandem
ID 257 gen 62652 top level 5 path tandem
ID 5898 gen 62284 top level 257 path tandem/.snapshot/2021-03-01_0300.rsync
ID 5906 gen 62284 top level 257 path tandem/.snapshot/2021-03-02_0300.rsync
ID 5914 gen 62284 top level 257 path tandem/.snapshot/2021-03-03_0300.rsync
ID 5924 gen 62284 top level 257 path tandem/.snapshot/2021-03-04_0300.rsync
ID 5932 gen 62284 top level 257 path tandem/.snapshot/2021-03-05_0300.rsync
ID 5941 gen 62284 top level 257 path tandem/.snapshot/2021-03-06_0300.rsync
ID 5950 gen 62284 top level 257 path tandem/.snapshot/2021-03-07_0300.rsync
ID 5962 gen 62413 top level 257 path tandem/.snapshot/2021-03-08_0300.rsync
ID 5970 gen 62522 top level 257 path tandem/.snapshot/2021-03-09_0300.rsync
ID 5978 gen 62626 top level 257 path tandem/.snapshot/2021-03-10_0300.rsync

root@mutter:/backup/rsync# btrfs version
btrfs-progs v4.15.1

root@tandem:/backup# mount | grep backup
mutter:/backup/rsync/tandem on /backup type nfs (ro,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=1,sec=sys,mountaddr=176.9.68.251,mountvers=3,mountport=52943,mountproto=tcp,local_lock=none,addr=176.9.68.251)

root@tandem:/backup# ls -l .snapshot/
total 0
drwxr-xr-x 1 root root 392 Mar  1 03:00 2021-03-01_0300.rsync
drwxr-xr-x 1 root root 392 Mar  2 03:00 2021-03-02_0300.rsync
drwxr-xr-x 1 root root 392 Mar  3 03:00 2021-03-03_0300.rsync
drwxr-xr-x 1 root root 392 Mar  4 03:00 2021-03-04_0300.rsync
drwxr-xr-x 1 root root 392 Mar  5 03:00 2021-03-05_0300.rsync
drwxr-xr-x 1 root root 392 Mar  6 03:00 2021-03-06_0300.rsync
drwxr-xr-x 1 root root 392 Mar  7 03:00 2021-03-07_0300.rsync
drwxr-xr-x 1 root root 392 Mar  8 03:00 2021-03-08_0300.rsync
drwxr-xr-x 1 root root 392 Mar  9 03:00 2021-03-09_0300.rsync
drwxr-xr-x 1 root root 392 Mar 10 03:00 2021-03-10_0300.rsync

So, is it an issue with the newer btrfs version on Ubuntu 20.04?


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20210310074620.GA2158@tik.uni-stuttgart.de>


* Re: Re: nfs subvolume access?
  2021-03-10  8:09     ` Ulli Horlacher
@ 2021-03-10  9:35       ` Graham Cobb
  2021-03-10 15:55         ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Graham Cobb @ 2021-03-10  9:35 UTC (permalink / raw)
  To: linux-btrfs

On 10/03/2021 08:09, Ulli Horlacher wrote:
> On Wed 2021-03-10 (07:59), Hugo Mills wrote:
> 
>>> On tsmsrvj I have in /etc/exports:
>>>
>>> /data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash)
>>>
>>> This is a btrfs subvolume with snapshots:
>>>
>>> root@tsmsrvj:~# btrfs subvolume list /data
>>> ID 257 gen 35 top level 5 path fex
>>> ID 270 gen 36 top level 257 path fex/spool
>>> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
>>> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
>>> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
>>> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
>>>
>>> root@tsmsrvj:~# find /data/fex | wc -l
>>> 489887
> 
>>    I can't remember if this is why, but I've had to put a distinct
>> fsid field in each separate subvolume being exported:
>>
>> /srv/nfs/home     -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
> 
> I must export EACH subvolume?!

I have had similar problems. I *think* the current case is that modern
NFS, using NFS V4, can cope with the whole disk being accessible without
giving each subvolume its own FSID (which I have stopped doing).

HOWEVER, I think that find (and anything else which uses fsids and inode
numbers) will see subvolumes as having duplicated inodes.

> The snapshots are generated automatically (via cron)!
> I cannot add them to /etc/exports

Well, you could write some scripts... but I don't think it is necessary.
I *think* it is only necessary if you want `find` to be able to cross
between subvolumes on the NFS mounted disks.

However, I am NOT an NFS expert, nor have I done a lot of work on this.
I might be wrong. But I do NFS mount my snapshots disk remotely and use
it. And I do see occasional complaints from find, but I live with it.
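The "write some scripts" route could be sketched roughly like this. This is a minimal sketch, not a tested tool: it parses `btrfs subvolume list` output and emits one export line per subvolume, and using the subvolume ID itself as the distinct fsid value is an assumption (any value works as long as each export's fsid is different). The client name and options are just the ones from this thread, and `exports_for_subvols` is a hypothetical helper.

```python
# Sketch only: emit one export line per subvolume so each gets a
# distinct fsid.  Using the subvolume ID as the fsid value is an
# assumption; any value works as long as each export's is different.
def exports_for_subvols(listing: str, mount_base: str,
                        client: str, opts: str) -> list[str]:
    """listing is the text output of `btrfs subvolume list <mount_base>`."""
    exports = []
    for line in listing.strip().splitlines():
        fields = line.split()   # e.g. ID 270 gen 36 top level 257 path fex/spool
        subvol_id = int(fields[1])
        path = fields[-1]
        exports.append(f"{mount_base}/{path}\t{client}({opts},fsid={hex(subvol_id)})")
    return exports

sample = """\
ID 257 gen 35 top level 5 path fex
ID 270 gen 36 top level 257 path fex/spool
"""
for entry in exports_for_subvols(sample, "/data", "tsmsrvi",
                                 "rw,async,no_subtree_check,no_root_squash"):
    print(entry)
```

A cron job that creates the snapshots could rewrite such a file (e.g. under /etc/exports.d/) and then run `exportfs -ra`.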


* Re: nfs subvolume access?
  2021-03-10  9:35       ` Graham Cobb
@ 2021-03-10 15:55         ` Ulli Horlacher
  2021-03-10 17:29           ` Forza
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-10 15:55 UTC (permalink / raw)
  To: linux-btrfs

On Wed 2021-03-10 (09:35), Graham Cobb wrote:

> >>> root@tsmsrvj:~# find /data/fex | wc -l
> >>> 489887
> > 
> >>    I can't remember if this is why, but I've had to put a distinct
> >> fsid field in each separate subvolume being exported:
> >>
> >> /srv/nfs/home     -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
> > 
> > I must export EACH subvolume?!
> 
> I have had similar problems. I *think* the current case is that modern
> NFS, using NFS V4, can cope with the whole disk being accessible without
> giving each subvolume its own FSID (which I have stopped doing).

I cannot use NFS4 (for several reasons). I must use NFS3


> > The snapshots are generated automatically (via cron)!
> > I cannot add them to /etc/exports
> 
> Well, you could write some scripts... but I don't think it is necessary.
> I *think* it is only necessary if you want `find` to be able to cross
> between subvolumes on the NFS mounted disks.

It is not only a find problem:

root@fex:/nfs/tsmsrvj/fex# ls -R
:
spool
ls: ./spool: not listing already-listed directory


And as I wrote: there is no such problem with Ubuntu 18.04!
So, is it a btrfs or an nfs bug?


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<5bded122-8adf-e5e7-dceb-37a3875f1a4b@cobb.uk.net>


* Re: nfs subvolume access?
  2021-03-10 15:55         ` Ulli Horlacher
@ 2021-03-10 17:29           ` Forza
  2021-03-10 17:46             ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Forza @ 2021-03-10 17:29 UTC (permalink / raw)
  To: Ulli Horlacher, linux-btrfs



---- From: Ulli Horlacher <framstag@rus.uni-stuttgart.de> -- Sent: 2021-03-10 - 16:55 ----

> On Wed 2021-03-10 (09:35), Graham Cobb wrote:
> 
>> >>> root@tsmsrvj:~# find /data/fex | wc -l
>> >>> 489887
>> > 
>> >>    I can't remember if this is why, but I've had to put a distinct
>> >> fsid field in each separate subvolume being exported:
>> >>
>> >> /srv/nfs/home     -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
>> > 
>> > I must export EACH subvolume?!
>> 
>> I have had similar problems. I *think* the current case is that modern
>> NFS, using NFS V4, can cope with the whole disk being accessible without
>> giving each subvolume its own FSID (which I have stopped doing).
> 
> I cannot use NFS4 (for several reasons). I must use NFS3
> 
> 
>> > The snapshots are generated automatically (via cron)!
>> > I cannot add them to /etc/exports
>> 
>> Well, you could write some scripts... but I don't think it is necessary.
>> I *think* it is only necessary if you want `find` to be able to cross
>> between subvolumes on the NFS mounted disks.
> 
> It is not only a find problem:
> 
> root@fex:/nfs/tsmsrvj/fex# ls -R
> :
> spool
> ls: ./spool: not listing already-listed directory
> 
> 
> And as I wrote: there is no such problem with Ubuntu 18.04!
> So, is it a btrfs or an nfs bug?
> 
>

Did you try the fsid on the export? (not separate exports for all subvols) Without it, the NFS server tries to derive one from the filesystem itself, which can cause weird issues. It is good practice to always use an fsid on all exports in any case.

At least with the NFSv4 server on my Ubuntu NFS servers at work, there are no issues with subvols for clients that mount with vers=3.

You may want to enable debug logging on your server. https://wiki.tnonline.net/w/Blog/NFS_Server_Logging

/Forza



* Re: nfs subvolume access?
  2021-03-10 17:29           ` Forza
@ 2021-03-10 17:46             ` Ulli Horlacher
  0 siblings, 0 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-10 17:46 UTC (permalink / raw)
  To: linux-btrfs

On Wed 2021-03-10 (18:29), Forza wrote:

> Did you try the fsid on the export?

Yes:

root@tsmsrvj:/etc# grep tsm exports 
/data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=0x0011)

root@tsmsrvj:/etc# exportfs -va
exporting fex.rus.uni-stuttgart.de:/data/fex
exporting tsmsrvi.rus.uni-stuttgart.de:/data/fex


root@tsmsrvi:~# umount /nfs/tsmsrvj/fex

root@tsmsrvi:~# mount -o nfsvers=3,proto=tcp tsmsrvj:/data/fex /nfs/tsmsrvj/fex

root@tsmsrvi:~# find /nfs/tsmsrvj/fex
/nfs/tsmsrvj/fex
find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.



> You may want to enable debug logging on your server.
> https://wiki.tnonline.net/w/Blog/NFS_Server_Logging

root@tsmsrvj:/etc# rpcdebug -m nfsd all
nfsd       sock fh export svc proc fileop auth repcache xdr lockd

root@tsmsrvj:/var/log# tailf kern.log
2021-03-10 18:45:17 [106259.649850] nfsd_dispatch: vers 3 proc 1
2021-03-10 18:45:17 [106259.649854] nfsd: GETATTR(3)  8: 00010001 00000011 00000000 00000000 00000000 00000000
2021-03-10 18:45:17 [106259.649856] nfsd: fh_verify(8: 00010001 00000011 00000000 00000000 00000000 00000000)
2021-03-10 18:45:17 [106259.650306] nfsd_dispatch: vers 3 proc 4
2021-03-10 18:45:17 [106259.650310] nfsd: ACCESS(3)   8: 00010001 00000011 00000000 00000000 00000000 00000000 0x1f
2021-03-10 18:45:17 [106259.650313] nfsd: fh_verify(8: 00010001 00000011 00000000 00000000 00000000 00000000)
2021-03-10 18:45:17 [106259.650869] nfsd_dispatch: vers 3 proc 17
2021-03-10 18:45:17 [106259.650874] nfsd: READDIR+(3) 8: 00010001 00000011 00000000 00000000 00000000 00000000 32768 bytes at 0
2021-03-10 18:45:17 [106259.650877] nfsd: fh_verify(8: 00010001 00000011 00000000 00000000 00000000 00000000)
2021-03-10 18:45:17 [106259.650883] nfsd: fh_verify(8: 00010001 00000011 00000000 00000000 00000000 00000000)
2021-03-10 18:45:17 [106259.650903] nfsd: fh_compose(exp 00:31/256 /fex, ino=256)
2021-03-10 18:45:17 [106259.650907] nfsd: fh_compose(exp 00:31/256 /, ino=256)
2021-03-10 18:45:17 [106259.651454] nfsd_dispatch: vers 3 proc 3
2021-03-10 18:45:17 [106259.651459] nfsd: LOOKUP(3)   8: 00010001 00000011 00000000 00000000 00000000 00000000 spool
2021-03-10 18:45:17 [106259.651463] nfsd: fh_verify(8: 00010001 00000011 00000000 00000000 00000000 00000000)
2021-03-10 18:45:17 [106259.651471] nfsd: nfsd_lookup(fh 8: 00010001 00000011 00000000 00000000 00000000 00000000, spool)
2021-03-10 18:45:17 [106259.651477] nfsd: fh_compose(exp 00:31/256 fex/spool, ino=256)

Hmmm... and now?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<55bb7f3.9ce44d1.1781d2fedd6@tnonline.net>


* Re: nfs subvolume access?
  2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
  2021-03-10  7:59   ` Hugo Mills
  2021-03-10  8:17   ` Ulli Horlacher
@ 2021-03-11  7:46   ` Ulli Horlacher
  2021-07-08 22:17     ` cannot use btrfs for nfs server Ulli Horlacher
       [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
  3 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-03-11  7:46 UTC (permalink / raw)
  To: linux-btrfs

On Wed 2021-03-10 (08:46), Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.

It is even worse:

root@tsmsrvj:# grep localhost /etc/exports
/data/fex       localhost(rw,async,no_subtree_check,no_root_squash)

root@tsmsrvj:# mount localhost:/data/fex /nfs/localhost/fex

root@tsmsrvj:# du -s /data/fex
64282240        /data/fex

root@tsmsrvj:# du -s /nfs/localhost/fex
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/fex/spool

0       /nfs/localhost/fex

root@tsmsrvj:# btrfs subvolume list /data
ID 257 gen 42 top level 5 path fex
ID 270 gen 42 top level 257 path fex/spool
ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test

root@tsmsrvj:# uname -a
Linux tsmsrvj 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@tsmsrvj:# btrfs version
btrfs-progs v5.4.1

root@tsmsrvj:# dpkg -l | grep nfs-
ii  nfs-common                             1:1.3.4-2.5ubuntu3.3              amd64        NFS support files common to client and server
ii  nfs-kernel-server                      1:1.3.4-2.5ubuntu3.3              amd64        support for NFS kernel server

The same bug appears if nfs server and client are different hosts or the
client is an older Ubuntu 18.04 system.


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20210310074620.GA2158@tik.uni-stuttgart.de>


* any idea about auto export multiple btrfs snapshots?
@ 2021-06-13  3:53 Wang Yugui
  2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
  2021-06-14 22:50 ` any idea about auto export multiple btrfs snapshots? NeilBrown
  0 siblings, 2 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-13  3:53 UTC (permalink / raw)
  To: linux-nfs; +Cc: neilb

Hi,

Any idea about auto export multiple btrfs snapshots?

One related patch has not yet been merged into nfs-utils 2.5.3.
From:   "NeilBrown" <neilb@suse.de>
Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.

In this patch, a UUID is auto-generated when a tmpfs has no UUID.

For btrfs, multiple subvolume snapshots have the same filesystem UUID.
Could we generate a UUID for a btrfs subvol from 'filesystem UUID' + 'subvol ID'?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/13




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-13  3:53 any idea about auto export multiple btrfs snapshots? Wang Yugui
  2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
@ 2021-06-14 22:50 ` NeilBrown
  2021-06-15 15:13   ` Wang Yugui
  2021-06-17  2:15   ` Wang Yugui
  1 sibling, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-14 22:50 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Sun, 13 Jun 2021, Wang Yugui wrote:
> Hi,
> 
> Any idea about auto export multiple btrfs snapshots?
> 
> One related patch has not yet been merged into nfs-utils 2.5.3.
> From:   "NeilBrown" <neilb@suse.de>
> Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> 
> In this patch, a UUID is auto-generated when a tmpfs has no UUID.
> 
> For btrfs, multiple subvolume snapshots have the same filesystem UUID.
> Could we generate a UUID for a btrfs subvol from 'filesystem UUID' + 'subvol ID'?

You really need to ask this question of btrfs developers.  'mountd'
already has a special-case exception for btrfs, to prefer the uuid
provided by statfs64() rather than the uuid extracted from the block
device.  It would be quite easy to add another exception.
But it would only be reasonable to do that if the btrfs team told us how
they wanted us to generate a UUID for a given mount point, and promised
that it would always provide a unique, stable result.

This is completely separate from the tmpfs patch you identified.

NeilBrown


> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/06/13
> 
> 
> 


* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-14 22:50 ` any idea about auto export multiple btrfs snapshots? NeilBrown
@ 2021-06-15 15:13   ` Wang Yugui
  2021-06-15 15:41     ` Wang Yugui
                       ` (2 more replies)
  2021-06-17  2:15   ` Wang Yugui
  1 sibling, 3 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-15 15:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

Hi, NeilBrown

> On Sun, 13 Jun 2021, Wang Yugui wrote:
> > Hi,
> > 
> > Any idea about auto export multiple btrfs snapshots?
> > 
> > One related patch has not yet been merged into nfs-utils 2.5.3.
> > From:   "NeilBrown" <neilb@suse.de>
> > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > 
> > In this patch, a UUID is auto-generated when a tmpfs has no UUID.
> > 
> > For btrfs, multiple subvolume snapshots have the same filesystem UUID.
> > Could we generate a UUID for a btrfs subvol from 'filesystem UUID' + 'subvol ID'?
> 
> You really need to ask this question of btrfs developers.  'mountd'
> already has a special-case exception for btrfs, to prefer the uuid
> provided by statfs64() rather than the uuid extracted from the block
> device.  It would be quite easy to add another exception.
> But it would only be reasonable to do that if the btrfs team told us how
> they wanted us to generate a UUID for a given mount point, and promised
> that it would always provide a unique, stable result.
> This is completely separate from the tmpfs patch you identified.

Thanks a lot for the reply.

Now btrfs statfs64() returns an 8-byte unique/stable result.

It is based on two parts:
1) the 16-byte blkid of the filesystem; this is unique/stable between btrfs filesystems.
2) the 8-byte btrfs subvolume objectid; this is unique/stable inside a
btrfs filesystem.

The code in linux/fs/btrfs:
static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)

    /* We treat it as constant endianness (it doesn't matter _which_)
       because we want the fsid to come out the same whether mounted
       on a big-endian or little-endian host */
    buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
    buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
    /* Mask in the root object ID too, to disambiguate subvols */
    buf->f_fsid.val[0] ^=
        BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
    buf->f_fsid.val[1] ^=
        BTRFS_I(d_inode(dentry))->root->root_key.objectid;


For nfs, we need a 16-byte UUID now.

The best way I thought of:
take the 16-byte blkid and arithmetically add the 8-byte btrfs subvolume objectid;
but there is not yet a simple/easy way to get the raw value of the btrfs
subvolume objectid.

A simple but good enough way:
1) the first 8 bytes are copied from the blkid;
2) the second 8 bytes are copied from btrfs_statfs();
	this keeps the uniqueness/stability of multiple subvolumes inside a btrfs filesystem.
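A rough sketch of that scheme follows. The filesystem UUID below is made up; the f_fsid arithmetic mirrors the btrfs_statfs() excerpt above, assuming the blkid is read as four big-endian 32-bit words, and `subvol_uuid16` is a hypothetical name for the proposed construction.

```python
# Sketch of the proposed 16-byte UUID for a btrfs subvol:
# first 8 bytes copied from the filesystem blkid, second 8 bytes
# taken from the statfs f_fsid (math mirroring btrfs_statfs()).
def btrfs_f_fsid(fs_uuid: str, objectid: int):
    """Return (val[0], val[1]) as btrfs_statfs() computes them."""
    h = fs_uuid.replace("-", "")
    w = [int(h[i * 8:(i + 1) * 8], 16) for i in range(4)]  # four be32 words
    val0 = (w[0] ^ w[2] ^ (objectid >> 32)) & 0xffffffff
    val1 = (w[1] ^ w[3] ^ objectid) & 0xffffffff
    return val0, val1

def subvol_uuid16(fs_uuid: str, objectid: int) -> bytes:
    """First 8 bytes of the blkid + the 8-byte f_fsid."""
    blkid = bytes.fromhex(fs_uuid.replace("-", ""))
    val0, val1 = btrfs_f_fsid(fs_uuid, objectid)
    return blkid[:8] + val0.to_bytes(4, "big") + val1.to_bytes(4, "big")

# Made-up filesystem UUID; subvol IDs 257 and 270 appear in this thread.
fs_uuid = "aabbccdd-eeff-0011-2233-445566778899"
assert subvol_uuid16(fs_uuid, 257) != subvol_uuid16(fs_uuid, 270)
assert subvol_uuid16(fs_uuid, 257)[:8] == subvol_uuid16(fs_uuid, 270)[:8]
```

Subvols of one filesystem then share the first 8 bytes and differ in the last 8, which is the easy-to-diagnose property described above.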

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/15



* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-15 15:13   ` Wang Yugui
@ 2021-06-15 15:41     ` Wang Yugui
  2021-06-16  5:47     ` Wang Yugui
  2021-06-17  3:02     ` NeilBrown
  2 siblings, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-15 15:41 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

Hi, NeilBrown

> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > > 
> > > Any idea about auto export multiple btrfs snapshots?
> > > 
> > > One related patch has not yet been merged into nfs-utils 2.5.3.
> > > From:   "NeilBrown" <neilb@suse.de>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > > 
> > > In this patch, a UUID is auto-generated when a tmpfs has no UUID.
> > > 
> > > For btrfs, multiple subvolume snapshots have the same filesystem UUID.
> > > Could we generate a UUID for a btrfs subvol from 'filesystem UUID' + 'subvol ID'?
> > 
> > You really need to ask this question of btrfs developers.  'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device.  It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > they wanted us to generate a UUID for a given mount point, and promised
> > that it would always provide a unique, stable result.
> > This is completely separate from the tmpfs patch you identified.
> 
> Thanks a lot for the reply.
> 
> Now btrfs statfs64() returns an 8-byte unique/stable result.
> 
> It is based on two parts:
> 1) the 16-byte blkid of the filesystem; this is unique/stable between btrfs filesystems.
> 2) the 8-byte btrfs subvolume objectid; this is unique/stable inside a
> btrfs filesystem.
> 
> The code in linux/fs/btrfs:
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> 
>     /* We treat it as constant endianness (it doesn't matter _which_)
>        because we want the fsid to come out the same whether mounted
>        on a big-endian or little-endian host */
>     buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
>     buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
>     /* Mask in the root object ID too, to disambiguate subvols */
>     buf->f_fsid.val[0] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
>     buf->f_fsid.val[1] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> 
> 
> For nfs, we need a 16-byte UUID now.
> 
> The best way I thought of:
> take the 16-byte blkid and arithmetically add the 8-byte btrfs subvolume objectid;
> but there is not yet a simple/easy way to get the raw value of the btrfs
> subvolume objectid.
> 
> A simple but good enough way:
> 1) the first 8 bytes are copied from the blkid;
> 2) the second 8 bytes are copied from btrfs_statfs();
> 	this keeps the uniqueness/stability of multiple subvolumes inside a btrfs filesystem.

By the way, a random 16-byte UUID still has only a very small chance of
conflict.

Could we keep the first 4 bytes of the nfs/tmpfs UUID always ZERO?
Zeroing the first 4 bytes would confine conflicts to nfs/tmpfs, and it is
easy to diagnose.

Here we use the same first 8 bytes for the UUID of btrfs and nfs/btrfs,
so it is easy to diagnose too.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/15




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-15 15:13   ` Wang Yugui
  2021-06-15 15:41     ` Wang Yugui
@ 2021-06-16  5:47     ` Wang Yugui
  2021-06-17  3:02     ` NeilBrown
  2 siblings, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-16  5:47 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

Hi,

> Hi, NeilBrown
> 
> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > > 
> > > Any idea about auto export multiple btrfs snapshots?
> > > 
> > > One related patch has not yet been merged into nfs-utils 2.5.3.
> > > From:   "NeilBrown" <neilb@suse.de>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > > 
> > > In this patch, a UUID is auto-generated when a tmpfs has no UUID.
> > > 
> > > For btrfs, multiple subvolume snapshots have the same filesystem UUID.
> > > Could we generate a UUID for a btrfs subvol from 'filesystem UUID' + 'subvol ID'?
> > 
> > You really need to ask this question of btrfs developers.  'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device.  It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > they wanted us to generate a UUID for a given mount point, and promised
> > that it would always provide a unique, stable result.
> > This is completely separate from the tmpfs patch you identified.
> 
> Thanks a lot for the reply.
> 
> Now btrfs statfs64() returns an 8-byte unique/stable result.
> 
> It is based on two parts:
> 1) the 16-byte blkid of the filesystem; this is unique/stable between btrfs filesystems.
> 2) the 8-byte btrfs subvolume objectid; this is unique/stable inside a
> btrfs filesystem.
> 
> The code in linux/fs/btrfs:
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> 
>     /* We treat it as constant endianness (it doesn't matter _which_)
>        because we want the fsid to come out the same whether mounted
>        on a big-endian or little-endian host */
>     buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
>     buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
>     /* Mask in the root object ID too, to disambiguate subvols */
>     buf->f_fsid.val[0] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
>     buf->f_fsid.val[1] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> 
> 
> for nfs, we need a 16 byte UUID now.
> 
> The best way I though:
> 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> volume objectid'.

The btrfs subvol objectid (8byte) can be extracted from the
statfs.f_fsid(8 byte) with the help of blkid(16btye) of the btrfs file
system just do a revert cacl in btrfs_statfs().

if we need 8 byte id for btrfs subvol, just use statfs.f_fsid.

if we need 16 byte id for btrfs subvol, use the blkid(16btye) of the
btrfs filesystem, plus the btrfs subvol objectid (8byte) , and keep 
the result in 16 byte.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/16



> A simple but good enough way:
> 1) first 8 byte copy from blkid
> 2) second 8 byte copy from btrfs_statfs()
> 	the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/06/15




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-14 22:50 ` any idea about auto export multiple btrfs snapshots? NeilBrown
  2021-06-15 15:13   ` Wang Yugui
@ 2021-06-17  2:15   ` Wang Yugui
  1 sibling, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-17  2:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

Hi,

nfs needs to treat btrfs subvols as different filesystems, so does nfs
need the crossmnt feature to support auto export of multiple btrfs
subvols?

It is not yet clear what prevents nfs/crossmnt from working well.

1, stat() and the result 'struct stat'
	btrfs subvol supports it well.
	multiple subvols will have different st_dev in 'struct stat'.
	/bin/find works well too.

2, statfs() and the result 'struct statfs'
	btrfs subvol supports it well.
	multiple subvols will have different f_fsid in 'struct statfs'.

3, stx_mnt_id of statx()
	btrfs subvol does NOT support it well.
	but stx_mnt_id seems not to be used yet.

4, d_mountpoint() in kernel
	d_mountpoint() seems not to support btrfs subvol,
	but we can add a dirty fix such as:

+//#define BTRFS_FIRST_FREE_OBJECTID 256ULL
+//#define BTRFS_SUPER_MAGIC    0x9123683E
+static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
+{
+    return dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
+       dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
+}

So the problem list is not yet clear.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/17

> On Sun, 13 Jun 2021, Wang Yugui wrote:
> > Hi,
> > 
> > Any idea about auto export multiple btrfs snapshots?
> > 
> > One related patch is yet not merged to nfs-utils 2.5.3.
> > From:   "NeilBrown" <neilb@suse.de>
> > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > 
> > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > 
> > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> 
> You really need to ask this question of btrfs developers.  'mountd'
> already has a special-case exception for btrfs, to prefer the uuid
> provided by statfs64() rather than the uuid extracted from the block
> device.  It would be quite easy to add another exception.
> But it would only be reasonable to do that if the btrfs team told us how
> that wanted us to generate a UUID for a given mount point, and promised
> that would always provide a unique stable result.
> 
> This is completely separate from the tmpfs patch you identified.
> 
> NeilBrown
> 
> 
> > 
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/06/13
> > 
> > 
> > 




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-15 15:13   ` Wang Yugui
  2021-06-15 15:41     ` Wang Yugui
  2021-06-16  5:47     ` Wang Yugui
@ 2021-06-17  3:02     ` NeilBrown
  2021-06-17  4:28       ` Wang Yugui
  2 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-06-17  3:02 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Wed, 16 Jun 2021, Wang Yugui wrote:
> Hi, NeilBrown
> 
> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > > 
> > > Any idea about auto export multiple btrfs snapshots?
> > > 
> > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > From:   "NeilBrown" <neilb@suse.de>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > > 
> > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > > 
> > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> > 
> > You really need to ask this question of btrfs developers.  'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device.  It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > that wanted us to generate a UUID for a given mount point, and promised
> > that would always provide a unique stable result.
> > This is completely separate from the tmpfs patch you identified.
> 
> Thanks a lot for the replay.
> 
> Now btrfs statfs64() return 8 byte unique/stable result.
> 
> It is based on two parts.
> 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> btrfs filesystem.
> 
> the code of linux/fs/btrfs
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> 
>     /* We treat it as constant endianness (it doesn't matter _which_)
>        because we want the fsid to come out the same whether mounted
>        on a big-endian or little-endian host */
>     buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
>     buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
>     /* Mask in the root object ID too, to disambiguate subvols */
>     buf->f_fsid.val[0] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
>     buf->f_fsid.val[1] ^=
>         BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> 
> 
> for nfs, we need a 16 byte UUID now.
> 
> The best way I though:
> 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> volume objectid'.

I'm a bit confused now.  You started out talking about snapshots, but
now you are talking about sub volumes.  Are they the same thing?

NFS export of btrfs sub volumes has worked for the past 10 years I
believe.

Can we go back to the beginning.  What, exactly, is the problem you are
trying to solve?  How can you demonstrate the problem?

NeilBrown


> 
> A simple but good enough way:
> 1) first 8 byte copy from blkid
> 2) second 8 byte copy from btrfs_statfs()
> 	the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/06/15
> 
> 


* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-17  3:02     ` NeilBrown
@ 2021-06-17  4:28       ` Wang Yugui
  2021-06-18  0:32         ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-17  4:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

Hi,

> On Wed, 16 Jun 2021, Wang Yugui wrote:
> > Hi, NeilBrown
> > 
> > > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > > Hi,
> > > > 
> > > > Any idea about auto export multiple btrfs snapshots?
> > > > 
> > > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > > From:   "NeilBrown" <neilb@suse.de>
> > > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > > > 
> > > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > > > 
> > > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> > > 
> > > You really need to ask this question of btrfs developers.  'mountd'
> > > already has a special-case exception for btrfs, to prefer the uuid
> > > provided by statfs64() rather than the uuid extracted from the block
> > > device.  It would be quite easy to add another exception.
> > > But it would only be reasonable to do that if the btrfs team told us how
> > > that wanted us to generate a UUID for a given mount point, and promised
> > > that would always provide a unique stable result.
> > > This is completely separate from the tmpfs patch you identified.
> > 
> > Thanks a lot for the replay.
> > 
> > Now btrfs statfs64() return 8 byte unique/stable result.
> > 
> > It is based on two parts.
> > 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> > 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> > btrfs filesystem.
> > 
> > the code of linux/fs/btrfs
> > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> > 
> >     /* We treat it as constant endianness (it doesn't matter _which_)
> >        because we want the fsid to come out the same whether mounted
> >        on a big-endian or little-endian host */
> >     buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> >     buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> >     /* Mask in the root object ID too, to disambiguate subvols */
> >     buf->f_fsid.val[0] ^=
> >         BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> >     buf->f_fsid.val[1] ^=
> >         BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > 
> > 
> > for nfs, we need a 16 byte UUID now.
> > 
> > The best way I though:
> > 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> > but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> > volume objectid'.
> 
> I'm a bit confused now.  You started out talking about snapshots, but
> now you are talking about sub volumes.  Are they the same thing?
> 
> NFS export of btrfs sub volumes has worked for the past 10 years I
> believe.
> 
> Can we go back to the beginning.  What, exactly, is the problem you are
> trying to solve?  How can you demonstrate the problem?
> 
> NeilBrown

I nfs-exported a btrfs filesystem with 2 subvols and 2 snapshots (subvols).

# btrfs subvolume list /mnt/test
ID 256 gen 53 top level 5 path sub1
ID 260 gen 56 top level 5 path sub2
ID 261 gen 57 top level 5 path .snapshot/sub1-s1
ID 262 gen 57 top level 5 path .snapshot/sub2-s1

and then mount.nfs4 it to /nfs/test.

# /bin/find /nfs/test/
/nfs/test/
find: File system loop detected; ‘/nfs/test/sub1’ is part of the same file system loop as ‘/nfs/test/’.
/nfs/test/.snapshot
find: File system loop detected; ‘/nfs/test/.snapshot/sub1-s1’ is part of the same file system loop as ‘/nfs/test/’.
find: File system loop detected; ‘/nfs/test/.snapshot/sub2-s1’ is part of the same file system loop as ‘/nfs/test/’.
/nfs/test/dir1
/nfs/test/dir1/a.txt
find: File system loop detected; ‘/nfs/test/sub2’ is part of the same file system loop as ‘/nfs/test/’

/bin/find reports 'File system loop detected', so I thought there was
something wrong.

but when I checked the file content through /mnt/test and /nfs/test,
the files at /mnt/test/xxx and /nfs/test/xxx return the same result.

I have used nfs/crossmnt before, so I thought that btrfs subvol/snapshot
support was through the 'nfs/crossmnt' feature. but in fact, it is not
through the nfs/crossmnt feature?

/bin/find reporting 'File system loop detected' means that the vfs cache
on the nfs client side will have some problem?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/17




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-17  4:28       ` Wang Yugui
@ 2021-06-18  0:32         ` NeilBrown
  2021-06-18  7:26           ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-06-18  0:32 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Thu, 17 Jun 2021, Wang Yugui wrote:
> > Can we go back to the beginning.  What, exactly, is the problem you are
> > trying to solve?  How can you demonstrate the problem?
> > 
> > NeilBrown
> 
> I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> 
> # btrfs subvolume list /mnt/test
> ID 256 gen 53 top level 5 path sub1
> ID 260 gen 56 top level 5 path sub2
> ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> 
> and then mount.nfs4 it to /nfs/test.
> 
> # /bin/find /nfs/test/
> /nfs/test/
> find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> /nfs/test/.snapshot
> find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> /nfs/test/dir1
> /nfs/test/dir1/a.txt
> find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> 
> /bin/find report 'File system loop detected'. so I though there is
> something wrong.

Certainly something is wrong.  The error message implies that some
directory is reporting the same dev and ino as an ancestor directory.
Presumably /nfs/test and /nfs/test/sub1.
Can you confirm that please. e.g. run the command

   stat /nfs/test /nfs/test/sub1

and examine the output.

As sub1 is considered a different file system, it should have a
different dev number.  NFS will assign a different device number only
when the server reports a different fsid.  The Linux NFS server will
report a different fsid if d_mountpoint() is 'true' for the dentry, and
follow_down() results in no change to the vfsmnt,dentry in a 'struct
path'.

You have already said that d_mountpoint doesn't work for btrfs, so that
is part of the problem.  NFSD doesn't trust d_mountpoint completely as
it only reports that the dentry is a mountpoint in some namespace, not
necessarily in this namespace.  So you really need to fix
nfsd_mountpoint.

I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
it reports the root of a btrfs subvol as a mountpoint, and see if that
fixes the problem.  It should change the problem at least.  You would
need to get nfsd_mountpoint() to return '1' in this case, not '2'.

NeilBrown



* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-18  0:32         ` NeilBrown
@ 2021-06-18  7:26           ` Wang Yugui
  2021-06-18 13:34             ` Wang Yugui
                               ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-18  7:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

Hi,

> On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > Can we go back to the beginning.  What, exactly, is the problem you are
> > > trying to solve?  How can you demonstrate the problem?
> > > 
> > > NeilBrown
> > 
> > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > 
> > # btrfs subvolume list /mnt/test
> > ID 256 gen 53 top level 5 path sub1
> > ID 260 gen 56 top level 5 path sub2
> > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > 
> > and then mount.nfs4 it to /nfs/test.
> > 
> > # /bin/find /nfs/test/
> > /nfs/test/
> > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > /nfs/test/.snapshot
> > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > /nfs/test/dir1
> > /nfs/test/dir1/a.txt
> > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > 
> > /bin/find report 'File system loop detected'. so I though there is
> > something wrong.
> 
> Certainly something is wrong.  The error message implies that some
> directory is reporting the same dev an ino as an ancestor directory.
> Presumably /nfs/test and /nfs/test/sub1.
> Can you confirm that please. e.g. run the command
> 
>    stat /nfs/test /nfs/test/sub1
> and examine the output.

# stat /nfs/test /nfs/test/sub1
  File: /nfs/test
  Size: 42              Blocks: 32         IO Block: 32768  directory
Device: 36h/54d Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-06-18 13:50:55.409457648 +0800
Modify: 2021-06-13 10:05:10.830825901 +0800
Change: 2021-06-13 10:05:10.830825901 +0800
 Birth: -
  File: /nfs/test/sub1
  Size: 8               Blocks: 0          IO Block: 32768  directory
Device: 36h/54d Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-06-18 13:51:14.463621411 +0800
Modify: 2021-06-12 21:59:10.598089917 +0800
Change: 2021-06-12 21:59:10.598089917 +0800
 Birth: -

the same 'Device/Inode' values are reported.


but the local btrfs mount,
# stat /mnt/test/ /mnt/test/sub1
  File: /mnt/test/
  Size: 42              Blocks: 32         IO Block: 4096   directory
Device: 33h/51d Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-06-18 13:50:55.409457648 +0800
Modify: 2021-06-13 10:05:10.830825901 +0800
Change: 2021-06-13 10:05:10.830825901 +0800
 Birth: -
  File: /mnt/test/sub1
  Size: 8               Blocks: 0          IO Block: 4096   directory
Device: 34h/52d Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-06-18 13:51:14.463621411 +0800
Modify: 2021-06-12 21:59:10.598089917 +0800
Change: 2021-06-12 21:59:10.598089917 +0800
 Birth: -

shouldn't the 'stat' command cause the nfs crossmnt to happen
automatically, and then return the 'stat' result of the submount?


> As sub1 is considered a different file system, it should have a
> different dev number.  NFS will assign a different device number only
> when the server reports a different fsid.  The Linux NFS server will
> report a different fsid if d_mountpoint() is 'true' for the dentry, and
> follow_down() results in no change the the vfsmnt,dentry in a 'struct
> path'.
> 
> You have already said that d_mountpoint doesn't work for btrfs, so that
> is part of the problem.  NFSD doesn't trust d_mountpoint completely as
> it only reports that the dentry is a mountpoint in some namespace, not
> necessarily in this namespace.  So you really need to fix
> nfsd_mountpoint.
> 
> I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> it reports the root of a btrfs subvol as a mountpoint, and see if that
> fixes the problem.  It should change the problem at least.  You would
> need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> 
> NeilBrown

I changed the return value from 2 to 1.
        if (nfsd4_is_junction(dentry))
                return 1;
+       if (is_btrfs_subvol_d(dentry))
+               return 1;
        if (d_mountpoint(dentry))

but the crossmnt still does not happen automatically.

I tried to mount the subvol manually:
# mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
mount.nfs4: Stale file handle

We added a trace to is_btrfs_subvol_d(), and it works as expected:
+static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
+{
+	bool ret = dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
+		dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
+	printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
+	return ret;
+}

It seems more fixes are needed.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/18




* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-18  7:26           ` Wang Yugui
@ 2021-06-18 13:34             ` Wang Yugui
  2021-06-19  6:47               ` Wang Yugui
  2021-06-20 12:27             ` Wang Yugui
  2021-06-21  4:52             ` NeilBrown
  2 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-18 13:34 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

Hi,

> > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > Can we go back to the beginning.  What, exactly, is the problem you are
> > > > trying to solve?  How can you demonstrate the problem?
> > > > 
> > > > NeilBrown
> > > 
> > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > > 
> > > # btrfs subvolume list /mnt/test
> > > ID 256 gen 53 top level 5 path sub1
> > > ID 260 gen 56 top level 5 path sub2
> > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > > 
> > > and then mount.nfs4 it to /nfs/test.
> > > 
> > > # /bin/find /nfs/test/
> > > /nfs/test/
> > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/.snapshot
> > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/dir1
> > > /nfs/test/dir1/a.txt
> > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > > 
> > > /bin/find report 'File system loop detected'. so I though there is
> > > something wrong.
> > 
> > Certainly something is wrong.  The error message implies that some
> > directory is reporting the same dev an ino as an ancestor directory.
> > Presumably /nfs/test and /nfs/test/sub1.
> > Can you confirm that please. e.g. run the command
> > 
> >    stat /nfs/test /nfs/test/sub1
> > and examine the output.
> 
> # stat /nfs/test /nfs/test/sub1
>   File: /nfs/test
>   Size: 42              Blocks: 32         IO Block: 32768  directory
> Device: 36h/54d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
>  Birth: -
>   File: /nfs/test/sub1
>   Size: 8               Blocks: 0          IO Block: 32768  directory
> Device: 36h/54d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
>  Birth: -
> 
> same 'Device/Inode' are reported.
> 
> 
> but the local btrfs mount,
> # stat /mnt/test/ /mnt/test/sub1
>   File: /mnt/test/
>   Size: 42              Blocks: 32         IO Block: 4096   directory
> Device: 33h/51d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
>  Birth: -
>   File: /mnt/test/sub1
>   Size: 8               Blocks: 0          IO Block: 4096   directory
> Device: 34h/52d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
>  Birth: -
> 
> 'stat' command should cause nfs/crossmnt to happen auto, and then return
> the 'stat' result?
> 
> 
> > As sub1 is considered a different file system, it should have a
> > different dev number.  NFS will assign a different device number only
> > when the server reports a different fsid.  The Linux NFS server will
> > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > follow_down() results in no change the the vfsmnt,dentry in a 'struct
> > path'.
> > 
> > You have already said that d_mountpoint doesn't work for btrfs, so that
> > is part of the problem.  NFSD doesn't trust d_mountpoint completely as
> > it only reports that the dentry is a mountpoint in some namespace, not
> > necessarily in this namespace.  So you really need to fix
> > nfsd_mountpoint.
> > 
> > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > fixes the problem.  It should change the problem at least.  You would
> > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> > 
> > NeilBrown
> 
> I changed the return value from 2 to 1.
>         if (nfsd4_is_junction(dentry))
>                 return 1;
> +       if (is_btrfs_subvol_d(dentry))
> +               return 1;
>         if (d_mountpoint(dentry))
> 
> but the crossmnt still does not happen auto.
> 
> I tried to mount the subvol manual, 
> # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> mount.nfs4: Stale file handle
> 
> we add trace to is_btrfs_subvol_d(), it works as expected.
> +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> +{
> +    bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> +		dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> +	printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> +	return ret;
> +}
> 
> It seems more fixes are needed.

for a normal crossmnt,
	/mnt/test		btrfs
	/mnt/test/xfs1		xfs
this xfs1 has 2 inodes:
1) an inode in xfs /mnt/test/xfs1, as the root.
2) an inode in btrfs /mnt/test, as a directory.
when /mnt/test/xfs1 is mounted, an nfs client with the nocrossmnt
option will show 2).

but for a btrfs subvol,
	/mnt/test		btrfs
	/mnt/test/sub1	 btrfs subvol
this sub1 has just 1 inode:
1) the inode in /mnt/test/sub1, as the root.

Does this difference break nfs support for multiple btrfs subvols?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/18



* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-18 13:34             ` Wang Yugui
@ 2021-06-19  6:47               ` Wang Yugui
  0 siblings, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-19  6:47 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

Hi,

> > > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > > Can we go back to the beginning.  What, exactly, is the problem you are
> > > > > trying to solve?  How can you demonstrate the problem?
> > > > > 
> > > > > NeilBrown
> > > > 
> > > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > > > 
> > > > # btrfs subvolume list /mnt/test
> > > > ID 256 gen 53 top level 5 path sub1
> > > > ID 260 gen 56 top level 5 path sub2
> > > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > > > 
> > > > and then mount.nfs4 it to /nfs/test.
> > > > 
> > > > # /bin/find /nfs/test/
> > > > /nfs/test/
> > > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > > /nfs/test/.snapshot
> > > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > > /nfs/test/dir1
> > > > /nfs/test/dir1/a.txt
> > > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > > > 
> > > > /bin/find report 'File system loop detected'. so I though there is
> > > > something wrong.
> > > 
> > > Certainly something is wrong.  The error message implies that some
> > > directory is reporting the same dev an ino as an ancestor directory.
> > > Presumably /nfs/test and /nfs/test/sub1.
> > > Can you confirm that please. e.g. run the command
> > > 
> > >    stat /nfs/test /nfs/test/sub1
> > > and examine the output.
> > 
> > # stat /nfs/test /nfs/test/sub1
> >   File: /nfs/test
> >   Size: 42              Blocks: 32         IO Block: 32768  directory
> > Device: 36h/54d Inode: 256         Links: 1
> > Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2021-06-18 13:50:55.409457648 +0800
> > Modify: 2021-06-13 10:05:10.830825901 +0800
> > Change: 2021-06-13 10:05:10.830825901 +0800
> >  Birth: -
> >   File: /nfs/test/sub1
> >   Size: 8               Blocks: 0          IO Block: 32768  directory
> > Device: 36h/54d Inode: 256         Links: 1
> > Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2021-06-18 13:51:14.463621411 +0800
> > Modify: 2021-06-12 21:59:10.598089917 +0800
> > Change: 2021-06-12 21:59:10.598089917 +0800
> >  Birth: -
> > 
> > same 'Device/Inode' are reported.
> > 
> > 
> > but the local btrfs mount,
> > # stat /mnt/test/ /mnt/test/sub1
> >   File: /mnt/test/
> >   Size: 42              Blocks: 32         IO Block: 4096   directory
> > Device: 33h/51d Inode: 256         Links: 1
> > Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2021-06-18 13:50:55.409457648 +0800
> > Modify: 2021-06-13 10:05:10.830825901 +0800
> > Change: 2021-06-13 10:05:10.830825901 +0800
> >  Birth: -
> >   File: /mnt/test/sub1
> >   Size: 8               Blocks: 0          IO Block: 4096   directory
> > Device: 34h/52d Inode: 256         Links: 1
> > Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2021-06-18 13:51:14.463621411 +0800
> > Modify: 2021-06-12 21:59:10.598089917 +0800
> > Change: 2021-06-12 21:59:10.598089917 +0800
> >  Birth: -
> > 
> > 'stat' command should cause nfs/crossmnt to happen auto, and then return
> > the 'stat' result?
> > 
> > 
> > > As sub1 is considered a different file system, it should have a
> > > different dev number.  NFS will assign a different device number only
> > > when the server reports a different fsid.  The Linux NFS server will
> > > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > > follow_down() results in no change the the vfsmnt,dentry in a 'struct
> > > path'.
> > > 
> > > You have already said that d_mountpoint doesn't work for btrfs, so that
> > > is part of the problem.  NFSD doesn't trust d_mountpoint completely as
> > > it only reports that the dentry is a mountpoint in some namespace, not
> > > necessarily in this namespace.  So you really need to fix
> > > nfsd_mountpoint.
> > > 
> > > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > > fixes the problem.  It should change the problem at least.  You would
> > > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> > > 
> > > NeilBrown
> > 
> > I changed the return value from 2 to 1.
> >         if (nfsd4_is_junction(dentry))
> >                 return 1;
> > +       if (is_btrfs_subvol_d(dentry))
> > +               return 1;
> >         if (d_mountpoint(dentry))
> > 
> > but the crossmnt still does not happen auto.
> > 
> > I tried to mount the subvol manual, 
> > # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> > mount.nfs4: Stale file handle
> > 
> > we add trace to is_btrfs_subvol_d(), it works as expected.
> > +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> > +{
> > +    bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> > +		dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> > +	printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> > +	return ret;
> > +}
> > 
> > It seems more fixes are needed.
> 
> for a normal crossmnt,
> 	/mnt/test		btrfs
> 	/mnt/test/xfs1		xfs
> this xfs1 has 2 inodes:
> 1) the inode in xfs at /mnt/test/xfs1, as the root.
> 2) the inode in btrfs under /mnt/test, as a directory.
> When /mnt/test/xfs1 is mounted, an nfs client with the nocrossmnt option
> shows 2).
> 
> but for a btrfs subvol,
> 	/mnt/test		btrfs
> 	/mnt/test/sub1	 btrfs subvol
> this sub1 has just 1 inode:
> 1) the inode at /mnt/test/sub1, as the root.
> 
> Does this difference break nfs support for multiple btrfs subvols?

On OS shutdown, a btrfs subvol on the nfs client is unmounted first,
so the subvol's directory entry on the client ends up in an unmounted state.

This unmounted state has no counterpart inside btrfs itself, so we could
represent it with a dummy inode value (BTRFS_LAST_FREE_OBJECTID/-256ULL).

btrfs subvol entry, mounted state:
	st_dev	: from subvol
	st_ino	: from subvol (256ULL)

btrfs subvol entry, unmounted state:
	st_dev	: from parent
	st_ino	: dummy inode (-256ULL)

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/19



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-18  7:26           ` Wang Yugui
  2021-06-18 13:34             ` Wang Yugui
@ 2021-06-20 12:27             ` Wang Yugui
  2021-06-21  4:52             ` NeilBrown
  2 siblings, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-20 12:27 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

Hi,

> It seems more fixes are needed.

When comparing a btrfs subvol with an xfs crossmnt, we found another
behavioural difference.

/mnt/test		xfs
/mnt/test/xfs2	 another xfs (crossmnt)
nfsd4_encode_dirent_fattr() reports "/mnt/test/xfs2" + "/";

but
/mnt/test		btrfs
/mnt/test/sub1	 btrfs subvol
nfsd4_encode_dirent_fattr() reports "/mnt/test/" + "sub1";

For '/mnt/test/sub1', shouldn't nfsd treat the mountpoint as
'/mnt/test/sub1' rather than '/mnt/test'?

I'm sorry that no patch is available yet; the kernel source is quite
difficult for me.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/20


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-18  7:26           ` Wang Yugui
  2021-06-18 13:34             ` Wang Yugui
  2021-06-20 12:27             ` Wang Yugui
@ 2021-06-21  4:52             ` NeilBrown
  2021-06-21  5:13               ` NeilBrown
  2021-06-21 14:35               ` Frank Filz
  2 siblings, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-21  4:52 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Fri, 18 Jun 2021, Wang Yugui wrote:
> Hi,
> 
> > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > Can we go back to the beginning.  What, exactly, is the problem you are
> > > > trying to solve?  How can you demonstrate the problem?
> > > > 
> > > > NeilBrown
> > > 
> > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > > 
> > > # btrfs subvolume list /mnt/test
> > > ID 256 gen 53 top level 5 path sub1
> > > ID 260 gen 56 top level 5 path sub2
> > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > > 
> > > and then mount.nfs4 it to /nfs/test.
> > > 
> > > # /bin/find /nfs/test/
> > > /nfs/test/
> > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/.snapshot
> > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/dir1
> > > /nfs/test/dir1/a.txt
> > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > > 
> > > /bin/find report 'File system loop detected'. so I though there is
> > > something wrong.
> > 
> > Certainly something is wrong.  The error message implies that some
> > directory is reporting the same dev and ino as an ancestor directory.
> > Presumably /nfs/test and /nfs/test/sub1.
> > Can you confirm that please. e.g. run the command
> > 
> >    stat /nfs/test /nfs/test/sub1
> > and examine the output.
> 
> # stat /nfs/test /nfs/test/sub1
>   File: /nfs/test
>   Size: 42              Blocks: 32         IO Block: 32768  directory
> Device: 36h/54d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
>  Birth: -
>   File: /nfs/test/sub1
>   Size: 8               Blocks: 0          IO Block: 32768  directory
> Device: 36h/54d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
>  Birth: -
> 
> same 'Device/Inode' are reported.
> 
> 
> but the local btrfs mount,
> # stat /mnt/test/ /mnt/test/sub1
>   File: /mnt/test/
>   Size: 42              Blocks: 32         IO Block: 4096   directory
> Device: 33h/51d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
>  Birth: -
>   File: /mnt/test/sub1
>   Size: 8               Blocks: 0          IO Block: 4096   directory
> Device: 34h/52d Inode: 256         Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
>  Birth: -
> 
> Shouldn't the 'stat' command cause the nfs crossmnt to happen
> automatically, and then return the subvol's own 'stat' result?
> 
> 
> > As sub1 is considered a different file system, it should have a
> > different dev number.  NFS will assign a different device number only
> > when the server reports a different fsid.  The Linux NFS server will
> > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > follow_down() results in no change to the vfsmnt,dentry in a 'struct
> > path'.
> > 
> > You have already said that d_mountpoint doesn't work for btrfs, so that
> > is part of the problem.  NFSD doesn't trust d_mountpoint completely as
> > it only reports that the dentry is a mountpoint in some namespace, not
> > necessarily in this namespace.  So you really need to fix
> > nfsd_mountpoint.
> > 
> > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > fixes the problem.  It should change the problem at least.  You would
> > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> > 
> > NeilBrown
> 
> I changed the return value from 2 to 1.
>         if (nfsd4_is_junction(dentry))
>                 return 1;
> +       if (is_btrfs_subvol_d(dentry))
> +               return 1;
>         if (d_mountpoint(dentry))
> 
> but the crossmnt still does not happen automatically.
> 
> I tried to mount the subvol manually:
> # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> mount.nfs4: Stale file handle
> 
> We added a trace to is_btrfs_subvol_d(); it works as expected.
> +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> +{
> +    bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> +		dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> +	printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> +	return ret;
> +}
> 
> It seems more fixes are needed.

I think the problem is that the submount doesn't appear in /proc/mounts.
"nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
filesystem to the mount point.  To do this it walks through /proc/mounts
checking the uuid of each filesystem.  If a filesystem isn't listed
there, it obviously fails.

I guess you could add code to nfs-utils to do whatever "btrfs subvol
list" does to make up for the fact that btrfs doesn't register in
/proc/mounts.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21  4:52             ` NeilBrown
@ 2021-06-21  5:13               ` NeilBrown
  2021-06-21  8:34                 ` Wang Yugui
  2021-06-21 14:35               ` Frank Filz
  1 sibling, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-06-21  5:13 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

> > It seems more fixes are needed.
> 
> I think the problem is that the submount doesn't appear in /proc/mounts.
> "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> filesystem to the mount point.  To do this it walks through /proc/mounts
> checking the uuid of each filesystem.  If a filesystem isn't listed
> there, it obviously fails.
> 
> I guess you could add code to nfs-utils to do whatever "btrfs subvol
> list" does to make up for the fact that btrfs doesn't register in
> /proc/mounts.

Another approach might be to just change svcxdr_encode_fattr3() and
nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
dentry->d_inode has a different btrfs volume id to
exp->ex_path.dentry->d_inode.
If it does, then mix the volume id into the fsid somehow.

With that, you wouldn't want the first change I suggested.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21  5:13               ` NeilBrown
@ 2021-06-21  8:34                 ` Wang Yugui
  2021-06-22  1:28                   ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-21  8:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1065 bytes --]

Hi,

> > > It seems more fixes are needed.
> > 
> > I think the problem is that the submount doesn't appear in /proc/mounts.
> > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > filesystem to the mount point.  To do this it walks through /proc/mounts
> > checking the uuid of each filesystem.  If a filesystem isn't listed
> > there, it obviously fails.
> > 
> > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > list" does to make up for the fact that btrfs doesn't register in
> > /proc/mounts.
> 
> Another approach might be to just change svcxdr_encode_fattr3() and
> nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
> dentry->d_inode has a different btrfs volume id to
> exp->ex_path.dentry->d_inode.
> If it does, then mix the volume id into the fsid somehow.
> 
> With that, you wouldn't want the first change I suggested.

This is what I have done; it is based on linux 5.10.44.

But it still does not work, so more work is needed.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/21


[-- Attachment #2: 0001-nfsd-btrfs-subvol-support.patch --]
[-- Type: application/octet-stream, Size: 4741 bytes --]

From 57e6b3cec9b8ac396b661c190511af80839ddbe5 Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] nfsd: btrfs subvol support

(struct statfs).f_fsid: 	unique between btrfs subvols
(struct stat).st_dev: 		unique between btrfs subvols
(struct statx).stx_mnt_id:	NOT unique between btrfs subvols, but not yet used in nfs/nfsd
	kernel samples/vfs/test-statx.c
		stx_rdev_major/stx_rdev_minor seem to be truncated by something
		like old_encode_dev()/old_decode_dev()?

TODO: (struct nfs_fattr).fsid
TODO: FSIDSOURCE_FSID in nfs3xdr.c/nfsxdr.c
---
 fs/namei.c         |  2 ++
 fs/nfsd/nfs3xdr.c  |  2 +-
 fs/nfsd/nfs4xdr.c  | 10 ++++++++--
 fs/nfsd/vfs.c      |  6 ++++++
 include/linux/fs.h | 13 +++++++++++++
 5 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fe132e3..6974a95 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1105,6 +1105,8 @@ int follow_up(struct path *path)
 	struct mount *parent;
 	struct dentry *mountpoint;
 
+	if(unlikely(d_is_btrfs_subvol(path->dentry)))
+		return 0;
 	read_seqlock_excl(&mount_lock);
 	parent = mnt->mnt_parent;
 	if (parent == mnt) {
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 716566d..45666b3 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -877,7 +877,7 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
 		dchild = lookup_positive_unlocked(name, dparent, namlen);
 	if (IS_ERR(dchild))
 		return rv;
-	if (d_mountpoint(dchild))
+	if (d_mountpoint(dchild) || unlikely(d_is_btrfs_subvol(dchild)))
 		goto out;
 	if (dchild->d_inode->i_ino != ino)
 		goto out;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 5f5169b..939d095 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2728,6 +2728,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		.dentry	= dentry,
 	};
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+	bool is_btrfs_subvol= d_is_btrfs_subvol(dentry);
 
 	BUG_ON(bmval1 & NFSD_WRITEONLY_ATTRS_WORD1);
 	BUG_ON(!nfsd_attrs_supported(minorversion, bmval));
@@ -2744,7 +2745,8 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
-		       FATTR4_WORD1_SPACE_TOTAL))) {
+		       FATTR4_WORD1_SPACE_TOTAL)) ||
+		unlikely(is_btrfs_subvol)) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
 			goto out_nfserr;
@@ -2885,7 +2887,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			p = xdr_encode_hyper(p, NFS4_REFERRAL_FSID_MINOR);
 		} else switch(fsid_source(fhp)) {
 		case FSIDSOURCE_FSID:
-			p = xdr_encode_hyper(p, (u64)exp->ex_fsid);
+			if (unlikely(is_btrfs_subvol)){
+				*p++ = cpu_to_be32(statfs.f_fsid.val[0]);
+				*p++ = cpu_to_be32(statfs.f_fsid.val[1]);
+			} else
+				p = xdr_encode_hyper(p, (u64)exp->ex_fsid);
 			p = xdr_encode_hyper(p, (u64)0);
 			break;
 		case FSIDSOURCE_DEV:
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecacee..ae34ffc 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -68,6 +68,10 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	err = follow_down(&path);
 	if (err < 0)
 		goto out;
+	if (unlikely(d_is_btrfs_subvol(dentry))){
+		path_put(&path);
+		goto out;
+	} else
 	if (path.mnt == exp->ex_path.mnt && path.dentry == dentry &&
 	    nfsd_mountpoint(dentry, exp) == 2) {
 		/* This is only a mountpoint in some other namespace */
@@ -160,6 +164,8 @@ int nfsd_mountpoint(struct dentry *dentry, struct svc_export *exp)
 		return 1;
 	if (nfsd4_is_junction(dentry))
 		return 1;
+	if (d_is_btrfs_subvol(dentry))
+		return 1;
 	if (d_mountpoint(dentry))
 		/*
 		 * Might only be a mountpoint in a different namespace,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8bde32c..b0d52e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3399,6 +3399,19 @@ static inline bool is_root_inode(struct inode *inode)
 	return inode == inode->i_sb->s_root->d_inode;
 }
 
+/*
+ * same logical as fs/btrfs is_subvolume_inode(struct inode *inode)
+ * #define BTRFS_FIRST_FREE_OBJECTID 256ULL
+ * #define BTRFS_SUPER_MAGIC       0x9123683E
+ */
+static inline bool d_is_btrfs_subvol(const struct dentry *dentry)
+{
+    bool ret = dentry->d_inode && unlikely(dentry->d_inode->i_ino == 256ULL) &&
+		dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
+	//printk(KERN_INFO "d_is_btrfs_subvol(%s)=%d\n", dentry->d_name.name, ret);
+	return ret;
+}
+
 static inline bool dir_emit(struct dir_context *ctx,
 			    const char *name, int namelen,
 			    u64 ino, unsigned type)
-- 
2.30.2


[-- Attachment #3: 0002-trace-nfsd-btrfs-subvol-support.txt --]
[-- Type: application/octet-stream, Size: 3431 bytes --]

From 6e709554e4c7d5efd3b030fc08b6e6493879a025 Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] trace nfsd: btrfs subvol support

[  268.994169] follow_down(xfs2)=0
[  268.997405] nfsd_cross_mnt(xfs2)=0
*[  269.000840] nfsd4_encode_dirent_fattr(/) ignore_crossmnt=0
	why /
*[  269.006477] nfs_d_automount(xfs2)
	why not happen when btrfs subvol
[  269.009892] follow_down(xfs2)=0
[  269.013055] nfsd_cross_mnt(xfs2)=0

[  483.037221] follow_down(sub1)=0
[  483.040428] nfsd_cross_mnt(sub1)=0
[  483.043855] nfsd4_encode_dirent_fattr(sub1) ignore_crossmnt=0
[  483.049635] nfsd4_encode_dirent_fattr(.snapshot) ignore_crossmnt=0
[  483.055847] nfsd4_encode_dirent_fattr(dir1) ignore_crossmnt=0
[  483.062669] follow_down(sub2)=0
[  483.066800] nfsd_cross_mnt(sub2)=0

btrfs subvols =>force crossmnt
	subvol nfs/umount: os shutdown or manual nfs/umount?
		special status(BTRFS_LAST_FREE_OBJECTID,only return to nfs)?
		#define BTRFS_LAST_FREE_OBJECTID -256ULL
		(struct file )->(struct inode *f_inode)->(struct super_block *i_sb;)->(unsigned long s_magic)
btrfs->xfs	=>still need crossmnt
xfs->btrfs	=>still need crossmnt

NFSEXP_CROSSMOUNT
NFSD_JUNCTION_XATTR_NAME
AT_NO_AUTOMOUNT
NFS_ATTR_FATTR_MOUNTPOINT
S_AUTOMOUNT

---
 fs/nfs/dir.c       | 2 ++
 fs/nfs/namespace.c | 1 +
 fs/nfsd/nfs4xdr.c  | 3 +++
 fs/nfsd/vfs.c      | 1 +
 4 files changed, 7 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c837675..975440d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1799,6 +1799,8 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 
 	if (!(flags & LOOKUP_OPEN) || (flags & LOOKUP_DIRECTORY))
 		goto full_reval;
+	if (dentry->d_inode && dentry->d_inode->i_ino == 256ULL && dentry->d_sb)
+		printk(KERN_INFO "nfs4_do_lookup_revalidate(%s)=%lx\n", dentry->d_name.name, dentry->d_sb->s_magic);
 	if (d_mountpoint(dentry))
 		goto full_reval;
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 2bcbe38..f69715c 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -152,6 +152,7 @@ struct vfsmount *nfs_d_automount(struct path *path)
 	int timeout = READ_ONCE(nfs_mountpoint_expiry_timeout);
 	int ret;
 
+	printk(KERN_INFO "nfs_d_automount(%s)\n", path->dentry->d_name.name);
 	if (IS_ROOT(path->dentry))
 		return ERR_PTR(-ESTALE);
 
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 939d095..cb5b328 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3316,6 +3316,8 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 	 */
 	if (nfsd_mountpoint(dentry, exp)) {
 		int err;
+		// if(d_is_btrfs_subvol(dentry))
+		//	cd->rd_bmval[1] |= FATTR4_WORD1_MOUNTED_ON_FILEID;
 
 		if (!(exp->ex_flags & NFSEXP_V4ROOT)
 				&& !attributes_need_mount(cd->rd_bmval)) {
@@ -3343,6 +3345,7 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 out_put:
 	dput(dentry);
 	exp_put(exp);
+	printk(KERN_INFO "nfsd4_encode_dirent_fattr(%s) ignore_crossmnt=%d\n", dentry->d_name.name, ignore_crossmnt);
 	return nfserr;
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ae34ffc..c12a394 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -111,6 +111,7 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	path_put(&path);
 	exp_put(exp2);
 out:
+	printk(KERN_INFO "nfsd_cross_mnt(%s)=%d\n", dentry->d_name.name, err);
 	return err;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* RE: any idea about auto export multiple btrfs snapshots?
  2021-06-21  4:52             ` NeilBrown
  2021-06-21  5:13               ` NeilBrown
@ 2021-06-21 14:35               ` Frank Filz
  2021-06-21 14:55                 ` Wang Yugui
  1 sibling, 1 reply; 94+ messages in thread
From: Frank Filz @ 2021-06-21 14:35 UTC (permalink / raw)
  To: 'NeilBrown', 'Wang Yugui'; +Cc: linux-nfs

> I think the problem is that the submount doesn't appear in /proc/mounts.
> "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a filesystem to
> the mount point.  To do this it walks through /proc/mounts checking the uuid of
> each filesystem.  If a filesystem isn't listed there, it obviously fails.
> 
> I guess you could add code to nfs-utils to do whatever "btrfs subvol list" does to
> make up for the fact that btrfs doesn't register in /proc/mounts.
> 
> NeilBrown

I've been watching this with interest for the nfs-ganesha project. We recently were made aware that we weren't working with btrfs subvols, so I added code so that, in addition to using getmntent (essentially /proc/mounts) to populate filesystems, we also scan for btrfs subvols; with that we are able to export subvols.

My question is: does a snapshot look any different than a subvol? If snapshots show up in the subvol list then we shouldn't need to do anything more for nfs-ganesha, but if something else is needed to discover them, we may need additional code. I have not yet had a chance to try exporting a snapshot.

Thanks

Frank Filz


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21 14:35               ` Frank Filz
@ 2021-06-21 14:55                 ` Wang Yugui
  2021-06-21 17:49                   ` Frank Filz
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-21 14:55 UTC (permalink / raw)
  To: Frank Filz; +Cc: 'NeilBrown', linux-nfs

Hi,

> > I think the problem is that the submount doesn't appear in /proc/mounts.
> > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a filesystem to
> > the mount point.  To do this it walks through /proc/mounts checking the uuid of
> > each filesystem.  If a filesystem isn't listed there, it obviously fails.
> > 
> > I guess you could add code to nfs-utils to do whatever "btrfs subvol list" does to
> > make up for the fact that btrfs doesn't register in /proc/mounts.
> > 
> > NeilBrown
> 
> I've been watching this with interest for the nfs-ganesha project. We recently were made aware that we weren't working with btrfs subvols, and I added code so that in addition to using getmntent (essentially /proc/mounts) to populate filesystems, we also scan for btrfs subvols and with that we are able to export subvols. My question is does a snapshot look any different than a subvol? If they show up in the subvol list then we shouldn't need to do anything more for nfs-ganesha, but if there's something else needed to discover them, then we may need additional code in nfs-ganesha. I have not yet had a chance to check out exporting a snapshot yet.

>  My question is does a snapshot look any different than a subvol?

In theory there is no difference between a btrfs subvol and a snapshot.

But the number of btrfs subvols in a production environment is usually
fixed, while the number of snapshots is usually dynamic.

For a fixed number of subvols/snapshots, it is OK to put them in the same
hierarchy and then export them all statically in /etc/exports, as a
workaround.

For a dynamic number of snapshots, we need a dynamic way to export them
over nfs.
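The static workaround could look like the following /etc/exports fragment. The paths mirror the examples in this thread; the fsid numbers are arbitrary, illustrative choices (each entry needs a distinct fsid= so clients see separate filesystems):

```text
/mnt/test/sub1               *(rw,no_subtree_check,fsid=101)
/mnt/test/sub2               *(rw,no_subtree_check,fsid=102)
/mnt/test/.snapshot/sub1-s1  *(ro,no_subtree_check,fsid=103)
/mnt/test/.snapshot/sub2-s1  *(ro,no_subtree_check,fsid=104)
```

Each new snapshot still requires editing the file and re-running exportfs -r, which is exactly the limitation of the static approach.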

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/21



^ permalink raw reply	[flat|nested] 94+ messages in thread

* RE: any idea about auto export multiple btrfs snapshots?
  2021-06-21 14:55                 ` Wang Yugui
@ 2021-06-21 17:49                   ` Frank Filz
  2021-06-21 22:41                     ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: Frank Filz @ 2021-06-21 17:49 UTC (permalink / raw)
  To: 'Wang Yugui'; +Cc: 'NeilBrown', linux-nfs

> > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > filesystem to the mount point.  To do this it walks through
> > > /proc/mounts checking the uuid of each filesystem.  If a filesystem
> > > isn't listed there, it obviously fails.
> > >
> > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > list" does to make up for the fact that btrfs doesn't register in
> > > /proc/mounts.
> > >
> > > NeilBrown
> >
> > I've been watching this with interest for the nfs-ganesha project. We
> > recently were made aware that we weren't working with btrfs subvols, and
> > I added code so that in addition to using getmntent (essentially
> > /proc/mounts) to populate filesystems, we also scan for btrfs subvols
> > and with that we are able to export subvols. My question is does a
> > snapshot look any different than a subvol? If they show up in the subvol
> > list then we shouldn't need to do anything more for nfs-ganesha, but if
> > there's something else needed to discover them, then we may need
> > additional code in nfs-ganesha. I have not yet had a chance to check out
> > exporting a snapshot yet.
> 
> >  My question is does a snapshot look any different than a subvol?
> 
> No difference between btrfs subvol and snapshot in theory.
> 
> but the number of btrfs subvols in a production environment is usually
> fixed, and the number of snapshots is usually dynamic.
> 
> For a fixed number of subvols/snapshots, it is OK to put them in the same
> hierarchy, and then export them all statically in /etc/exports, as a
> workaround.
> 
> For a dynamic number of snapshots, we need a dynamic way to export
> them in nfs.

OK, thanks for the information. I think they will just work in nfs-ganesha
as long as the snapshots or subvols are mounted within an nfs-ganesha
export or are exported explicitly. nfs-ganesha has the equivalent of
knfsd's nohide/crossmnt options; when it detects that a filesystem boundary
is being crossed, it looks up the filesystem via getmntent and the btrfs
subvol listing, and then exposes that filesystem (via the fsid attribute)
to clients, where at least the Linux nfs client will detect the filesystem
boundary and create a new mount entry for it.

Frank


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21 17:49                   ` Frank Filz
@ 2021-06-21 22:41                     ` Wang Yugui
  2021-06-22 17:34                       ` Frank Filz
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-21 22:41 UTC (permalink / raw)
  To: Frank Filz; +Cc: 'NeilBrown', linux-nfs

Hi,

> 
> OK thanks for the information. I think they will just work in nfs-ganesha as
> long as the snapshots or subvols are mounted within an nfs-ganesha export or
> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
> boundary will lookup the filesystem via getmntent and listing btrfs subvols
> and then expose that filesystem (via the fsid attribute) to the clients
> where at least the Linux nfs client will detect a filesystem boundary and
> create a new mount entry for it.


It is not enough to export them explicitly; the export hierarchy also matters.

If we export 
/mnt/test		#the btrfs
/mnt/test/sub1	# the btrfs subvol 1
/mnt/test/sub2	# the btrfs subvol 2

we need to make sure the nfs client cannot reach '/mnt/test/sub1'
through '/mnt/test'.

current safe export:
#/mnt/test		#the btrfs, not exported
/mnt/test/sub1	# the btrfs subvol 1
/mnt/test/sub2	# the btrfs subvol 2


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/22



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21  8:34                 ` Wang Yugui
@ 2021-06-22  1:28                   ` NeilBrown
  2021-06-22  3:22                     ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-06-22  1:28 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Mon, 21 Jun 2021, Wang Yugui wrote:
> Hi,
> 
> > > > It seems more fixes are needed.
> > > 
> > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > filesystem to the mount point.  To do this it walks through /proc/mounts
> > > checking the uuid of each filesystem.  If a filesystem isn't listed
> > > there, it obviously fails.
> > > 
> > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > list" does to make up for the fact that btrfs doesn't register in
> > > /proc/mounts.
> > 
> > Another approach might be to just change svcxdr_encode_fattr3() and
> > nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
> > dentry->d_inode has a different btrfs volume id to
> > exp->ex_path.dentry->d_inode.
> > If it does, then mix the volume id into the fsid somehow.
> > 
> > With that, you wouldn't want the first change I suggested.
> 
> > This is what I have done; it is based on linux 5.10.44.
> > 
> > But it still does not work, so more work is needed.
> 

The following is more what I had in mind.  It doesn't quite work and I
cannot work out why.

If you 'stat' a file inside the subvol, then 'find' will not complete.
If you don't, then it will.

Doing that 'stat' changes the st_dev number of the main filesystem,
which seems really weird.
I'm probably missing something obvious.  Maybe a more careful analysis
of what is changing when will help.

NeilBrown


diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/module.h>
+#include <linux/statfs.h>
 #include <linux/exportfs.h>
 #include <linux/sunrpc/svc_xprt.h>
 
@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	int err;
 	struct auth_domain *dom = NULL;
 	struct svc_export exp = {}, *expp;
+	struct kstatfs statfs;
 	int an_int;
 
 	if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	err = kern_path(buf, 0, &exp.ex_path);
 	if (err)
 		goto out1;
+	err = vfs_statfs(&exp.ex_path, &statfs);
+	if (err)
+		goto out3;
+	exp.ex_fsid64 = statfs.f_fsid;
 
 	exp.ex_client = dom;
 	exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->ex_fsid64 = item->ex_fsid64;
 	new->ex_devid_map = item->ex_devid_map;
 	item->ex_devid_map = NULL;
 	new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
+	__kernel_fsid_t		ex_fsid64;
 	unsigned char *		ex_uuid; /* 16 byte fsid */
 	struct nfsd4_fs_locations ex_fslocs;
 	uint32_t		ex_nflavors;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..8144e6037eae 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2869,6 +2869,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if (err)
 		goto out_nfserr;
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+		       FATTR4_WORD0_FSID |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
 		       FATTR4_WORD1_SPACE_TOTAL))) {
@@ -3024,6 +3025,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		case FSIDSOURCE_UUID:
 			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
+			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+				/* looks like a btrfs subvol */
+				p[-2] ^= statfs.f_fsid.val[0];
+				p[-1] ^= statfs.f_fsid.val[1];
+			}
 			break;
 		}
 	}
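
The FSIDSOURCE_UUID hunk above XORs the per-subvolume statfs fsid into the
exported UUID whenever it differs from the fsid recorded for the export root,
so each subvolume presents a distinct fsid while the export root's stays
unchanged. A minimal userspace sketch of that mixing step (names and constants
are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the patch's fsid mixing: fold the current
 * filesystem's statfs fsid into two 32-bit words of the exported
 * UUID, but only when it differs from the export root's fsid
 * (i.e. when it looks like a btrfs subvolume). */
struct fsid64 { uint32_t val[2]; };

static void mix_fsid(uint32_t uuid_words[2],
		     struct fsid64 cur, struct fsid64 root)
{
	if (cur.val[0] != root.val[0] || cur.val[1] != root.val[1]) {
		/* looks like a btrfs subvol: mix its fsid in */
		uuid_words[0] ^= cur.val[0];
		uuid_words[1] ^= cur.val[1];
	}
}
```

In the actual patch the two mixed words are the trailing bytes of the 16-byte
opaque fsid just written by xdr_encode_opaque_fixed().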


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-22  1:28                   ` NeilBrown
@ 2021-06-22  3:22                     ` Wang Yugui
  2021-06-22  7:14                       ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-22  3:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 4595 bytes --]

Hi,


> On Mon, 21 Jun 2021, Wang Yugui wrote:
> > Hi,
> > 
> > > > > It seems more fixes are needed.
> > > > 
> > > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > > filesystem to the mount point.  To do this it walks through /proc/mounts
> > > > checking the uuid of each filesystem.  If a filesystem isn't listed
> > > > there, it obviously fails.
> > > > 
> > > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > > list" does to make up for the fact that btrfs doesn't register in
> > > > /proc/mounts.
> > > 
> > > Another approach might be to just change svcxdr_encode_fattr3() and
> > > nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
> > > dentry->d_inode has a different btrfs volume id to
> > > exp->ex_path.dentry->d_inode.
> > > If it does, then mix the volume id into the fsid somehow.
> > > 
> > > With that, you wouldn't want the first change I suggested.
> > 
> > This is what I have done, and it is based on Linux 5.10.44,
> > 
> > but it still does not work, so more work is still needed.
> > 
> 
> The following is more what I had in mind.  It doesn't quite work and I
> cannot work out why.
> 
> If you 'stat' a file inside the subvol, then 'find' will not complete.
> If you don't, then it will.
> 
> Doing that 'stat' changes the st_dev number of the main filesystem,
> which seems really weird.
> I'm probably missing something obvious.  Maybe a more careful analysis
> of what is changing when will help.

We compared the trace output between the crossmnt and btrfs subvol
cases, and found that we need to add subvol support to
follow_down().

A btrfs subvol should be treated as a virtual 'mount point' by nfsd in follow_down().
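
The detection behind that idea, as used in the attached patch, relies on two
facts: the root inode of every btrfs subvolume has inode number
BTRFS_FIRST_FREE_OBJECTID (256), and the superblock magic identifies btrfs. A
userspace model of the predicate (a sketch for illustration only; the kernel
version tests dentry->d_inode and dentry->d_sb):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTRFS_SUPER_MAGIC         0x9123683E
#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/* Model of d_is_btrfs_subvol(): a dentry is (heuristically) a btrfs
 * subvolume root when its inode number is BTRFS_FIRST_FREE_OBJECTID
 * and the filesystem's superblock magic is btrfs. */
static bool looks_like_btrfs_subvol(uint64_t ino, unsigned long s_magic)
{
	return ino == BTRFS_FIRST_FREE_OBJECTID &&
	       s_magic == BTRFS_SUPER_MAGIC;
}
```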

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/22


> NeilBrown
> 
> 
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
>  #include <linux/slab.h>
>  #include <linux/namei.h>
>  #include <linux/module.h>
> +#include <linux/statfs.h>
>  #include <linux/exportfs.h>
>  #include <linux/sunrpc/svc_xprt.h>
>  
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	int err;
>  	struct auth_domain *dom = NULL;
>  	struct svc_export exp = {}, *expp;
> +	struct kstatfs statfs;
>  	int an_int;
>  
>  	if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	err = kern_path(buf, 0, &exp.ex_path);
>  	if (err)
>  		goto out1;
> +	err = vfs_statfs(&exp.ex_path, &statfs);
> +	if (err)
> +		goto out3;
> +	exp.ex_fsid64 = statfs.f_fsid;
>  
>  	exp.ex_client = dom;
>  	exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
>  	new->ex_anon_uid = item->ex_anon_uid;
>  	new->ex_anon_gid = item->ex_anon_gid;
>  	new->ex_fsid = item->ex_fsid;
> +	new->ex_fsid64 = item->ex_fsid64;
>  	new->ex_devid_map = item->ex_devid_map;
>  	item->ex_devid_map = NULL;
>  	new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> +	__kernel_fsid_t		ex_fsid64;
>  	unsigned char *		ex_uuid; /* 16 byte fsid */
>  	struct nfsd4_fs_locations ex_fslocs;
>  	uint32_t		ex_nflavors;
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..8144e6037eae 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2869,6 +2869,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  	if (err)
>  		goto out_nfserr;
>  	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> +		       FATTR4_WORD0_FSID |
>  			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
>  	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
>  		       FATTR4_WORD1_SPACE_TOTAL))) {
> @@ -3024,6 +3025,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		case FSIDSOURCE_UUID:
>  			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
>  								EX_UUID_LEN);
> +			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> +				/* looks like a btrfs subvol */
> +				p[-2] ^= statfs.f_fsid.val[0];
> +				p[-1] ^= statfs.f_fsid.val[1];
> +			}
>  			break;
>  		}
>  	}


[-- Attachment #2: 0001-nfsd-btrfs-subvol-support.txt --]
[-- Type: application/octet-stream, Size: 5521 bytes --]

From 57e6b3cec9b8ac396b661c190511af80839ddbe5 Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] nfsd: btrfs subvol support

(struct statfs).f_fsid: 	unique between btrfs subvols
(struct stat).st_dev: 		unique between btrfs subvols
(struct statx).stx_mnt_id:	NOT unique between btrfs subvols, but not yet used in nfs/nfsd
	kernel samples/vfs/test-statx.c
		stx_rdev_major/stx_rdev_minor seem to be truncated by something
		like old_encode_dev()/old_decode_dev()?

TODO: (struct nfs_fattr).fsid
TODO: FSIDSOURCE_FSID in nfs3xdr.c/nfsxdr.c
---
 fs/nfsd/nfs3xdr.c |  2 +-
 fs/nfsd/nfs4xdr.c | 16 ++++++++++++----
 fs/nfsd/nfsd.h    | 27 +++++++++++++++++++++++++++
 fs/nfsd/vfs.c     | 10 ++++++++--
 4 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 716566d..0de2953 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -877,7 +877,7 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
 		dchild = lookup_positive_unlocked(name, dparent, namlen);
 	if (IS_ERR(dchild))
 		return rv;
-	if (d_mountpoint(dchild))
+	if (d_mountpoint(dchild) || unlikely(d_is_btrfs_subvol(dchild)))
 		goto out;
 	if (dchild->d_inode->i_ino != ino)
 		goto out;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 5f5169b..ee335fc 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2457,7 +2457,7 @@ static __be32 nfsd4_encode_path(struct xdr_stream *xdr,
 		if (path_equal(&cur, root))
 			break;
 		if (cur.dentry == cur.mnt->mnt_root) {
-			if (follow_up(&cur))
+			if (nfsd_follow_up(&cur))
 				continue;
 			goto out_free;
 		}
@@ -2648,7 +2648,7 @@ static int get_parent_attributes(struct svc_export *exp, struct kstat *stat)
 	int err;
 
 	path_get(&path);
-	while (follow_up(&path)) {
+	while (nfsd_follow_up(&path)) {
 		if (path.dentry != path.mnt->mnt_root)
 			break;
 	}
@@ -2728,6 +2728,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		.dentry	= dentry,
 	};
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+	bool is_btrfs_subvol= d_is_btrfs_subvol(dentry);
 
 	BUG_ON(bmval1 & NFSD_WRITEONLY_ATTRS_WORD1);
 	BUG_ON(!nfsd_attrs_supported(minorversion, bmval));
@@ -2744,7 +2745,8 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
-		       FATTR4_WORD1_SPACE_TOTAL))) {
+		       FATTR4_WORD1_SPACE_TOTAL)) ||
+		unlikely(is_btrfs_subvol)) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
 			goto out_nfserr;
@@ -2895,7 +2897,13 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			*p++ = cpu_to_be32(MINOR(stat.dev));
 			break;
 		case FSIDSOURCE_UUID:
-			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
+			if (unlikely(is_btrfs_subvol)){
+				*p++ = cpu_to_be32(statfs.f_fsid.val[0]);
+				*p++ = cpu_to_be32(statfs.f_fsid.val[1]);
+				*p++ = cpu_to_be32(0);
+				*p++ = cpu_to_be32(0);
+			} else
+				p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
 			break;
 		}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index cb742e1..27baabb 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -487,4 +487,31 @@ static inline int nfsd4_is_junction(struct dentry *dentry)
 
 #endif /* CONFIG_NFSD_V4 */
 
+/* btrfs subvol support */
+/*
+ * same logical as fs/btrfs is_subvolume_inode(struct inode *inode)
+ * #define BTRFS_FIRST_FREE_OBJECTID 256ULL
+ * #define BTRFS_SUPER_MAGIC       0x9123683E
+ */
+static inline bool d_is_btrfs_subvol(const struct dentry *dentry)
+{
+    bool ret = dentry->d_inode && unlikely(dentry->d_inode->i_ino == 256ULL) &&
+		dentry->d_sb && dentry->d_sb->s_magic == BTRFS_SUPER_MAGIC;
+	//printk(KERN_INFO "d_is_btrfs_subvol(%s)=%d\n", dentry->d_name.name, ret);
+	return ret;
+}
+#include <linux/namei.h>
+/* add btrfs subvol support that only used in nfsd */
+static inline int nfsd_follow_down(struct path *path)
+{
+	return follow_down(path);
+}
+/* add btrfs subvol support that only used in nfsd */
+static inline int nfsd_follow_up(struct path *path)
+{
+	if(unlikely(d_is_btrfs_subvol(path->dentry)))
+		return 0;
+	return follow_up(path);
+}
+
 #endif /* LINUX_NFSD_NFSD_H */
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecacee..3ab9b7f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -65,9 +65,13 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 			    .dentry = dget(dentry)};
 	int err = 0;
 
-	err = follow_down(&path);
+	err = nfsd_follow_down(&path);
 	if (err < 0)
 		goto out;
+	if (unlikely(d_is_btrfs_subvol(dentry))){
+		path_put(&path);
+		goto out;
+	} else
 	if (path.mnt == exp->ex_path.mnt && path.dentry == dentry &&
 	    nfsd_mountpoint(dentry, exp) == 2) {
 		/* This is only a mountpoint in some other namespace */
@@ -114,7 +118,7 @@ static void follow_to_parent(struct path *path)
 {
 	struct dentry *dp;
 
-	while (path->dentry == path->mnt->mnt_root && follow_up(path))
+	while (path->dentry == path->mnt->mnt_root && nfsd_follow_up(path))
 		;
 	dp = dget_parent(path->dentry);
 	dput(path->dentry);
@@ -160,6 +164,8 @@ int nfsd_mountpoint(struct dentry *dentry, struct svc_export *exp)
 		return 1;
 	if (nfsd4_is_junction(dentry))
 		return 1;
+	if (d_is_btrfs_subvol(dentry))
+		return 1;
 	if (d_mountpoint(dentry))
 		/*
 		 * Might only be a mountpoint in a different namespace,
-- 
2.30.2


[-- Attachment #3: 0002-trace-nfsd-btrfs-subvol-support.txt --]
[-- Type: application/octet-stream, Size: 5897 bytes --]

From 639489a60b84f9d16955143f52fc6316205ac57a Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] trace nfsd: btrfs subvol support

[  235.831136] set_version_and_fsid_type fsid_type=7
[  235.842483] nfsd_cross_mnt(test)=0
[  235.845882] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  235.854902] set_version_and_fsid_type fsid_type=6
[  235.859686] nfs_d_automount(test)
[  235.863069] nfsd_cross_mnt(test)=0
[  235.866478] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  235.875500] set_version_and_fsid_type fsid_type=6
[  239.204677] lookup_positive_unlocked(name=xfs2) dentry=xfs2
[  239.210311] nfsd_cross_mnt(xfs2)=0
[  239.213708] set_version_and_fsid_type fsid_type=6
[  239.218406] nfsd4_encode_dirent_fattr(/) FATTR4_WORD0_FSID=1  FATTR4_WORD1_MOUNTED_ON_FILEID=1
	why /?
[  239.227078] nfs_d_automount(xfs2)
	why?
[  239.230437] nfsd_cross_mnt(xfs2)=0
[  239.233838] nfsd: nfsd_lookup(fh 20: 00060001 2b031f7d c249fdd0 1aa84b8e 045d774a 00000000, xfs2)=/
[  239.242854] set_version_and_fsid_type fsid_type=6

[  373.332124] set_version_and_fsid_type fsid_type=7
[  373.337639] nfsd_cross_mnt(test)=0
[  373.341035] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  373.350047] set_version_and_fsid_type fsid_type=6
[  373.354781] nfs_d_automount(test)
[  373.358125] nfsd_cross_mnt(test)=0
[  373.361524] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  373.370537] set_version_and_fsid_type fsid_type=6
[  377.521908] lookup_positive_unlocked(name=sub1) dentry=sub1
[  377.527477] nfsd_cross_mnt(sub1)=0
[  377.530879] set_version_and_fsid_type fsid_type=6
[  377.535572] nfsd4_encode_dirent_fattr(sub1) FATTR4_WORD0_FSID=1  FATTR4_WORD1_MOUNTED_ON_FILEID=1
[  377.544420] lookup_positive_unlocked(name=.snapshot) dentry=.snapshot

btrfs subvols =>force crossmnt
	subvol nfs/umount: os shutdown or manual nfs/umount?
		special status(BTRFS_LAST_FREE_OBJECTID,only return to nfs)?
		#define BTRFS_LAST_FREE_OBJECTID -256ULL
		(struct file )->(struct inode *f_inode)->(struct super_block *i_sb;)->(unsigned long s_magic)
btrfs->xfs	=>still need crossmnt
xfs->btrfs	=>still need crossmnt

NFSEXP_CROSSMOUNT
NFSD_JUNCTION_XATTR_NAME
AT_NO_AUTOMOUNT
NFS_ATTR_FATTR_MOUNTPOINT
S_AUTOMOUNT
---
 fs/nfs/dir.c       | 2 ++
 fs/nfs/namespace.c | 1 +
 fs/nfsd/nfs4xdr.c  | 5 +++++
 fs/nfsd/nfsfh.c    | 1 +
 fs/nfsd/vfs.c      | 5 +++++
 5 files changed, 14 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c837675..975440d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1799,6 +1799,8 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 
 	if (!(flags & LOOKUP_OPEN) || (flags & LOOKUP_DIRECTORY))
 		goto full_reval;
+	if (dentry->d_inode && dentry->d_inode->i_ino == 256ULL && dentry->d_sb)
+		printk(KERN_INFO "nfs4_do_lookup_revalidate(%s)=%lx\n", dentry->d_name.name, dentry->d_sb->s_magic);
 	if (d_mountpoint(dentry))
 		goto full_reval;
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 2bcbe38..f69715c 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -152,6 +152,7 @@ struct vfsmount *nfs_d_automount(struct path *path)
 	int timeout = READ_ONCE(nfs_mountpoint_expiry_timeout);
 	int ret;
 
+	printk(KERN_INFO "nfs_d_automount(%s)\n", path->dentry->d_name.name);
 	if (IS_ROOT(path->dentry))
 		return ERR_PTR(-ESTALE);
 
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 6255b06..257ee17 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3307,6 +3307,7 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 	dentry = lookup_positive_unlocked(name, cd->rd_fhp->fh_dentry, namlen);
 	if (IS_ERR(dentry))
 		return nfserrno(PTR_ERR(dentry));
+	printk(KERN_INFO "lookup_positive_unlocked(name=%s) dentry=%s\n", name, dentry->d_name.name);
 
 	exp_get(exp);
 	/*
@@ -3345,6 +3346,10 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 out_put:
 	dput(dentry);
 	exp_put(exp);
+	printk(KERN_INFO "nfsd4_encode_dirent_fattr(%s) FATTR4_WORD0_FSID=%d  FATTR4_WORD1_MOUNTED_ON_FILEID=%d\n",
+		 dentry->d_name.name,
+		 !!(cd->rd_bmval[0]&FATTR4_WORD0_FSID),
+		!!(cd->rd_bmval[1]&FATTR4_WORD1_MOUNTED_ON_FILEID));
 	return nfserr;
 }
 
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbba..28eaea3 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -530,6 +530,7 @@ static void set_version_and_fsid_type(struct svc_fh *fhp, struct svc_export *exp
 	fhp->fh_handle.fh_version = version;
 	if (version)
 		fhp->fh_handle.fh_fsid_type = fsid_type;
+	printk(KERN_INFO "set_version_and_fsid_type fsid_type=%d\n", fsid_type);
 }
 
 __be32
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ae34ffc..6c55010 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -66,6 +66,8 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	int err = 0;
 
 	err = nfsd_follow_down(&path);
+	printk(KERN_INFO "follow_down()=%d path.mnt=%s path.dentry=%s\n", err,
+		path.mnt->mnt_root->d_name.name, path.dentry->d_name.name);
 	if (err < 0)
 		goto out;
 	if (unlikely(d_is_btrfs_subvol(dentry))){
@@ -111,6 +113,7 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	path_put(&path);
 	exp_put(exp2);
 out:
+	printk(KERN_INFO "nfsd_cross_mnt(%s)=%d\n", dentry->d_name.name, err);
 	return err;
 }
 
@@ -233,9 +236,11 @@ nfsd_lookup_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 	*dentry_ret = dentry;
 	*exp_ret = exp;
+	// printk(KERN_INFO "nfsd: nfsd_lookup(fh %s, %.*s)=%s\n", SVCFH_fmt(fhp), len, name, dentry->d_name.name);
 	return 0;
 
 out_nfserr:
+	// printk(KERN_INFO "nfsd: nfsd_lookup(fh %s, %.*s) error\n", SVCFH_fmt(fhp), len, name);
 	exp_put(exp);
 	return nfserrno(host_err);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-22  3:22                     ` Wang Yugui
@ 2021-06-22  7:14                       ` Wang Yugui
  2021-06-23  0:59                         ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-22  7:14 UTC (permalink / raw)
  To: NeilBrown, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2210 bytes --]

Hi,

> > > 
> > > > > > It seems more fixes are needed.
> > > > > 
> > > > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > > > filesystem to the mount point.  To do this it walks through /proc/mounts
> > > > > checking the uuid of each filesystem.  If a filesystem isn't listed
> > > > > there, it obviously fails.
> > > > > 
> > > > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > > > list" does to make up for the fact that btrfs doesn't register in
> > > > > /proc/mounts.
> > > > 
> > > > Another approach might be to just change svcxdr_encode_fattr3() and
> > > > nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
> > > > dentry->d_inode has a different btrfs volume id to
> > > > exp->ex_path.dentry->d_inode.
> > > > If it does, then mix the volume id into the fsid somehow.
> > > > 
> > > > With that, you wouldn't want the first change I suggested.
> > > 
> > > This is what I have done, and it is based on Linux 5.10.44,
> > > 
> > > but it still does not work, so more work is still needed.
> > > 
> > 
> > The following is more what I had in mind.  It doesn't quite work and I
> > cannot work out why.
> > 
> > If you 'stat' a file inside the subvol, then 'find' will not complete.
> > If you don't, then it will.
> > 
> > Doing that 'stat' changes the st_dev number of the main filesystem,
> > which seems really weird.
> > I'm probably missing something obvious.  Maybe a more careful analysis
> > of what is changing when will help.
> 
> We compared the trace output between the crossmnt and btrfs subvol
> cases, and found that we need to add subvol support to
> follow_down().
> 
> A btrfs subvol should be treated as a virtual 'mount point' by nfsd in follow_down().

btrfs subvol crossmnt is beginning to work, although it is buggy.

Some subvols are crossmnt-ed, some are not yet, and some directories
are wrongly crossmnt-ed.

'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt to
begin happening.

This is the current patch, based on 5.10.44.
At least nfsd_follow_up() is buggy.
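
The nfsd_follow_up() wrapper in the attached patch stops the upward walk when
it reaches a subvolume root, treating it as a virtual mount boundary. A toy
userspace model of that behaviour (types invented for illustration; the real
code operates on struct path and struct vfsmount):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the patched walk: follow_up() climbs from a mount's
 * root to its mountpoint in the parent mount, but the wrapper first
 * returns 0 (stop) when the current node is a btrfs subvolume root. */
struct node {
	struct node *parent_mnt;   /* mountpoint in the parent mount */
	bool is_subvol_root;
};

static int nfsd_follow_up_model(struct node **pos)
{
	if ((*pos)->is_subvol_root)
		return 0;          /* subvol boundary: do not climb */
	if ((*pos)->parent_mnt) {
		*pos = (*pos)->parent_mnt;
		return 1;          /* climbed one mount */
	}
	return 0;                  /* already topmost */
}
```

The bug Wang suspects is plausibly in the reference handling around
clone_private_mount() (see the FIXME comments in the patch) rather than in
this stop condition, but that is a guess from the patch text, not a confirmed
diagnosis.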

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/22


[-- Attachment #2: 0001-nfsd-btrfs-subvol-support.txt --]
[-- Type: application/octet-stream, Size: 6044 bytes --]

From 57e6b3cec9b8ac396b661c190511af80839ddbe5 Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] nfsd: btrfs subvol support

(struct statfs).f_fsid: 	unique between btrfs subvols
(struct stat).st_dev: 		unique between btrfs subvols
(struct statx).stx_mnt_id:	NOT unique between btrfs subvols, but not yet used in nfs/nfsd
	kernel samples/vfs/test-statx.c
		stx_rdev_major/stx_rdev_minor seem to be truncated by something
		like old_encode_dev()/old_decode_dev()?

TODO: (struct nfs_fattr).fsid
TODO: FSIDSOURCE_FSID in nfs3xdr.c/nfsxdr.c
---
 fs/nfsd/nfs3xdr.c |  2 +-
 fs/nfsd/nfs4xdr.c | 16 ++++++++++++----
 fs/nfsd/nfsd.h    | 42 ++++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/vfs.c     | 10 ++++++++--
 4 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 716566d..0de2953 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -877,7 +877,7 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
 		dchild = lookup_positive_unlocked(name, dparent, namlen);
 	if (IS_ERR(dchild))
 		return rv;
-	if (d_mountpoint(dchild))
+	if (d_mountpoint(dchild) || unlikely(d_is_btrfs_subvol(dchild)))
 		goto out;
 	if (dchild->d_inode->i_ino != ino)
 		goto out;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 5f5169b..ee335fc 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2457,7 +2457,7 @@ static __be32 nfsd4_encode_path(struct xdr_stream *xdr,
 		if (path_equal(&cur, root))
 			break;
 		if (cur.dentry == cur.mnt->mnt_root) {
-			if (follow_up(&cur))
+			if (nfsd_follow_up(&cur))
 				continue;
 			goto out_free;
 		}
@@ -2648,7 +2648,7 @@ static int get_parent_attributes(struct svc_export *exp, struct kstat *stat)
 	int err;
 
 	path_get(&path);
-	while (follow_up(&path)) {
+	while (nfsd_follow_up(&path)) {
 		if (path.dentry != path.mnt->mnt_root)
 			break;
 	}
@@ -2728,6 +2728,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		.dentry	= dentry,
 	};
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+	bool is_btrfs_subvol= d_is_btrfs_subvol(dentry);
 
 	BUG_ON(bmval1 & NFSD_WRITEONLY_ATTRS_WORD1);
 	BUG_ON(!nfsd_attrs_supported(minorversion, bmval));
@@ -2744,7 +2745,8 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
-		       FATTR4_WORD1_SPACE_TOTAL))) {
+		       FATTR4_WORD1_SPACE_TOTAL)) ||
+		unlikely(is_btrfs_subvol)) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
 			goto out_nfserr;
@@ -2895,7 +2897,13 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			*p++ = cpu_to_be32(MINOR(stat.dev));
 			break;
 		case FSIDSOURCE_UUID:
-			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
+			if (unlikely(is_btrfs_subvol)){
+				*p++ = cpu_to_be32(statfs.f_fsid.val[0]);
+				*p++ = cpu_to_be32(statfs.f_fsid.val[1]);
+				*p++ = cpu_to_be32(0);
+				*p++ = cpu_to_be32(0);
+			} else
+				p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
 			break;
 		}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index cb742e1..42e14d6 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -487,4 +487,47 @@ static inline int nfsd4_is_junction(struct dentry *dentry)
 
 #endif /* CONFIG_NFSD_V4 */
 
+/* btrfs subvol support */
+/*
+ * same logical as fs/btrfs is_subvolume_inode(struct inode *inode)
+ * #define BTRFS_FIRST_FREE_OBJECTID 256ULL
+ * #define BTRFS_SUPER_MAGIC       0x9123683E
+ */
+static inline bool d_is_btrfs_subvol(const struct dentry *dentry)
+{
+    bool ret = dentry->d_inode && unlikely(dentry->d_inode->i_ino == 256ULL) &&
+		dentry->d_sb && dentry->d_sb->s_magic == BTRFS_SUPER_MAGIC;
+	//printk(KERN_INFO "d_is_btrfs_subvol(%s)=%d\n", dentry->d_name.name, ret);
+	return ret;
+}
+#include <linux/namei.h>
+/* add btrfs subvol support that only used in nfsd */
+/* FIXME: free clone_private_mount()? */
+static inline int nfsd_follow_down(struct path *path)
+{
+	if(unlikely(d_is_btrfs_subvol(path->dentry))){
+		//struct dentry *mnt_root=path->dentry;
+		struct vfsmount *mounted = clone_private_mount(path);
+		if (mounted) {
+			//mounted->mnt_root=mnt_root;
+			//? dput(path->dentry);
+			//? mntput(path->mnt);
+			path->mnt = mounted;
+			path->dentry = dget(mounted->mnt_root);
+			return 0;
+		}
+	}
+	return follow_down(path);
+}
+/* add btrfs subvol support that only used in nfsd */
+/* FIXME: free clone_private_mount()? */
+static inline int nfsd_follow_up(struct path *path)
+{
+	printk(KERN_INFO "nfsd_follow_up(%s)\n", path->dentry->d_name.name);
+	if(unlikely(d_is_btrfs_subvol(path->dentry))){
+		return 0;
+	}
+	return follow_up(path);
+}
+
 #endif /* LINUX_NFSD_NFSD_H */
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 1ecacee..3ab9b7f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -65,9 +65,13 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 			    .dentry = dget(dentry)};
 	int err = 0;
 
-	err = follow_down(&path);
+	err = nfsd_follow_down(&path);
 	if (err < 0)
 		goto out;
+	if (unlikely(d_is_btrfs_subvol(dentry))){
+		path_put(&path);
+		goto out;
+	} else
 	if (path.mnt == exp->ex_path.mnt && path.dentry == dentry &&
 	    nfsd_mountpoint(dentry, exp) == 2) {
 		/* This is only a mountpoint in some other namespace */
@@ -114,7 +118,7 @@ static void follow_to_parent(struct path *path)
 {
 	struct dentry *dp;
 
-	while (path->dentry == path->mnt->mnt_root && follow_up(path))
+	while (path->dentry == path->mnt->mnt_root && nfsd_follow_up(path))
 		;
 	dp = dget_parent(path->dentry);
 	dput(path->dentry);
@@ -160,6 +164,8 @@ int nfsd_mountpoint(struct dentry *dentry, struct svc_export *exp)
 		return 1;
 	if (nfsd4_is_junction(dentry))
 		return 1;
+	if (d_is_btrfs_subvol(dentry))
+		return 1;
 	if (d_mountpoint(dentry))
 		/*
 		 * Might only be a mountpoint in a different namespace,
-- 
2.30.2


[-- Attachment #3: 0002-trace-nfsd-btrfs-subvol-support.txt --]
[-- Type: application/octet-stream, Size: 5897 bytes --]

From 639489a60b84f9d16955143f52fc6316205ac57a Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 17 Jun 2021 08:33:06 +0800
Subject: [PATCH] trace nfsd: btrfs subvol support

[  235.831136] set_version_and_fsid_type fsid_type=7
[  235.842483] nfsd_cross_mnt(test)=0
[  235.845882] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  235.854902] set_version_and_fsid_type fsid_type=6
[  235.859686] nfs_d_automount(test)
[  235.863069] nfsd_cross_mnt(test)=0
[  235.866478] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  235.875500] set_version_and_fsid_type fsid_type=6
[  239.204677] lookup_positive_unlocked(name=xfs2) dentry=xfs2
[  239.210311] nfsd_cross_mnt(xfs2)=0
[  239.213708] set_version_and_fsid_type fsid_type=6
[  239.218406] nfsd4_encode_dirent_fattr(/) FATTR4_WORD0_FSID=1  FATTR4_WORD1_MOUNTED_ON_FILEID=1
	why /?
[  239.227078] nfs_d_automount(xfs2)
	why?
[  239.230437] nfsd_cross_mnt(xfs2)=0
[  239.233838] nfsd: nfsd_lookup(fh 20: 00060001 2b031f7d c249fdd0 1aa84b8e 045d774a 00000000, xfs2)=/
[  239.242854] set_version_and_fsid_type fsid_type=6

[  373.332124] set_version_and_fsid_type fsid_type=7
[  373.337639] nfsd_cross_mnt(test)=0
[  373.341035] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  373.350047] set_version_and_fsid_type fsid_type=6
[  373.354781] nfs_d_automount(test)
[  373.358125] nfsd_cross_mnt(test)=0
[  373.361524] nfsd: nfsd_lookup(fh 28: 00070001 00440001 00000000 73fb4b0a 31596b2e 7be9789b, test)=/
[  373.370537] set_version_and_fsid_type fsid_type=6
[  377.521908] lookup_positive_unlocked(name=sub1) dentry=sub1
[  377.527477] nfsd_cross_mnt(sub1)=0
[  377.530879] set_version_and_fsid_type fsid_type=6
[  377.535572] nfsd4_encode_dirent_fattr(sub1) FATTR4_WORD0_FSID=1  FATTR4_WORD1_MOUNTED_ON_FILEID=1
[  377.544420] lookup_positive_unlocked(name=.snapshot) dentry=.snapshot

btrfs subvols =>force crossmnt
	subvol nfs/umount: os shutdown or manual nfs/umount?
		special status(BTRFS_LAST_FREE_OBJECTID,only return to nfs)?
		#define BTRFS_LAST_FREE_OBJECTID -256ULL
		(struct file )->(struct inode *f_inode)->(struct super_block *i_sb;)->(unsigned long s_magic)
btrfs->xfs	=>still need crossmnt
xfs->btrfs	=>still need crossmnt

NFSEXP_CROSSMOUNT
NFSD_JUNCTION_XATTR_NAME
AT_NO_AUTOMOUNT
NFS_ATTR_FATTR_MOUNTPOINT
S_AUTOMOUNT
---
 fs/nfs/dir.c       | 2 ++
 fs/nfs/namespace.c | 1 +
 fs/nfsd/nfs4xdr.c  | 5 +++++
 fs/nfsd/nfsfh.c    | 1 +
 fs/nfsd/vfs.c      | 5 +++++
 5 files changed, 14 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c837675..975440d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1799,6 +1799,8 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 
 	if (!(flags & LOOKUP_OPEN) || (flags & LOOKUP_DIRECTORY))
 		goto full_reval;
+	if (dentry->d_inode && dentry->d_inode->i_ino == 256ULL && dentry->d_sb)
+		printk(KERN_INFO "nfs4_do_lookup_revalidate(%s)=%lx\n", dentry->d_name.name, dentry->d_sb->s_magic);
 	if (d_mountpoint(dentry))
 		goto full_reval;
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 2bcbe38..f69715c 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -152,6 +152,7 @@ struct vfsmount *nfs_d_automount(struct path *path)
 	int timeout = READ_ONCE(nfs_mountpoint_expiry_timeout);
 	int ret;
 
+	printk(KERN_INFO "nfs_d_automount(%s)\n", path->dentry->d_name.name);
 	if (IS_ROOT(path->dentry))
 		return ERR_PTR(-ESTALE);
 
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 6255b06..257ee17 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3307,6 +3307,7 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 	dentry = lookup_positive_unlocked(name, cd->rd_fhp->fh_dentry, namlen);
 	if (IS_ERR(dentry))
 		return nfserrno(PTR_ERR(dentry));
+	printk(KERN_INFO "lookup_positive_unlocked(name=%s) dentry=%s\n", name, dentry->d_name.name);
 
 	exp_get(exp);
 	/*
@@ -3345,6 +3346,10 @@ nfsd4_encode_dirent_fattr(struct xdr_stream *xdr, struct nfsd4_readdir *cd,
 out_put:
 	dput(dentry);
 	exp_put(exp);
+	printk(KERN_INFO "nfsd4_encode_dirent_fattr(%s) FATTR4_WORD0_FSID=%d  FATTR4_WORD1_MOUNTED_ON_FILEID=%d\n",
+		 dentry->d_name.name,
+		 !!(cd->rd_bmval[0]&FATTR4_WORD0_FSID),
+		!!(cd->rd_bmval[1]&FATTR4_WORD1_MOUNTED_ON_FILEID));
 	return nfserr;
 }
 
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index c81dbba..28eaea3 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -530,6 +530,7 @@ static void set_version_and_fsid_type(struct svc_fh *fhp, struct svc_export *exp
 	fhp->fh_handle.fh_version = version;
 	if (version)
 		fhp->fh_handle.fh_fsid_type = fsid_type;
+	printk(KERN_INFO "set_version_and_fsid_type fsid_type=%d\n", fsid_type);
 }
 
 __be32
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ae34ffc..6c55010 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -66,6 +66,8 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	int err = 0;
 
 	err = nfsd_follow_down(&path);
+	printk(KERN_INFO "follow_down()=%d path.mnt=%s path.dentry=%s\n", err,
+		path.mnt->mnt_root->d_name.name, path.dentry->d_name.name);
 	if (err < 0)
 		goto out;
 	if (unlikely(d_is_btrfs_subvol(dentry))){
@@ -111,6 +113,7 @@ nfsd_cross_mnt(struct svc_rqst *rqstp, struct dentry **dpp,
 	path_put(&path);
 	exp_put(exp2);
 out:
+	printk(KERN_INFO "nfsd_cross_mnt(%s)=%d\n", dentry->d_name.name, err);
 	return err;
 }
 
@@ -233,9 +236,11 @@ nfsd_lookup_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 	*dentry_ret = dentry;
 	*exp_ret = exp;
+	// printk(KERN_INFO "nfsd: nfsd_lookup(fh %s, %.*s)=%s\n", SVCFH_fmt(fhp), len, name, dentry->d_name.name);
 	return 0;
 
 out_nfserr:
+	// printk(KERN_INFO "nfsd: nfsd_lookup(fh %s, %.*s) error\n", SVCFH_fmt(fhp), len, name);
 	exp_put(exp);
 	return nfserrno(host_err);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-21 22:41                     ` Wang Yugui
@ 2021-06-22 17:34                       ` Frank Filz
  2021-06-22 22:48                         ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: Frank Filz @ 2021-06-22 17:34 UTC (permalink / raw)
  To: Wang Yugui, Frank Filz; +Cc: 'NeilBrown', linux-nfs, Frank Filz

On 6/21/21 3:41 PM, Wang Yugui wrote:
> Hi,
>
>> OK thanks for the information. I think they will just work in nfs-ganesha as
>> long as the snapshots or subvols are mounted within an nfs-ganesha export or
>> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
>> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
>> boundary will lookup the filesystem via getmntend and listing btrfs subvols
>> and then expose that filesystem (via the fsid attribute) to the clients
>> where at least the Linux nfs client will detect a filesystem boundary and
>> create a new mount entry for it.
>
> Not only exported explicitly, but also kept in the same hierarchy.
>
> If we export
> /mnt/test		#the btrfs
> /mnt/test/sub1	# the btrfs subvol 1
> /mnt/test/sub2	# the btrfs subvol 2
>
> we need to make sure we will not access '/mnt/test/sub1' through '/mnt/test'
> from nfs client.
>
> current safe export:
> #/mnt/test		#the btrfs, not exported
> /mnt/test/sub1	# the btrfs subvol 1
> /mnt/test/sub2	# the btrfs subvol 2
>

What's the problem with exporting /mnt/test AND then exporting sub1 and 
sub2 as crossmnt exports? As far as I can tell, that seems to work just 
fine with nfs-ganesha.

Frank


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-22 17:34                       ` Frank Filz
@ 2021-06-22 22:48                         ` Wang Yugui
  0 siblings, 0 replies; 94+ messages in thread
From: Wang Yugui @ 2021-06-22 22:48 UTC (permalink / raw)
  To: Frank Filz; +Cc: Frank Filz, 'NeilBrown', linux-nfs

Hi,

> On 6/21/21 3:41 PM, Wang Yugui wrote:
> > Hi,
> >
> >> OK thanks for the information. I think they will just work in nfs-ganesha as
> >> long as the snapshots or subvols are mounted within an nfs-ganesha export or
> >> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
> >> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
> >> boundary will lookup the filesystem via getmntend and listing btrfs subvols
> >> and then expose that filesystem (via the fsid attribute) to the clients
> >> where at least the Linux nfs client will detect a filesystem boundary and
> >> create a new mount entry for it.
> >
> > Not only exported explicitly, but also kept in the same hierarchy.
> >
> > If we export
> > /mnt/test		#the btrfs
> > /mnt/test/sub1	# the btrfs subvol 1
> > /mnt/test/sub2	# the btrfs subvol 2
> >
> > we need to make sure we will not access '/mnt/test/sub1' through '/mnt/test'
> > from nfs client.
> >
> > current safe export:
> > #/mnt/test		#the btrfs, not exported
> > /mnt/test/sub1	# the btrfs subvol 1
> > /mnt/test/sub2	# the btrfs subvol 2
> >
> 
> What's the problem with exporting /mnt/test AND then exporting sub1 and sub2 as crossmnt exports? As far as I can tell, that seems to work just fine with nfs-ganesha.

I'm not sure what will happen on nfs-ganesha.

crossmnt (kernel nfsd) failed to work when exporting /mnt/test,
/mnt/test/sub1, and /mnt/test/sub2.

# /bin/find /nfs/test/
/nfs/test/
find: File system loop detected; ‘/nfs/test/sub1’ is part of the same file system loop as ‘/nfs/test/’.
/nfs/test/.snapshot
find: File system loop detected; ‘/nfs/test/.snapshot/sub1-s1’ is part of the same file system loop as ‘/nfs/test/’.
find: File system loop detected; ‘/nfs/test/.snapshot/sub2-s1’ is part of the same file system loop as ‘/nfs/test/’.
/nfs/test/dir1
/nfs/test/dir1/a.txt
find: File system loop detected; ‘/nfs/test/sub2’ is part of the same file system loop as ‘/nfs/test/’

/bin/find reports 'File system loop detected'; does this mean that the
vfs cache (based on st_dev + st_ino?) on the nfs client side will have
some problem?

In fact, I have been exporting /mnt/test for years too,
but btrfs subvols mean multiple filesystems (different st_dev); in theory, we
need to access them through nfs crossmnt.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/23


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-22  7:14                       ` Wang Yugui
@ 2021-06-23  0:59                         ` NeilBrown
  2021-06-23  6:14                           ` Wang Yugui
  2021-06-23 15:35                           ` J. Bruce Fields
  0 siblings, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-23  0:59 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Tue, 22 Jun 2021, Wang Yugui wrote:
> > 
> > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> 
> btrfs subvol crossmnt begin to work, although buggy.
> 
> some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> wrongly crossmnt-ed
> 
> 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> to happen.
> 
> This is the current patch based on 5.10.44. 
> At least nfsd_follow_up() is buggy.
> 

I don't think the approach you are taking makes sense.  Let me explain
why.

The problem is that applications on the NFS client can see different
files or directories on the same (apparent) filesystem with the same
inode number.  Most applications won't care and NFS itself doesn't get
confused by the duplicate inode numbers, but 'find' and similar programs
(probably 'tar' for example) do get upset.

This happens because BTRFS reuses inode numbers in subvols which it
presents to the kernel as all part of the one filesystem (or at least,
all part of the one mount point).  NFSD only sees one filesystem, and so
reports the same filesystem-id (fsid) for all objects.  The NFS client
then sees that the fsid is the same and tells applications that the
objects are all in the one filesystem.

To fix this, we need to make sure that nfsd reports a different fsid for
objects in different subvols.  There are two obvious ways to do this.

One is to teach nfsd to recognize btrfs subvolumes exactly like separate
filesystems (as nfsd already ensures each filesystem gets its own fsid).
This is the approach of my first suggestion.  It requires changing
nfsd_mountpoint() and follow_up() and any other code that is aware of
different filesystems.  As I mentioned, it also requires changing mountd
to be able to extract a list of subvols from btrfs because they don't
appear in /proc/mounts.  

As you might know an NFS filehandle has 3 parts: a header, a filesystem
identifier, and an inode identifier.  This approach would involve giving
different subvols different filesystem identifiers in the filehandle.
This, it turns out, is a very big change - bigger than I at first
imagined.

The second obvious approach is to leave the filehandles unchanged and to
continue to treat an entire btrfs filesystem as a single filesystem
EXCEPT when reporting the fsid to the NFS client.  All we *really* need
to do is make sure the client sees a different fsid when it enters a
part of the filesystem which re-uses inode numbers.  This is what my
latest patch did.

Your patch seems to combine ideas from both approaches.  It includes my
code to replace the fsid, but also intercepts follow_up etc.  This
cannot be useful.

As I noted when I posted it, there is a problem with my patch.  I now
understand that problem.

When NFS sees that fsid change it needs to create 2 inodes for that
directory.  One inode will be in the parent filesystem and will be
marked as an auto-mount point so that any lookup below that directory
will trigger an internal mount.  The other inode is the root of the
child filesystem.  It gets mounted on the first inode.

With normal filesystem mounts, there really is an inode in the parent
filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
attribute.  This fileid will be different from all other inode numbers
in the parent filesystem.

With BTRFS there is no inode in the parent volume (as far as I know) so
there is nothing useful to return for MOUNTED_ON_FILEID.  This results
in NFS using the same inode number for the inode in the parent
filesystem as the inode in the child filesystem.  For btrfs, this will
be 256.  As there is already an inode in the parent filesystem with inum
256, 'find' complains.

The following patch addresses this by adding code to nfsd when it
determines MOUNTED_ON_FILEID to choose a number that should be unused
in btrfs.  With this change, 'find' seems to work correctly with NFSv4
mounts of btrfs.

This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
attribute - strictly speaking, the NFSv3 protocol doesn't support
crossing mount points, though the Linux implementation does allow it.

So this patch works and, I think, is the best we can do in terms of
functionality.  I don't like the details of the implementation though.
It requires NFSD to know too much about BTRFS internals.

I think I would like btrfs to make it clear where a subvol started,
maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
hint, not a promise of anything, so other code should not get confused.
This would still have nfsd calling vfs_statfs quite often ....  maybe that
isn't such a big deal.

More importantly, there needs to be some way for NFSD to find an inode
number to report for the MOUNTED_ON_FILEID.  This needs to be a number
not used elsewhere in the filesystem.  It might be safe to use the
same fileid for all subvols (as my patch currently does), but we would
need to confirm that 'find' and 'tar' don't complain about that or
mishandle it.  If it is safe to use the same fileid, then a new field in
the superblock to store it might work.  If a different fileid is needed,
then we might need a new field in 'struct kstatfs', so vfs_statfs can
report it.

Anyway, here is my current patch.  It includes support for NFSv3 as well
as NFSv4.

NeilBrown

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/module.h>
+#include <linux/statfs.h>
 #include <linux/exportfs.h>
 #include <linux/sunrpc/svc_xprt.h>
 
@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	int err;
 	struct auth_domain *dom = NULL;
 	struct svc_export exp = {}, *expp;
+	struct kstatfs statfs;
 	int an_int;
 
 	if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	err = kern_path(buf, 0, &exp.ex_path);
 	if (err)
 		goto out1;
+	err = vfs_statfs(&exp.ex_path, &statfs);
+	if (err)
+		goto out3;
+	exp.ex_fsid64 = statfs.f_fsid;
 
 	exp.ex_client = dom;
 	exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->ex_fsid64 = item->ex_fsid64;
 	new->ex_devid_map = item->ex_devid_map;
 	item->ex_devid_map = NULL;
 	new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
+	__kernel_fsid_t		ex_fsid64;
 	unsigned char *		ex_uuid; /* 16 byte fsid */
 	struct nfsd4_fs_locations ex_fslocs;
 	uint32_t		ex_nflavors;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..f11ba3434fd6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 	case FSIDSOURCE_FSID:
 		fsid = (u64)fhp->fh_export->ex_fsid;
 		break;
-	case FSIDSOURCE_UUID:
+	case FSIDSOURCE_UUID: {
+		struct kstatfs statfs;
+
 		fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
 		fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
+		if (fh_getstafs(fhp, &statfs) == 0 &&
+		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
+			/* looks like a btrfs subvol */
+			fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
 		break;
+		}
 	default:
 		fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
 	}
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5f614d1b362e 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -42,6 +42,7 @@
 #include <linux/sunrpc/svcauth_gss.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/xattr.h>
+#include <linux/btrfs_tree.h>
 #include <uapi/linux/xattr.h>
 
 #include "idmap.h"
@@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if (err)
 		goto out_nfserr;
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+		       FATTR4_WORD0_FSID |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
+		       FATTR4_WORD1_MOUNTED_ON_FILEID |
 		       FATTR4_WORD1_SPACE_TOTAL))) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
@@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		case FSIDSOURCE_UUID:
 			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
+			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+				/* looks like a btrfs subvol */
+				p[-2] ^= statfs.f_fsid.val[0];
+				p[-1] ^= statfs.f_fsid.val[1];
+			}
 			break;
 		}
 	}
@@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 				goto out_nfserr;
 			ino = parent_stat.ino;
 		}
+		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
+		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
+			    /* btrfs subvol pseudo mount point */
+			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
+
 		p = xdr_encode_hyper(p, ino);
 	}
 #ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index b21b76e6b9a8..82b76b0b7bec 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 				    AT_STATX_SYNC_AS_STAT));
 }
 
+static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
+{
+	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+			 .dentry = fh->fh_dentry};
+	return nfserrno(vfs_statfs(&p, statfs));
+}
+
 static inline int nfsd_create_is_exclusive(int createmode)
 {
 	return createmode == NFS3_CREATE_EXCLUSIVE

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23  0:59                         ` NeilBrown
@ 2021-06-23  6:14                           ` Wang Yugui
  2021-06-23  6:29                             ` NeilBrown
  2021-06-23 15:35                           ` J. Bruce Fields
  1 sibling, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-23  6:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 15976 bytes --]

Hi,

This patch works very well. Thanks a lot.
-  crossmnt of btrfs subvol works as expected.
-  nfs/umount subvol works well.
-  pseudo mount point inode(255) is good.

I tested it on 5.10.45 with a few minor rebases.
( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
just fs/nfsd/nfs3xdr.c rebase)

But when I tested it with another btrfs filesystem without subvols but
with more data, 'find /nfs/test' caused an OOPS, and this OOPS does not
happen without this patch.

The data in this filesystem was created/left by xfstests (FSTYP=nfs,
TEST_DEV).

#nfs4 option: default mount.nfs4, nfs-utils-2.3.3

# access btrfs directly
$ find /mnt/test | wc -l
6612

# access btrfs through nfs
$ find /nfs/test | wc -l

[  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
[  466.172123] #PF: supervisor read access in kernel mode
[  466.177857] #PF: error_code(0x0000) - not-present page
[  466.183601] PGD 0 P4D 0
[  466.186443] Oops: 0000 [#1] SMP NOPTI
[  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
[  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
[  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.278779] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.287823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.302222] Call Trace:
[  466.304970]  nfsd4_encode_fattr+0x15ac/0x1940 [nfsd]
[  466.310557]  ? btrfs_verify_level_key+0xad/0xf0 [btrfs]
[  466.316413]  ? btrfs_search_slot+0x8e3/0x900 [btrfs]
[  466.321973]  nfsd4_encode_dirent+0x160/0x3b0 [nfsd]
[  466.327434]  nfsd_readdir+0x199/0x240 [nfsd]
[  466.332215]  ? nfsd4_encode_getattr+0x30/0x30 [nfsd]
[  466.337771]  ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
[  466.343714]  ? security_prepare_creds+0x6f/0xa0
[  466.348788]  nfsd4_encode_readdir+0xd9/0x1c0 [nfsd]
[  466.354250]  nfsd4_encode_operation+0x9b/0x1b0 [nfsd]
[  466.360430]  nfsd4_proc_compound+0x4e3/0x710 [nfsd]
[  466.366352]  nfsd_dispatch+0xd4/0x180 [nfsd]
[  466.371620]  svc_process_common+0x392/0x6c0 [sunrpc]
[  466.377650]  ? svc_recv+0x3c4/0x8a0 [sunrpc]
[  466.382883]  ? nfsd_svc+0x300/0x300 [nfsd]
[  466.387908]  ? nfsd_destroy+0x60/0x60 [nfsd]
[  466.393126]  svc_process+0xb7/0xf0 [sunrpc]
[  466.398234]  nfsd+0xe8/0x140 [nfsd]
[  466.402555]  kthread+0x116/0x130
[  466.406579]  ? kthread_park+0x80/0x80
[  466.411091]  ret_from_fork+0x1f/0x30
[  466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_m
sr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore j
oydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma
_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200
 drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_
clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_mu
ltipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev
[  466.499551] CR2: 0000000000000004
[  466.503759] ---[ end trace 91eb52bf0cb65801 ]---
[  466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.588487] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.598032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.613467] Kernel panic - not syncing: Fatal exception
[  466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffff
fffbfffffff)
[  466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]---


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/23

> On Tue, 22 Jun 2021, Wang Yugui wrote:
> > > 
> > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> > 
> > btrfs subvol crossmnt begin to work, although buggy.
> > 
> > some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> > wrongly crossmnt-ed
> > 
> > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> > to happen.
> > 
> > This is the current patch based on 5.10.44. 
> > At least nfsd_follow_up() is buggy.
> > 
> 
> I don't think the approach you are taking makes sense.  Let me explain
> why.
> 
> The problem is that applications on the NFS client can see different
> files or directories on the same (apparent) filesystem with the same
> inode number.  Most applications won't care and NFS itself doesn't get
> confused by the duplicate inode numbers, but 'find' and similar programs
> (probably 'tar' for example) do get upset.
> 
> This happens because BTRFS reuses inode numbers in subvols which it
> presents to the kernel as all part of the one filesystem (or at least,
> all part of the one mount point).  NFSD only sees one filesystem, and so
> reports the same filesystem-id (fsid) for all objects.  The NFS client
> then sees that the fsid is the same and tells applications that the
> objects are all in the one filesystem.
> 
> To fix this, we need to make sure that nfsd reports a different fsid for
> objects in different subvols.  There are two obvious ways to do this.
> 
> One is to teach nfsd to recognize btrfs subvolumes exactly like separate
> filesystems (as nfsd already ensures each filesystem gets its own fsid).
> This is the approach of my first suggestion.  It requires changing
> nfsd_mountpoint() and follow_up() and any other code that is aware of
> different filesystems.  As I mentioned, it also requires changing mountd
> to be able to extract a list of subvols from btrfs because they don't
> appear in /proc/mounts.  
> 
> As you might know an NFS filehandle has 3 parts: a header, a filesystem
> identifier, and an inode identifier.  This approach would involve giving
> different subvols different filesystem identifiers in the filehandle.
> This, it turns out, is a very big change - bigger than I at first
> imagined.
> 
> The second obvious approach is to leave the filehandles unchanged and to
> continue to treat an entire btrfs filesystem as a single filesystem
> EXCEPT when reporting the fsid to the NFS client.  All we *really* need
> to do is make sure the client sees a different fsid when it enters a
> part of the filesystem which re-uses inode numbers.  This is what my
> latest patch did.
> 
> Your patch seems to combine ideas from both approaches.  It includes my
> code to replace the fsid, but also intercepts follow_up etc.  This
> cannot be useful.
> 
> As I noted when I posted it, there is a problem with my patch.  I now
> understand that problem.
> 
> When NFS sees that fsid change it needs to create 2 inodes for that
> directory.  One inode will be in the parent filesystem and will be
> marked as an auto-mount point so that any lookup below that directory
> will trigger an internal mount.  The other inode is the root of the
> child filesystem.  It gets mounted on the first inode.
> 
> With normal filesystem mounts, there really is an inode in the parent
> filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
> attribute.  This fileid will be different from all other inode numbers
> in the parent filesystem.
> 
> With BTRFS there is no inode in the parent volume (as far as I know) so
> there is nothing useful to return for MOUNTED_ON_FILEID.  This results
> in NFS using the same inode number for the inode in the parent
> filesystem as the inode in the child filesystem.  For btrfs, this will
> be 256.  As there is already an inode in the parent filesystem with inum
> 256, 'find' complains.
> 
> The following patch addresses this by adding code to nfsd when it
> determines MOUNTED_ON_FILEID to choose a number that should be unused
> in btrfs.  With this change, 'find' seems to work correctly with NFSv4
> mounts of btrfs.
> 
> This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
> attribute - strictly speaking, the NFSv3 protocol doesn't support
> crossing mount points, though the Linux implementation does allow it.
> 
> So this patch works and, I think, is the best we can do in terms of
> functionality.  I don't like the details of the implementation though.
> It requires NFSD to know too much about BTRFS internals.
> 
> I think I would like btrfs to make it clear where a subvol started,
> maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
> hint, not a promise of anything, so other code should not get confused.
> This would still have nfsd calling vfs_statfs quite often ....  maybe that
> isn't such a big deal.
> 
> More importantly, there needs to be some way for NFSD to find an inode
> number to report for the MOUNTED_ON_FILEID.  This needs to be a number
> not used elsewhere in the filesystem.  It might be safe to use the
> same fileid for all subvols (as my patch currently does), but we would
> need to confirm that 'find' and 'tar' don't complain about that or
> mishandle it.  If it is safe to use the same fileid, then a new field in
> the superblock to store it might work.  If a different fileid is needed,
> then we might need a new field in 'struct kstatfs', so vfs_statfs can
> report it.
> 
> Anyway, here is my current patch.  It includes support for NFSv3 as well
> as NFSv4.
> 
> NeilBrown
> 
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
>  #include <linux/slab.h>
>  #include <linux/namei.h>
>  #include <linux/module.h>
> +#include <linux/statfs.h>
>  #include <linux/exportfs.h>
>  #include <linux/sunrpc/svc_xprt.h>
>  
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	int err;
>  	struct auth_domain *dom = NULL;
>  	struct svc_export exp = {}, *expp;
> +	struct kstatfs statfs;
>  	int an_int;
>  
>  	if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	err = kern_path(buf, 0, &exp.ex_path);
>  	if (err)
>  		goto out1;
> +	err = vfs_statfs(&exp.ex_path, &statfs);
> +	if (err)
> +		goto out3;
> +	exp.ex_fsid64 = statfs.f_fsid;
>  
>  	exp.ex_client = dom;
>  	exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
>  	new->ex_anon_uid = item->ex_anon_uid;
>  	new->ex_anon_gid = item->ex_anon_gid;
>  	new->ex_fsid = item->ex_fsid;
> +	new->ex_fsid64 = item->ex_fsid64;
>  	new->ex_devid_map = item->ex_devid_map;
>  	item->ex_devid_map = NULL;
>  	new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> +	__kernel_fsid_t		ex_fsid64;
>  	unsigned char *		ex_uuid; /* 16 byte fsid */
>  	struct nfsd4_fs_locations ex_fslocs;
>  	uint32_t		ex_nflavors;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..f11ba3434fd6 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  	case FSIDSOURCE_FSID:
>  		fsid = (u64)fhp->fh_export->ex_fsid;
>  		break;
> -	case FSIDSOURCE_UUID:
> +	case FSIDSOURCE_UUID: {
> +		struct kstatfs statfs;
> +
>  		fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
>  		fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
> +		if (fh_getstafs(fhp, &statfs) == 0 &&
> +		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
> +			/* looks like a btrfs subvol */
> +			fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
>  		break;
> +		}
>  	default:
>  		fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
>  	}
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..5f614d1b362e 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -42,6 +42,7 @@
>  #include <linux/sunrpc/svcauth_gss.h>
>  #include <linux/sunrpc/addr.h>
>  #include <linux/xattr.h>
> +#include <linux/btrfs_tree.h>
>  #include <uapi/linux/xattr.h>
>  
>  #include "idmap.h"
> @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  	if (err)
>  		goto out_nfserr;
>  	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> +		       FATTR4_WORD0_FSID |
>  			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
>  	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> +		       FATTR4_WORD1_MOUNTED_ON_FILEID |
>  		       FATTR4_WORD1_SPACE_TOTAL))) {
>  		err = vfs_statfs(&path, &statfs);
>  		if (err)
> @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		case FSIDSOURCE_UUID:
>  			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
>  								EX_UUID_LEN);
> +			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> +				/* looks like a btrfs subvol */
> +				p[-2] ^= statfs.f_fsid.val[0];
> +				p[-1] ^= statfs.f_fsid.val[1];
> +			}
>  			break;
>  		}
>  	}
> @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  				goto out_nfserr;
>  			ino = parent_stat.ino;
>  		}
> +		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
> +		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
> +			    /* btrfs subvol pseudo mount point */
> +			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
> +
>  		p = xdr_encode_hyper(p, ino);
>  	}
>  #ifdef CONFIG_NFSD_PNFS
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index b21b76e6b9a8..82b76b0b7bec 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
>  				    AT_STATX_SYNC_AS_STAT));
>  }
>  
> +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
> +{
> +	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> +			 .dentry = fh->fh_dentry};
> +	return nfserrno(vfs_statfs(&p, statfs));
> +}
> +
>  static inline int nfsd_create_is_exclusive(int createmode)
>  {
>  	return createmode == NFS3_CREATE_EXCLUSIVE


[-- Attachment #2: 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch --]
[-- Type: application/octet-stream, Size: 10440 bytes --]

From 7f674853edde79f37589586ef219b8650e409677 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 23 Jun 2021 10:59:00 +1000
Subject: [PATCH] any idea about auto export multiple btrfs snapshots?

On Tue, 22 Jun 2021, Wang Yugui wrote:
> >
> > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
>
> btrfs subvol crossmnt begin to work, although buggy.
>
> some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> wrongly crossmnt-ed
>
> 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> to happen.
>
> This is the current patch based on 5.10.44.
> At least nfsd_follow_up() is buggy.
>

I don't think the approach you are taking makes sense.  Let me explain
why.

The problem is that applications on the NFS client can see different
files or directories on the same (apparent) filesystem with the same
inode number.  Most applications won't care and NFS itself doesn't get
confused by the duplicate inode numbers, but 'find' and similar programs
(probably 'tar' for example) do get upset.

This happens because BTRFS reuses inode numbers in subvols which it
presents to the kernel as all part of the one filesystem (or at least,
all part of the one mount point).  NFSD only sees one filesystem, and so
reports the same filesystem-id (fsid) for all objects.  The NFS client
then sees that the fsid is the same and tells applications that the
objects are all in the one filesystem.

To fix this, we need to make sure that nfsd reports a different fsid for
objects in different subvols.  There are two obvious ways to do this.

One is to teach nfsd to recognize btrfs subvolumes exactly like separate
filesystems (as nfsd already ensures each filesystem gets its own fsid).
This is the approach of my first suggestion.  It requires changing
nfsd_mountpoint() and follow_up() and any other code that is aware of
different filesystems.  As I mentioned, it also requires changing mountd
to be able to extract a list of subvols from btrfs because they don't
appear in /proc/mounts.

As you might know an NFS filehandle has 3 parts: a header, a filesystem
identifier, and an inode identifier.  This approach would involve giving
different subvols different filesystem identifiers in the filehandle.
This, it turns out, is a very big change - bigger than I at first
imagined.

The second obvious approach is to leave the filehandles unchanged and to
continue to treat an entire btrfs filesystem as a single filesystem
EXCEPT when reporting the fsid to the NFS client.  All we *really* need
to do is make sure the client sees a different fsid when it enters a
part of the filesystem which re-uses inode numbers.  This is what my
latest patch did.

Your patch seems to combine ideas from both approaches.  It includes my
code to replace the fsid, but also intercepts follow_up etc.  This
cannot be useful.

As I noted when I posted it, there is a problem with my patch.  I now
understand that problem.

When NFS sees that fsid change it needs to create 2 inodes for that
directory.  One inode will be in the parent filesystem and will be
marked as an auto-mount point so that any lookup below that directory
will trigger an internal mount.  The other inode is the root of the
child filesystem.  It gets mounted on the first inode.

With normal filesystem mounts, there really is an inode in the parent
filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
attribute.  This fileid will be different from all other inode numbers
in the parent filesystem.

With BTRFS there is no inode in the parent volume (as far as I know) so
there is nothing useful to return for MOUNTED_ON_FILEID.  This results
in NFS using the same inode number for the inode in the parent
filesystem as the inode in the child filesystem.  For btrfs, this will
be 256.  As there is already an inode in the parent filesystem with inum
256, 'find' complains.

The following patch addresses this by adding code to nfsd when it
determines MOUNTED_ON_FILEID to choose a number that should be unused
in btrfs.  With this change, 'find' seems to work correctly with NFSv4
mounts of btrfs.

This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
attribute - strictly speaking, the NFSv3 protocol doesn't support
crossing mount points, though the Linux implementation does allow it.

So this patch works and, I think, is the best we can do in terms of
functionality.  I don't like the details of the implementation though.
It requires NFSD to know too much about BTRFS internals.

I think I would like btrfs to make it clear where a subvol started,
maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
hint, not a promise of anything, so other code should not get confused.
This would have nfsd calling vfs_statfs quite often ....  maybe that
isn't such a big deal.

More importantly, there needs to be some way for NFSD to find an inode
number to report for the MOUNTED_ON_FILEID.  This needs to be a number
not used elsewhere in the filesystem.  It might be safe to use the
same fileid for all subvols (as my patch currently does), but we would
need to confirm that 'find' and 'tar' don't complain about that or
mishandle it.  If it is safe to use the same fileid, then a new field in
the superblock to store it might work.  If a different fileid is needed,
then we might need a new field in 'struct kstatfs', so vfs_statfs can
report it.

Anyway, here is my current patch.  It includes support for NFSv3 as well
as NFSv4.

NeilBrown
---
 fs/nfsd/export.c  |  7 +++++++
 fs/nfsd/export.h  |  1 +
 fs/nfsd/nfs3xdr.c | 10 +++++++++-
 fs/nfsd/nfs4xdr.c | 15 +++++++++++++++
 fs/nfsd/vfs.h     |  7 +++++++
 5 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/module.h>
+#include <linux/statfs.h>
 #include <linux/exportfs.h>
 #include <linux/sunrpc/svc_xprt.h>
 
@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	int err;
 	struct auth_domain *dom = NULL;
 	struct svc_export exp = {}, *expp;
+	struct kstatfs statfs;
 	int an_int;
 
 	if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	err = kern_path(buf, 0, &exp.ex_path);
 	if (err)
 		goto out1;
+	err = vfs_statfs(&exp.ex_path, &statfs);
+	if (err)
+		goto out3;
+	exp.ex_fsid64 = statfs.f_fsid;
 
 	exp.ex_client = dom;
 	exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->ex_fsid64 = item->ex_fsid64;
 	new->ex_devid_map = item->ex_devid_map;
 	item->ex_devid_map = NULL;
 	new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
+	__kernel_fsid_t		ex_fsid64;
 	unsigned char *		ex_uuid; /* 16 byte fsid */
 	struct nfsd4_fs_locations ex_fslocs;
 	uint32_t		ex_nflavors;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..f11ba3434fd6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -153,11 +153,19 @@ static __be32 *encode_fsid(__be32 *p, struct svc_fh *fhp)
 	case FSIDSOURCE_FSID:
 		p = xdr_encode_hyper(p, (u64) fhp->fh_export->ex_fsid);
 		break;
-	case FSIDSOURCE_UUID:
+	case FSIDSOURCE_UUID: {
+		struct kstatfs statfs;
+
 		f = ((u64*)fhp->fh_export->ex_uuid)[0];
 		f ^= ((u64*)fhp->fh_export->ex_uuid)[1];
+		if (fh_getstafs(fhp, &statfs) == 0 &&
+		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
+			/* looks like a btrfs subvol */
+			f = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
 		p = xdr_encode_hyper(p, f);
 		break;
+		}
 	}
 	return p;
 }
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5f614d1b362e 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -42,6 +42,7 @@
 #include <linux/sunrpc/svcauth_gss.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/xattr.h>
+#include <linux/btrfs_tree.h>
 #include <uapi/linux/xattr.h>
 
 #include "idmap.h"
@@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if (err)
 		goto out_nfserr;
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+		       FATTR4_WORD0_FSID |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
+		       FATTR4_WORD1_MOUNTED_ON_FILEID |
 		       FATTR4_WORD1_SPACE_TOTAL))) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
@@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		case FSIDSOURCE_UUID:
 			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
+			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+				/* looks like a btrfs subvol */
+				p[-2] ^= statfs.f_fsid.val[0];
+				p[-1] ^= statfs.f_fsid.val[1];
+			}
 			break;
 		}
 	}
@@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 				goto out_nfserr;
 			ino = parent_stat.ino;
 		}
+		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
+		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
+			    /* btrfs subvol pseudo mount point */
+			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
+
 		p = xdr_encode_hyper(p, ino);
 	}
 #ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index b21b76e6b9a8..82b76b0b7bec 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 				    AT_STATX_SYNC_AS_STAT));
 }
 
+static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
+{
+	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+			 .dentry = fh->fh_dentry};
+	return nfserrno(vfs_statfs(&p, statfs));
+}
+
 static inline int nfsd_create_is_exclusive(int createmode)
 {
 	return createmode == NFS3_CREATE_EXCLUSIVE
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23  6:14                           ` Wang Yugui
@ 2021-06-23  6:29                             ` NeilBrown
  2021-06-23  9:34                               ` Wang Yugui
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-06-23  6:29 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Wed, 23 Jun 2021, Wang Yugui wrote:
> Hi,
> 
> This patch works very well. Thanks a lot.
> -  crossmnt of btrfs subvol works as expected.
> -  nfs/umount subvol works well.
> -  pseudo mount point inode(255) is good.
> 
> I test it in 5.10.45 with a few minor rebase.
> ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> just fs/nfsd/nfs3xdr.c rebase)
> 
> But when I tested it with another btrfs system without subvol but with
> more data, 'find /nfs/test' caused a OOPS .  and this OOPS will not
> happen just without this patch.
> 
> The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> TEST_DEV).
> 
> #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
> 
> # access btrfs directly
> $ find /mnt/test | wc -l
> 6612
> 
> # access btrfs through nfs
> $ find /nfs/test | wc -l
> 
> [  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> [  466.172123] #PF: supervisor read access in kernel mode
> [  466.177857] #PF: error_code(0x0000) - not-present page
> [  466.183601] PGD 0 P4D 0
> [  466.186443] Oops: 0000 [#1] SMP NOPTI
> [  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> [  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> [  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]

in nfsd4_encode_fattr there is code:

	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
		status = nfserr_jukebox;
		if (!tempfh)
			goto out;
		fh_init(tempfh, NFS4_FHSIZE);
		status = fh_compose(tempfh, exp, dentry, NULL);
		if (status)
			goto out;
		fhp = tempfh;
	}

Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
well.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23  6:29                             ` NeilBrown
@ 2021-06-23  9:34                               ` Wang Yugui
  2021-06-23 23:38                                 ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Wang Yugui @ 2021-06-23  9:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3858 bytes --]

Hi,

> On Wed, 23 Jun 2021, Wang Yugui wrote:
> > Hi,
> > 
> > This patch works very well. Thanks a lot.
> > -  crossmnt of btrfs subvol works as expected.
> > -  nfs/umount subvol works well.
> > -  pseudo mount point inode(255) is good.
> > 
> > I test it in 5.10.45 with a few minor rebase.
> > ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> > just fs/nfsd/nfs3xdr.c rebase)
> > 
> > But when I tested it with another btrfs system without subvol but with
> > more data, 'find /nfs/test' caused a OOPS .  and this OOPS will not
> > happen just without this patch.
> > 
> > The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> > TEST_DEV).
> > 
> > #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
> > 
> > # access btrfs directly
> > $ find /mnt/test | wc -l
> > 6612
> > 
> > # access btrfs through nfs
> > $ find /nfs/test | wc -l
> > 
> > [  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> > [  466.172123] #PF: supervisor read access in kernel mode
> > [  466.177857] #PF: error_code(0x0000) - not-present page
> > [  466.183601] PGD 0 P4D 0
> > [  466.186443] Oops: 0000 [#1] SMP NOPTI
> > [  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> > [  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> > [  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
> 
> in nfsd4_encode_fattr there is code:
> 
> 	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> 		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
> 		status = nfserr_jukebox;
> 		if (!tempfh)
> 			goto out;
> 		fh_init(tempfh, NFS4_FHSIZE);
> 		status = fh_compose(tempfh, exp, dentry, NULL);
> 		if (status)
> 			goto out;
> 		fhp = tempfh;
> 	}
> 
> Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
> well.
> 
> NeilBrown


It works well.

-	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
+	if (((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) ||
+		 (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID))
+		 && !fhp) {
 		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
 		status = nfserr_jukebox;
 		if (!tempfh)


And I tested some cases of statfs.f_fsid conflict between btrfs
filesystems, and it works well too.

Is it safe in theory too?

test case:
two btrfs filesystems with just 1 bit of difference in their UUIDs
# blkid /dev/sdb1 /dev/sdb2
/dev/sdb1: UUID="35327ecf-a5a7-4617-a160-1fdbfd644940" UUID_SUB="a831ebde-1e66-4592-bfde-7a86fd6478b5" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="3e30a849-88db-4fb3-92e6-b66bfbe9cb98"
/dev/sdb2: UUID="35327ecf-a5a7-4617-a160-1fdbfd644941" UUID_SUB="31e07d66-a656-48a8-b1fb-5b438565238e" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="93a2db85-065a-4ecf-89d4-6a8dcdb8ff99"

both have 3 subvols.
# btrfs subvolume list /mnt/test
ID 256 gen 13 top level 5 path sub1
ID 257 gen 13 top level 5 path sub2
ID 258 gen 13 top level 5 path sub3
# btrfs subvolume list /mnt/scratch
ID 256 gen 13 top level 5 path sub1
ID 257 gen 13 top level 5 path sub2
ID 258 gen 13 top level 5 path sub3


statfs.f_fsid.c is the source of the 'statfs' command.

# statfs /mnt/test/sub1 /mnt/test/sub2 /mnt/test/sub3 /mnt/scratch/sub1 /mnt/scratch/sub2 /mnt/scratch/sub3
/mnt/test/sub1
        f_fsid=0x9452611458c30e57
/mnt/test/sub2
        f_fsid=0x9452611458c30e56
/mnt/test/sub3
        f_fsid=0x9452611458c30e55
/mnt/scratch/sub1
        f_fsid=0x9452611458c30e56
/mnt/scratch/sub2
        f_fsid=0x9452611458c30e57
/mnt/scratch/sub3
        f_fsid=0x9452611458c30e54

statfs.f_fsid is unique inside a btrfs filesystem and its subvols,
but statfs.f_fsid is NOT unique between btrfs filesystems whose UUIDs
differ in just 1 bit.

'find /mnt/test/' and 'find /mnt/scratch/' works as expected.


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/23


[-- Attachment #2: statfs.f_fsid.c --]
[-- Type: application/octet-stream, Size: 698 bytes --]

#define __USE_LARGEFILE64

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/statfs.h>

int main(int argc, char **argv)
{
		struct statfs st;
		char *path;
		int ret=0;

		if (argc < 2) {
				fprintf(stderr, "Usage:%s path [path].. \n", argv[0]);
				return 1;
		}
		for (int i = 1; i < argc; ++i) {
				path = argv[i];
				if (statfs(path, &st) == 0 &&
					(st.f_fsid.__val[0] || st.f_fsid.__val[1])) {
						fprintf(stdout, "%s\n\tf_fsid=0x%08x%08x\n", path,
								st.f_fsid.__val[0], st.f_fsid.__val[1]);
				} else {
						fprintf(stdout, "%s\n\tstatfs error or null f_fsid.\n",path);
						ret=1;
				}
		}
		return ret;
}

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23  0:59                         ` NeilBrown
  2021-06-23  6:14                           ` Wang Yugui
@ 2021-06-23 15:35                           ` J. Bruce Fields
  2021-06-23 22:04                             ` NeilBrown
  1 sibling, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2021-06-23 15:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wang Yugui, linux-nfs

Is there any hope of solving this problem within btrfs?

It doesn't seem like it should have been that difficult for it to give
subvolumes separate superblocks and vfsmounts.

But this has come up before, and I think the answer may have been that
it's just too late to fix.

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23 15:35                           ` J. Bruce Fields
@ 2021-06-23 22:04                             ` NeilBrown
  2021-06-23 22:25                               ` J. Bruce Fields
  2021-06-24 21:58                               ` Patrick Goetz
  0 siblings, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-23 22:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Wang Yugui, linux-nfs

On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> Is there any hope of solving this problem within btrfs?
> 
> It doesn't seem like it should have been that difficult for it to give
> subvolumes separate superblocks and vfsmounts.
> 
> But this has come up before, and I think the answer may have been that
> it's just too late to fix.

It is never too late to do the right thing!

Probably the best approach to fixing this completely on the btrfs side
would be to copy the auto-mount approach used in NFS.  NFS sees multiple
different volumes on the server and transparently creates new vfsmounts,
using the automount infrastructure to mount and unmount them.  BTRFS
effective sees multiple volumes on the block device and could do the
same thing.

I can only think of one change to the user-space API (other than
/proc/mounts contents) that this would cause and I suspect it could be
resolved if needed.

Currently when you 'stat' the mountpoint of a btrfs subvol you see the
root of that subvol.  However when you 'stat' the mountpoint of an NFS
sub-filesystem (before any access below there) you see the mountpoint
(s_dev matches the parent).  This is how automounts are expected to work
and if btrfs were switched to use automounts for subvols, stating the
mountpoint would initially show the mountpoint, not the subvol root.

If this were seen to be a problem I doubt it would be hard to add
optional functionality to automount so that 'stat' triggers the mount.

All we really need is:
1/ someone to write the code
2/ someone to review the code
3/ someone to accept the code

How hard can it be :-)

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23 22:04                             ` NeilBrown
@ 2021-06-23 22:25                               ` J. Bruce Fields
  2021-06-23 23:29                                 ` NeilBrown
  2021-06-24 21:58                               ` Patrick Goetz
  1 sibling, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2021-06-23 22:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wang Yugui, linux-nfs

On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > Is there any hope of solving this problem within btrfs?
> > 
> > It doesn't seem like it should have been that difficult for it to give
> > subvolumes separate superblocks and vfsmounts.
> > 
> > But this has come up before, and I think the answer may have been that
> > it's just too late to fix.
> 
> It is never too late to do the right thing!
> 
> Probably the best approach to fixing this completely on the btrfs side
> would be to copy the auto-mount approach used in NFS.  NFS sees multiple
> different volumes on the server and transparently creates new vfsmounts,
> using the automount infrastructure to mount and unmount them.  BTRFS
> effective sees multiple volumes on the block device and could do the
> same thing.

Yes, that makes sense to me.

> I can only think of one change to the user-space API (other than
> /proc/mounts contents) that this would cause and I suspect it could be
> resolved if needed.
> 
> Currently when you 'stat' the mountpoint of a btrfs subvol you see the
> root of that subvol.  However when you 'stat' the mountpoint of an NFS
> sub-filesystem (before any access below there) you see the mountpoint
> (s_dev matches the parent).  This is how automounts are expected to work
> and if btrfs were switched to use automounts for subvols, stating the
> mountpoint would initially show the mountpoint, not the subvol root.
> 
> If this were seen to be a problem I doubt it would be hard to add
> optional functionality to automount so that 'stat' triggers the mount.

One other thing I'm not sure about: how do cold cache lookups of
filehandles for (possibly not-yet-mounted) subvolumes work?

> All we really need is:
> 1/ someone to write the code
> 2/ someone to review the code
> 3/ someone to accept the code

Hah.  Still, the special exceptions for btrfs seem to be accumulating.
I wonder if that's happening outside nfs as well.

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23 22:25                               ` J. Bruce Fields
@ 2021-06-23 23:29                                 ` NeilBrown
  2021-06-23 23:41                                   ` Frank Filz
  2021-06-24  0:01                                   ` J. Bruce Fields
  0 siblings, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-23 23:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Wang Yugui, linux-nfs

On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> 
> One other thing I'm not sure about: how do cold cache lookups of
> filehandles for (possibly not-yet-mounted) subvolumes work?

Ahhhh...  that's a good point.  Filehandle lookup depends on the target
filesystem being mounted.  NFS exporting filesystems which are
auto-mounted on demand would be ... interesting.

That argues in favour of nfsd treating a btrfs filesystem as a single
filesystem and gaining some knowledge about different subvolumes within
a filesystem.

This has implications for NFS re-export.  If a filehandle is received
for an NFS filesystem that needs to be automounted, I expect it would
fail.

Or do we want to introduce a third level in the filehandle: filesystem,
subvol, inode?  So just the "filesystem" is used to look things up in
/proc/mounts, but "filesystem+subvol" is used to determine the fsid.

Maybe another way to state this is that the filesystem could identify a
number of bytes from the fs-local part of the filehandle that should be
mixed in to the fsid.  That might be a reasonably clean interface.

> 
> > All we really need is:
> > 1/ someone to write the code
> > 2/ someone to review the code
> > 3/ someone to accept the code
> 
> Hah.  Still, the special exceptions for btrfs seem to be accumulating.
> I wonder if that's happening outside nfs as well.

I have some colleagues who work on btrfs and based on my occasional
discussions, I think that: yes, btrfs is a bit "special".  There are a
number of corner-cases where it doesn't quite behave how one would hope.
This is probably inevitable given the way it is pushing the boundaries
of functionality.  It can be a challenge to determine if that "hope" is
actually reasonable, and to figure out a good solution that meets the
need cleanly without imposing performance burdens elsewhere.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23  9:34                               ` Wang Yugui
@ 2021-06-23 23:38                                 ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-23 23:38 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-nfs

On Wed, 23 Jun 2021, Wang Yugui wrote:
> Hi,
> 
> > On Wed, 23 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > > 
> > > This patch works very well. Thanks a lot.
> > > -  crossmnt of btrfs subvol works as expected.
> > > -  nfs/umount subvol works well.
> > > -  pseudo mount point inode(255) is good.
> > > 
> > > I test it in 5.10.45 with a few minor rebase.
> > > ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> > > just fs/nfsd/nfs3xdr.c rebase)
> > > 
> > > But when I tested it with another btrfs system without subvol but with
> > > more data, 'find /nfs/test' caused a OOPS .  and this OOPS will not
> > > happen just without this patch.
> > > 
> > > The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> > > TEST_DEV).
> > > 
> > > #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
> > > 
> > > # access btrfs directly
> > > $ find /mnt/test | wc -l
> > > 6612
> > > 
> > > # access btrfs through nfs
> > > $ find /nfs/test | wc -l
> > > 
> > > [  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> > > [  466.172123] #PF: supervisor read access in kernel mode
> > > [  466.177857] #PF: error_code(0x0000) - not-present page
> > > [  466.183601] PGD 0 P4D 0
> > > [  466.186443] Oops: 0000 [#1] SMP NOPTI
> > > [  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> > > [  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> > > [  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
> > 
> > in nfsd4_encode_fattr there is code:
> > 
> > 	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> > 		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
> > 		status = nfserr_jukebox;
> > 		if (!tempfh)
> > 			goto out;
> > 		fh_init(tempfh, NFS4_FHSIZE);
> > 		status = fh_compose(tempfh, exp, dentry, NULL);
> > 		if (status)
> > 			goto out;
> > 		fhp = tempfh;
> > 	}
> > 
> > Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
> > well.
> > 
> > NeilBrown
> 
> 
> It works well.
> 
> -	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> +	if (((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) ||
> +		 (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID))
> +		 && !fhp) {
>  		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
>  		status = nfserr_jukebox;
>  		if (!tempfh)

Good. Thanks for testing.

> 
> 
> And I tested some case about statfs.f_fsid conflict between btrfs
> filesystem, and it works well too. 
> 
> Is it safe in theory too?

Probably .... :-)

> 
> test case:
> two btrfs filesystem with just 1 bit diff of UUID
> # blkid /dev/sdb1 /dev/sdb2
> /dev/sdb1: UUID="35327ecf-a5a7-4617-a160-1fdbfd644940" UUID_SUB="a831ebde-1e66-4592-bfde-7a86fd6478b5" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="3e30a849-88db-4fb3-92e6-b66bfbe9cb98"
> /dev/sdb2: UUID="35327ecf-a5a7-4617-a160-1fdbfd644941" UUID_SUB="31e07d66-a656-48a8-b1fb-5b438565238e" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="93a2db85-065a-4ecf-89d4-6a8dcdb8ff99"

Having two UUIDs that differ in just one bit would be somewhat unusual.

> 
> both have 3 subvols.
> # btrfs subvolume list /mnt/test
> ID 256 gen 13 top level 5 path sub1
> ID 257 gen 13 top level 5 path sub2
> ID 258 gen 13 top level 5 path sub3
> # btrfs subvolume list /mnt/scratch
> ID 256 gen 13 top level 5 path sub1
> ID 257 gen 13 top level 5 path sub2
> ID 258 gen 13 top level 5 path sub3
> 
> 
> statfs.f_fsid.c is the source of 'statfs' command.

You can use "stat -f".

> 
> # statfs /mnt/test/sub1 /mnt/test/sub2 /mnt/test/sub3 /mnt/scratch/sub1 /mnt/scratch/sub2 /mnt/scratch/sub3
> /mnt/test/sub1
>         f_fsid=0x9452611458c30e57
> /mnt/test/sub2
>         f_fsid=0x9452611458c30e56
> /mnt/test/sub3
>         f_fsid=0x9452611458c30e55
> /mnt/scratch/sub1
>         f_fsid=0x9452611458c30e56
> /mnt/scratch/sub2
>         f_fsid=0x9452611458c30e57
> /mnt/scratch/sub3
>         f_fsid=0x9452611458c30e54
> 
> statfs.f_fsid is uniq inside a btrfs and it's subvols.
> but statfs.f_fsid is NOT uniq between btrfs filesystems because just 1
> bit diff of UUID.

Maybe we should be using a hash function to mix the various numbers into
the fsid rather than a simple xor.

In general we have no guarantee that the stable identifier for each
filesystem is unique.  We rely on randomness and large numbers of bits
making collisions extremely unlikely.
Using xor doesn't help; we would need some hash scheme that is
guaranteed to be stable.  Maybe the Jenkins Hash.

NeilBrown


> 
> 'find /mnt/test/' and 'find /mnt/scratch/' works as expected.
> 
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/06/23
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* RE: any idea about auto export multiple btrfs snapshots?
  2021-06-23 23:29                                 ` NeilBrown
@ 2021-06-23 23:41                                   ` Frank Filz
  2021-06-24  0:01                                   ` J. Bruce Fields
  1 sibling, 0 replies; 94+ messages in thread
From: Frank Filz @ 2021-06-23 23:41 UTC (permalink / raw)
  To: 'NeilBrown', 'J. Bruce Fields'
  Cc: 'Wang Yugui', linux-nfs

> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> >
> > One other thing I'm not sure about: how do cold cache lookups of
> > filehandles for (possibly not-yet-mounted) subvolumes work?
> 
> Ahhhh...  that's a good point.  Filehandle lookup depends on the target
> filesystem being mounted.  NFS exporting filesystems which are auto-mounted
> on demand would be ... interesting.
> 
> That argues in favour of nfsd treating a btrfs filesystem as a single filesystem
> and gaining some knowledge about different subvolumes within a filesystem.
> 
> This has implications for NFS re-export.  If a filehandle is received for an NFS
> filesystem that needs to be automounted, I expect it would fail.
> 
> Or do we want to introduce a third level in the filehandle: filesystem, subvol,
> inode.  So just the "filesystem" is used to look things up in /proc/mounts, but
> "filesystem+subvol" is used to determine the fsid.
> 
> Maybe another way to state this is that the filesystem could identify a number of
> bytes from the fs-local part of the filehandle that should be mixed in to the fsid.
> That might be a reasonably clean interface.

Hmm, an interesting problem I hadn't considered for nfs-ganesha.  Ganesha
can handle a lookup into a filesystem (we treat subvols as filesystems)
that was not mounted when we started (at startup we scan mnttab and the
btrfs subvol list and add any filesystems belonging to the configured
exports) by re-scanning mnttab and the btrfs subvol list.

But what if Ganesha restarts, and a filesystem that a client holds a handle for was not mounted at restart time, but is mounted by the time the client tries to use the handle? That would be easy for us to fix: if a handle specifies an unknown fsid, trigger a filesystem rescan.
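Frank's rescan fallback can be sketched in a few lines (a toy illustration with hypothetical names, not actual nfs-ganesha code):

```python
def lookup_filesystem(fsid, known, rescan):
    """Resolve an fsid carried in a client filehandle.  `known` maps
    fsid -> filesystem object; `rescan` re-reads mnttab plus the btrfs
    subvol list and returns the current mapping.  If the fsid is
    unknown (e.g. the filesystem was mounted after a server restart),
    rescan once and retry before giving up."""
    fs = known.get(fsid)
    if fs is None:
        known.update(rescan())
        fs = known.get(fsid)
    return fs
```

A handle carrying a never-seen fsid then costs one rescan instead of a hard failure back to the client.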

> > > All we really need is:
> > > 1/ someone to write the code
> > > 2/ someone to review the code
> > > 3/ someone to accept the code
> >
> > Hah.  Still, the special exceptions for btrfs seem to be accumulating.
> > I wonder if that's happening outside nfs as well.
> 
> I have some colleagues who work on btrfs and based on my occasional
> discussions, I think that: yes, btrfs is a bit "special".  There are a number of
> corner-cases where it doesn't quite behave how one would hope.
> This is probably inevitable given the way it is pushing the boundaries of
> functionality.  It can be a challenge to determine if that "hope" is actually
> reasonable, and to figure out a good solution that meets the need cleanly
> without imposing performance burdens elsewhere.

What other special cases does btrfs have that cause nfs servers pain? I know its handle is big, but the only special-case code nfs-ganesha has at the moment is listing the subvols as part of the filesystem scan.

Frank



^ permalink raw reply	[flat|nested] 94+ messages in thread
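The interface Neil floats above, mixing bytes from the fs-local part of the filehandle into the fsid, can be illustrated with a toy sketch. The hashing scheme here is invented for illustration only, not what nfsd would actually implement:

```python
import hashlib

def derive_fsid(base_fsid: int, subvol_bytes: bytes) -> int:
    """Mix filesystem-local filehandle bytes (say, a btrfs subvolume
    id) into the filesystem's base fsid, so each subvolume presents a
    distinct but stable fsid to clients."""
    h = hashlib.sha256(base_fsid.to_bytes(8, "little") + subvol_bytes)
    return int.from_bytes(h.digest()[:8], "little")
```

Two subvolumes of one filesystem then get different fsids, while repeated lookups of the same subvolume stay stable across restarts.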

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23 23:29                                 ` NeilBrown
  2021-06-23 23:41                                   ` Frank Filz
@ 2021-06-24  0:01                                   ` J. Bruce Fields
  1 sibling, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2021-06-24  0:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Wang Yugui, linux-nfs

On Thu, Jun 24, 2021 at 09:29:01AM +1000, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > 
> > One other thing I'm not sure about: how do cold cache lookups of
> > filehandles for (possibly not-yet-mounted) subvolumes work?
> 
> Ahhhh...  that's a good point.  Filehandle lookup depends on the target
> filesystem being mounted.  NFS exporting filesystems which are
> auto-mounted on demand would be ... interesting.
> 
> That argues in favour of nfsd treating a btrfs filesystem as a single
> filesystem and gaining some knowledge about different subvolumes within
> a filesystem.
> 
> This has implications for NFS re-export.  If a filehandle is received
> for an NFS filesystem that needs to be automounted, I expect it would
> fail.

Yeah, that's why this is on my mind.  Currently we can only re-export
filesystems that are explicitly mounted and exported with an fsid=
option.

I had an idea that it would be interesting to run nfsd in a mode where
all it does is re-export everything exported by one single server.  In
theory then you no longer need to do any encapsulation, so you avoid the
maximum-filehandle-length problem.  When you get an unfamiliar
filehandle, you pass it on to the original server and get back an fsid,
and if it's one you haven't seen before you have to cook up a new
vfsmount and stuff somehow.  I ran aground trying to understand how to
do that in a way that wasn't too complicated.

Anyway.

--b.

> Or do we want to introduce a third level in the filehandle: filesystem,
> subvol, inode.  So just the "filesystem" is used to look things up in
> /proc/mounts, but "filesystem+subvol" is used to determine the fsid.
> 
> Maybe another way to state this is that the filesystem could identify a
> number of bytes from the fs-local part of the filehandle that should be
> mixed in to the fsid.  That might be a reasonably clean interface.
> 
> > 
> > > All we really need is:
> > > 1/ someone to write the code
> > > 2/ someone to review the code
> > > 3/ someone to accept the code
> > 
> > Hah.  Still, the special exceptions for btrfs seem to be accumulating.
> > I wonder if that's happening outside nfs as well.
> 
> I have some colleagues who work on btrfs and based on my occasional
> discussions, I think that: yes, btrfs is a bit "special".  There are a
> number of corner-cases where it doesn't quite behave how one would hope.
> This is probably inevitable given the way it is pushing the boundaries
> of functionality.  It can be a challenge to determine if that "hope" is
> actually reasonable, and to figure out a good solution that meets the
> need cleanly without imposing performance burdens elsewhere.
> 
> NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-23 22:04                             ` NeilBrown
  2021-06-23 22:25                               ` J. Bruce Fields
@ 2021-06-24 21:58                               ` Patrick Goetz
  2021-06-24 23:27                                 ` NeilBrown
  1 sibling, 1 reply; 94+ messages in thread
From: Patrick Goetz @ 2021-06-24 21:58 UTC (permalink / raw)
  To: NeilBrown, J. Bruce Fields; +Cc: Wang Yugui, linux-nfs



On 6/23/21 5:04 PM, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
>> Is there any hope of solving this problem within btrfs?
>>
>> It doesn't seem like it should have been that difficult for it to give
>> subvolumes separate superblocks and vfsmounts.
>>
>> But this has come up before, and I think the answer may have been that
>> it's just too late to fix.
> 
> It is never too late to do the right thing!
> 
> Probably the best approach to fixing this completely on the btrfs side
> would be to copy the auto-mount approach used in NFS.  NFS sees multiple
> different volumes on the server and transparently creates new vfsmounts,
> using the automount infrastructure to mount and unmount them.

I'm very confused about what you're talking about.  Is this documented 
somewhere? I mean, I do use autofs, but see that as a separate software 
system working with NFS.


>  BTRFS
> effectively sees multiple volumes on the block device and could do the
> same thing.
> 
> I can only think of one change to the user-space API (other than
> /proc/mounts contents) that this would cause and I suspect it could be
> resolved if needed.
> 
> Currently when you 'stat' the mountpoint of a btrfs subvol you see the
> root of that subvol.  However when you 'stat' the mountpoint of an NFS
> sub-filesystem (before any access below there) you see the mountpoint
> (s_dev matches the parent).  This is how automounts are expected to work
> and if btrfs were switched to use automounts for subvols, stating the
> mountpoint would initially show the mountpoint, not the subvol root.
> 
> If this were seen to be a problem I doubt it would be hard to add
> optional functionality to automount so that 'stat' triggers the mount.
> 
> All we really need is:
> 1/ someone to write the code
> 2/ someone to review the code
> 3/ someone to accept the code
> 
> How hard can it be :-)
> 
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: any idea about auto export multiple btrfs snapshots?
  2021-06-24 21:58                               ` Patrick Goetz
@ 2021-06-24 23:27                                 ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-06-24 23:27 UTC (permalink / raw)
  To: Patrick Goetz; +Cc: J. Bruce Fields, Wang Yugui, linux-nfs

On Fri, 25 Jun 2021, Patrick Goetz wrote:
> 
> On 6/23/21 5:04 PM, NeilBrown wrote:
> > 
> > Probably the best approach to fixing this completely on the btrfs side
> > would be to copy the auto-mount approach used in NFS.  NFS sees multiple
> > different volumes on the server and transparently creates new vfsmounts,
> > using the automount infrastructure to mount and unmount them.
> 
> I'm very confused about what you're talking about.  Is this documented 
> somewhere? I mean, I do use autofs, but see that as a separate software 
> system working with NFS.
> 

autofs (together with the user-space automountd) is a special filesystem
that provides automount functionality to the sysadmin.
It makes use of some core automount functionality in the Linux VFS.
This functionality is referred to as "managed" dentries.
See "Revalidation and automounts" in https://lwn.net/Articles/649115/.

autofs makes use of this functionality to provide automounts.  NFS makes
use of this same functionality to provide the same mount-point structure
on the client that it finds on the server.

I don't think there is any documentation specifically about NFS using
this infrastructure.  It should be largely transparent to users.

Suppose that on the server "/export/foo" is a mount of some
filesystem, and you nfs4 mount "server:/export" to "/import" on the
client.
Then you will at first see only "/import" in /proc/mounts on the client.
If you "ls -ld /import/foo" you will still only see /import.
But if you "ls -l /import/foo" so it lists the contents of that other
filesystem, then check /proc/mounts, you will now see "/import" and
"/import/foo".
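The s_dev behaviour described here is visible from user space; a minimal sketch of the comparison (on an ordinary directory and its subdirectory the device numbers match; across a mount boundary they differ):

```python
import os

def crosses_filesystem(parent: str, child: str) -> bool:
    """True when `child` reports a different st_dev than `parent`,
    i.e. a sub-filesystem (automounted or otherwise) has actually
    been mounted at or below that point."""
    return os.stat(parent).st_dev != os.stat(child).st_dev
```

Running this against /import and /import/foo before and after listing the contents would show the automount being triggered.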

After a while (between 500 and 1000 seconds I think) of not accessing
/import/foo, that entry will disappear from /proc/mounts.

I'm sure you will recognise this as very similar to autofs behaviour.
It uses the same core functionality.  The timeout for inactive NFS
sub-filesystems to be unmounted can be controlled via
/proc/sys/fs/nfs/nfs_mountpoint_timeout and, since Linux 5.7, via the
nfs_mountpoint_expiry_timeout module parameter.
These aren't documented.

Note that I'm no longer sure that btrfs using automount like this would
actually make things easier for nfsd.  But in some ways I think it would
be the "right" thing to do.

NeilBrown


^ permalink raw reply	[flat|nested] 94+ messages in thread

* cannot use btrfs for nfs server
  2021-03-11  7:46   ` Ulli Horlacher
@ 2021-07-08 22:17     ` Ulli Horlacher
  2021-07-09  0:05       ` Graham Cobb
  2021-07-09 16:06       ` Lord Vader
  0 siblings, 2 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-08 22:17 UTC (permalink / raw)
  To: linux-btrfs


I have waited some time and some Ubuntu updates, but the bug is still there:

On Thu 2021-03-11 (08:46), Ulli Horlacher wrote:
> On Wed 2021-03-10 (08:46), Ulli Horlacher wrote:
> 
> > When I try to access a btrfs filesystem via nfs, I get the error:
> > 
> > root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> > root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> > find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.
> 
> It is even worse:
> 
> root@tsmsrvj:# grep localhost /etc/exports
> /data/fex       localhost(rw,async,no_subtree_check,no_root_squash)
> 
> root@tsmsrvj:# mount localhost:/data/fex /nfs/localhost/fex
> 
> root@tsmsrvj:# du -s /data/fex
> 64282240        /data/fex
> 
> root@tsmsrvj:# du -s /nfs/localhost/fex
> du: WARNING: Circular directory structure.
> This almost certainly means that you have a corrupted file system.
> NOTIFY YOUR SYSTEM MANAGER.
> The following directory is part of the cycle:
>   /nfs/localhost/fex/spool
> 
> 0       /nfs/localhost/fex
> 
> root@tsmsrvj:# btrfs subvolume list /data
> ID 257 gen 42 top level 5 path fex
> ID 270 gen 42 top level 257 path fex/spool
> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test

root@tsmsrvj:~# uname -a
Linux tsmsrvj 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@tsmsrvj:~# btrfs version
btrfs-progs v5.4.1

root@tsmsrvj:~# dpkg -l | grep nfs-
ii  nfs-common                             1:1.3.4-2.5ubuntu3.4              amd64        NFS support files common to client and server
ii  nfs-kernel-server                      1:1.3.4-2.5ubuntu3.4              amd64        support for NFS kernel server

This makes btrfs with snapshots unusable as an nfs server :-(

How/where can I escalate it further?

My Ubuntu bug report has been ignored :-(

https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1918599

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20210311074636.GA28705@tik.uni-stuttgart.de>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-08 22:17     ` cannot use btrfs for nfs server Ulli Horlacher
@ 2021-07-09  0:05       ` Graham Cobb
  2021-07-09  4:05         ` NeilBrown
  2021-07-09  6:53         ` Ulli Horlacher
  2021-07-09 16:06       ` Lord Vader
  1 sibling, 2 replies; 94+ messages in thread
From: Graham Cobb @ 2021-07-09  0:05 UTC (permalink / raw)
  To: linux-btrfs

On 08/07/2021 23:17, Ulli Horlacher wrote:
> 
> I have waited some time and some Ubuntu updates, but the bug is still there:

Yes: find and du get confused about seeing inode numbers reused in what
they think is a single filesystem. However, the filesystems are not
actually corrupted, and all normal file and directory actions work
correctly. The loops and cycles are not there - but find and du can't
tell that.
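find's loop check keys on (st_dev, st_ino) pairs; a minimal sketch (synthetic stat data, not a real directory walk) of why reused inode numbers under a single device number misfire:

```python
def report_loops(entries):
    """entries: (path, dev, ino) triples as a directory walker would
    stat them.  A repeated (dev, ino) pair is reported as a loop --
    the check that falsely fires when an NFS export of btrfs presents
    several subvolume roots with identical (dev, ino)."""
    seen = {}
    loops = []
    for path, dev, ino in entries:
        key = (dev, ino)
        if key in seen:
            loops.append((path, seen[key]))
        else:
            seen[key] = path
    return loops
```

With one device number for the whole export and the subvolume root inode repeating, /fex/spool looks like a loop back to /fex even though no cycle exists on disk.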

I use NFS mounts of btrfs disks all the time and have never had any real
problem - just find and du confused.

You can eliminate the problems by exporting and mounting single
subvolumes only - making sure that there are no nested subvolumes
exported, or that the subvolumes are all mounted individually.
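A sketch of that single-subvolume approach in /etc/exports, using the thread's paths (the fsid values are arbitrary illustrations; each exported subvolume gets its own line and its own fsid):

```
/data/fex        tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=101)
/data/fex/spool  tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=102)
```

The client then mounts tsmsrvj:/data/fex and tsmsrvj:/data/fex/spool separately, so each subvolume arrives with a distinct NFS filesystem id and find/du see a proper mount boundary instead of a repeated inode. Whether this fully silences the warnings with nested .snapshot subvolumes is explored later in the thread.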

> This makes btrfs with snapshots unusable as an nfs server :-(

No, it doesn't. I use it ALL the time: my main data lives on btrfs
servers and is exported to the clients. I use tools like btrbk and
btrfs-snapshot-aware-rsnapshot on the server and then access those btrfs
snapshots from the clients over NFS as well (for example to retrieve
accidentally deleted files). You just have to be careful with subvolume
structure and what you mount where. And I recommend only using find and
du operations on the server, not the client.

> How/where can I escalate it further?

Try complaining to NFS. It might be that it would work better if NFS
assigned different NFS filesystem IDs to each subvolume - I don't know.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  0:05       ` Graham Cobb
@ 2021-07-09  4:05         ` NeilBrown
  2021-07-09  6:53         ` Ulli Horlacher
  1 sibling, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-09  4:05 UTC (permalink / raw)
  To: Graham Cobb; +Cc: linux-btrfs

On Fri, 09 Jul 2021, Graham Cobb wrote:
> 
> > How/where can I escalate it further?
> 
> Try complaining to NFS. It might be that it would work better if NFS
> assigned different NFS filesystem IDs to each subvolume - I don't know.
> 
> 
Better than complaining...:

Apply the patch you can find at

 https://lore.kernel.org/linux-nfs/162457725899.28671.14092003979067994699@noble.neil.brown.name/T/#mc4752a019af79cbb166d5338d6ed0db141832546

then apply the fix described at

 https://lore.kernel.org/linux-nfs/162457725899.28671.14092003979067994699@noble.neil.brown.name/T/#mc26984e10e7837e28aca3209fcb03b38a4df6fe7

which I think is shown in more detail in a subsequent message in the
thread.

Then confirm for yourself that it works.
Then reply to that thread (or send a new message to linux-nfs) saying something like:

 Hi,
  I've been having problems with NFS and btrfs too.  I found this patch
  and it works really well for me.  Any chance we can get it included
  upstream? 

That might spur us on to further action - enthusiasm is much better than
complaints :-)

(the problem is not that NFS doesn't assign different filesystem IDs,
 the problem is that NFSd doesn't tell NFS that there are different volumes).

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  0:05       ` Graham Cobb
  2021-07-09  4:05         ` NeilBrown
@ 2021-07-09  6:53         ` Ulli Horlacher
  2021-07-09  7:23           ` Forza
  2021-07-09 16:35           ` Chris Murphy
  1 sibling, 2 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-09  6:53 UTC (permalink / raw)
  To: linux-btrfs

On Fri 2021-07-09 (01:05), Graham Cobb wrote:
> On 08/07/2021 23:17, Ulli Horlacher wrote:
> 
> > 
> > I have waited some time and some Ubuntu updates, but the bug is still there:
> 
> Yes: find and du get confused about seeing inode numbers reused in what
> they think is a single filesystem.

A lot of tools aren't working correctly any more, even ls:

root@tsmsrvj:~# ls -R /nfs/localhost/fex | wc 
ls: /nfs/localhost/fex/spool: not listing already-listed directory

In consequence, many cron jobs and monitoring tools will fail :-(


> You can eliminate the problems by exporting and mounting single
> subvolumes only 

This is not possible at our site, we use rotating snapshots created by a
cronjob.


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<56c40592-0937-060a-5f8a-969d8a88d541@cobb.uk.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  6:53         ` Ulli Horlacher
@ 2021-07-09  7:23           ` Forza
  2021-07-09  7:24             ` Hugo Mills
  2021-07-09  7:34             ` Ulli Horlacher
  2021-07-09 16:35           ` Chris Murphy
  1 sibling, 2 replies; 94+ messages in thread
From: Forza @ 2021-07-09  7:23 UTC (permalink / raw)
  To: Ulli Horlacher, linux-btrfs

Hello everyone, 

---- From: Ulli Horlacher <framstag@rus.uni-stuttgart.de> -- Sent: 2021-07-09 - 08:53 ----

> On Fri 2021-07-09 (01:05), Graham Cobb wrote:
>> On 08/07/2021 23:17, Ulli Horlacher wrote:
>> 
>> > 
>> > I have waited some time and some Ubuntu updates, but the bug is still there:
>> 
>> Yes: find and du get confused about seeing inode numbers reused in what
>> they think is a single filesystem.
> 
> A lot of tools aren't working correctly any more, even ls:
> 
> root@tsmsrvj:~# ls -R /nfs/localhost/fex | wc 
> ls: /nfs/localhost/fex/spool: not listing already-listed directory
> 
> In consequence, many cron jobs and monitoring tools will fail :-(
> 
> 
>> You can eliminate the problems by exporting and mounting single
>> subvolumes only 
> 
> This is not possible at our site, we use rotating snapshots created by a
> cronjob.
> 
> 

Have you tried using the fsid= export option in /etc/exports? 

Example:
/media/nfs/  192.168.0.*(fsid=20000001,rw,sync,no_subtree_check,no_root_squash)

We're using this with Btrfs subvols without issues. We use NFSv4 so I do not know how this works with NFSv3. 

Example:
## On the Ubuntu NFS server:
# btrfs sub list -o .
ID 5384 gen 345641 top level 258 path volume/nfs_ssd/132bbc3e-aed1-15a5-f30d-9515e490e62c/subvol1
ID 5385 gen 345640 top level 258 path volume/nfs_ssd/132bbc3e-aed1-15a5-f30d-9515e490e62c/subvol2

## On the NFS client:
[09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# ll
total 0
drwxr-xr-x 1 root root 6 Jul  9 09:17 subvol1
drwxr-xr-x 1 root root 0 Jul  9 09:17 subvol2
[09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch subvol1/foo
[09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch subvol2/bar
[09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch foobar
[09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# ll -R
.:
total 0
-rw-r--r-- 1 root root  0 Jul  9 09:20 foobar
drwxr-xr-x 1 root root 12 Jul  9 09:20 subvol1
drwxr-xr-x 1 root root  6 Jul  9 09:20 subvol2

./subvol1:
total 0
-rw-r--r-- 1 root root 0 Jul  9 09:17 bar
-rw-r--r-- 1 root root 0 Jul  9 09:20 foo

./subvol2:
total 0
-rw-r--r-- 1 root root 0 Jul  9 09:20 bar




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  7:23           ` Forza
@ 2021-07-09  7:24             ` Hugo Mills
  2021-07-09  7:34             ` Ulli Horlacher
  1 sibling, 0 replies; 94+ messages in thread
From: Hugo Mills @ 2021-07-09  7:24 UTC (permalink / raw)
  To: Forza; +Cc: Ulli Horlacher, linux-btrfs

   I'm using it on NFSv3 and it works fine for me.

   Hugo.

On Fri, Jul 09, 2021 at 09:23:14AM +0200, Forza wrote:
> Hello everyone, 
> 
> ---- From: Ulli Horlacher <framstag@rus.uni-stuttgart.de> -- Sent: 2021-07-09 - 08:53 ----
> 
> > On Fri 2021-07-09 (01:05), Graham Cobb wrote:
> >> On 08/07/2021 23:17, Ulli Horlacher wrote:
> >> 
> >> > 
> >> > I have waited some time and some Ubuntu updates, but the bug is still there:
> >> 
> >> Yes: find and du get confused about seeing inode numbers reused in what
> >> they think is a single filesystem.
> > 
> > A lot of tools aren't working correctly any more, even ls:
> > 
> > root@tsmsrvj:~# ls -R /nfs/localhost/fex | wc 
> > ls: /nfs/localhost/fex/spool: not listing already-listed directory
> > 
> > In consequence, many cron jobs and monitoring tools will fail :-(
> > 
> > 
> >> You can eliminate the problems by exporting and mounting single
> >> subvolumes only 
> > 
> > This is not possible at our site, we use rotating snapshots created by a
> > cronjob.
> > 
> > 
> 
> Have you tried using the fsid= export option in /etc/exports? 
> 
> Example:
> /media/nfs/  192.168.0.*(fsid=20000001,rw,sync,no_subtree_check,no_root_squash)
> 
> We're using this with Btrfs subvols without issues. We use NFSv4 so I do not know how this works with NFSv3. 
> 
> Example:
> ## On the Ubuntu NFS server:
> # btrfs sub list -o .
> ID 5384 gen 345641 top level 258 path volume/nfs_ssd/132bbc3e-aed1-15a5-f30d-9515e490e62c/subvol1
> ID 5385 gen 345640 top level 258 path volume/nfs_ssd/132bbc3e-aed1-15a5-f30d-9515e490e62c/subvol2
> 
> ## On the NFS client:
> [09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# ll
> total 0
> drwxr-xr-x 1 root root 6 Jul  9 09:17 subvol1
> drwxr-xr-x 1 root root 0 Jul  9 09:17 subvol2
> [09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch subvol1/foo
> [09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch subvol2/bar
> [09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# touch foobar
> [09:20 srv01 132bbc3e-aed1-15a5-f30d-9515e490e62c]# ll -R
> .:
> total 0
> -rw-r--r-- 1 root root  0 Jul  9 09:20 foobar
> drwxr-xr-x 1 root root 12 Jul  9 09:20 subvol1
> drwxr-xr-x 1 root root  6 Jul  9 09:20 subvol2
> 
> ./subvol1:
> total 0
> -rw-r--r-- 1 root root 0 Jul  9 09:17 bar
> -rw-r--r-- 1 root root 0 Jul  9 09:20 foo
> 
> ./subvol2:
> total 0
> -rw-r--r-- 1 root root 0 Jul  9 09:20 bar
> 
> 
> 

-- 
Hugo Mills             | Modern medicine does not treat causes: headaches are
hugo@... carfax.org.uk | not caused by a paracetamol deficiency.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  7:23           ` Forza
  2021-07-09  7:24             ` Hugo Mills
@ 2021-07-09  7:34             ` Ulli Horlacher
  2021-07-09 16:30               ` Chris Murphy
  1 sibling, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-09  7:34 UTC (permalink / raw)
  To: linux-btrfs

On Fri 2021-07-09 (09:23), Forza wrote:

> > In consequence, many cron jobs and monitoring tools will fail :-(
> > 
> >> You can eliminate the problems by exporting and mounting single
> >> subvolumes only 
> > 
> > This is not possible at our site, we use rotating snapshots created by a
> > cronjob.

> Have you tried using the fsid= export option in /etc/exports? 

I have tested it with localhost:

root@tsmsrvj:/# grep localhost /etc/exports 
/data/fex       localhost(rw,async,no_subtree_check,no_root_squash,fsid=20000001)

root@tsmsrvj:/# mount -v localhost:/data/fex /nfs/localhost/fex
mount.nfs: timeout set for Fri Jul  9 09:32:55 2021
mount.nfs: trying text-based options 'vers=4.2,addr=127.0.0.1,clientaddr=127.0.0.1'

root@tsmsrvj:/# mount | grep localhost
localhost:/data/fex on /nfs/localhost/fex type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

root@tsmsrvj:/# du -s /nfs/localhost/fex
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/fex/spool


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<475ccf1.ca37f515.17a8a262a72@tnonline.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-08 22:17     ` cannot use btrfs for nfs server Ulli Horlacher
  2021-07-09  0:05       ` Graham Cobb
@ 2021-07-09 16:06       ` Lord Vader
  2021-07-10  7:03         ` Ulli Horlacher
  1 sibling, 1 reply; 94+ messages in thread
From: Lord Vader @ 2021-07-09 16:06 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 9 Jul 2021 at 01:18, Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:
> I have waited some time and some Ubuntu updates, but the bug is still there:
> > > When I try to access a btrfs filesystem via nfs, I get the error:
> > > root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> > > root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> > > find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.

Can you try exporting NFS share with 'crossmnt' option?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  7:34             ` Ulli Horlacher
@ 2021-07-09 16:30               ` Chris Murphy
  2021-07-10  6:35                 ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Chris Murphy @ 2021-07-09 16:30 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Jul 9, 2021 at 1:34 AM Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:
>
> root@tsmsrvj:/# du -s /nfs/localhost/fex
> du: WARNING: Circular directory structure.
> This almost certainly means that you have a corrupted file system.
> NOTIFY YOUR SYSTEM MANAGER.
> The following directory is part of the cycle:
>   /nfs/localhost/fex/spool

What do you get for:

btrfs subvolume list -to /nfs/localhost/fex


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09  6:53         ` Ulli Horlacher
  2021-07-09  7:23           ` Forza
@ 2021-07-09 16:35           ` Chris Murphy
  2021-07-10  6:56             ` Ulli Horlacher
  1 sibling, 1 reply; 94+ messages in thread
From: Chris Murphy @ 2021-07-09 16:35 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Jul 9, 2021 at 12:53 AM Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:
>
> On Fri 2021-07-09 (01:05), Graham Cobb wrote:

> > You can eliminate the problems by exporting and mounting single
> > subvolumes only
>
> This is not possible at our site, we use rotating snapshots created by a
> cronjob.

These two things sound orthogonal to me. You can have a:

<FS_TREE>/fex which is mounted via fstab using -o subvol=fex /nfs/localhost/fex

And you can separately snapshot fex from the top-level, mounted
anywhere you want, but I kinda like putting such things in /run/
because then they're not in the way for more routine/interactive
locations like /media or /mnt.

But I don't really understand your workflow, or what the fstab or
subvolume setup looks like. Are you able to share the cron job script,
the fstab, and the full subvolume listing? btrfs subvolume list -ta
/nfs/localhost/fex ?




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09 16:30               ` Chris Murphy
@ 2021-07-10  6:35                 ` Ulli Horlacher
  2021-07-11 11:41                   ` Forza
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-10  6:35 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri 2021-07-09 (10:30), Chris Murphy wrote:
> On Fri, Jul 9, 2021 at 1:34 AM Ulli Horlacher
> <framstag@rus.uni-stuttgart.de> wrote:
> 
> >
> > root@tsmsrvj:/# du -s /nfs/localhost/fex
> > du: WARNING: Circular directory structure.
> > This almost certainly means that you have a corrupted file system.
> > NOTIFY YOUR SYSTEM MANAGER.
> > The following directory is part of the cycle:
> >   /nfs/localhost/fex/spool
> 
> What do you get for:
> 
> btrfs subvolume list -to /nfs/localhost/fex

root@tsmsrvj:~# btrfs subvolume list -to /nfs/localhost/fex
ERROR: not a btrfs filesystem: /nfs/localhost/fex
ERROR: can't access '/nfs/localhost/fex'


root@tsmsrvj:~# mount | grep localhost
localhost:/data/fex on /nfs/localhost/fex type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<CAJCQCtR=Xar+0pD9ivhk-kfrWxTxbJpVYu3z8A617GKshf2AsA@mail.gmail.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09 16:35           ` Chris Murphy
@ 2021-07-10  6:56             ` Ulli Horlacher
  2021-07-10 22:17               ` Chris Murphy
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-10  6:56 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri 2021-07-09 (10:35), Chris Murphy wrote:

> But I don't really understand your workflow, or what the fstab or
> subvolume setup looks like. Are you able to share the cron job script,
> the fstab, and the full subvolume listing? btrfs subvolume list -ta
> /nfs/localhost/fex ?

/nfs/localhost/fex is just a test setup on a test server.
The production server does not use nfs so far, but we plan to migrate from
local disks to nfs. But before we do it, btrfs via nfs MUST work without
problems and error messages.

/nfs/localhost/fex is not in /etc/fstab, I have mounted it manually, as I
wrote in my previous mails. It is just a test.


root@tsmsrvj:# grep local /etc/exports 
/data/fex       localhost(rw,async,no_subtree_check,no_root_squash,fsid=20000001)

root@tsmsrvj:# mount -v localhost:/data/fex /nfs/localhost/fex
mount.nfs: timeout set for Sat Jul 10 08:47:57 2021
mount.nfs: trying text-based options 'vers=4.2,addr=127.0.0.1,clientaddr=127.0.0.1'

root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
$ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'

root@tsmsrvj:# snaprotate -l
/data/fex/spool/.snapshot/2021-03-07_1453.test
/data/fex/spool/.snapshot/2021-03-07_1531.test
/data/fex/spool/.snapshot/2021-03-07_1532.test
/data/fex/spool/.snapshot/2021-03-07_1718.test
/data/fex/spool/.snapshot/2021-07-10_0849.test

root@tsmsrvj:# btrfs subvolume list /data
ID 257 gen 1466 top level 5 path fex
ID 270 gen 1471 top level 257 path fex/spool
ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
ID 394 gen 1470 top level 270 path fex/spool/.snapshot/2021-07-10_0849.test




We cannot move the snapshots to a different directory. Our workflow
depends on snaprotate:

http://fex.belwue.de/linuxtools/snaprotate.html



-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<CAJCQCtQvak-28B7eUf5zRnAeGK27qZaF-1ZZt=OAHk+2KmfsWQ@mail.gmail.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-09 16:06       ` Lord Vader
@ 2021-07-10  7:03         ` Ulli Horlacher
  0 siblings, 0 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-10  7:03 UTC (permalink / raw)
  To: linux-btrfs

On Fri 2021-07-09 (19:06), Lord Vader wrote:
> On Fri, 9 Jul 2021 at 01:18, Ulli Horlacher
> <framstag@rus.uni-stuttgart.de> wrote:
> 
> > I have waited some time and some Ubuntu updates, but the bug is still there:
> > > > When I try to access a btrfs filesystem via nfs, I get the error:
> > > > root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> > > > root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> > > > find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same file system loop as '/nfs/tsmsrvj/fex'.
> 
> Can you try exporting NFS share with 'crossmnt' option?

root@tsmsrvj:/etc# exportfs -v
/data/fex       localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

root@tsmsrvj:/etc# mount -v localhost:/data/fex /nfs/localhost/fex
mount.nfs: timeout set for Sat Jul 10 09:02:31 2021
mount.nfs: trying text-based options 'vers=4.2,addr=127.0.0.1,clientaddr=127.0.0.1'

root@tsmsrvj:/etc# du -s /nfs/localhost/fex
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/fex/spool

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<CAMnT83vyufNCMDQQnyYi-k8dOft3_bc_2L-rgHOBzeWgKqPt2A@mail.gmail.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-10  6:56             ` Ulli Horlacher
@ 2021-07-10 22:17               ` Chris Murphy
  2021-07-12  7:25                 ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Chris Murphy @ 2021-07-10 22:17 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat, Jul 10, 2021 at 12:56 AM Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:

> root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
> $ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
> Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'

I think this might be the source of the problem. Nested snapshots are
not a good idea; they cause various kinds of confusion. It's no
different from taking an LVM snapshot and nesting a bind mount of one
file system in another. I have no idea how NFS works, but it sounds to
me like it's getting confused when it finds the same file system inodes
multiple times, and that's just what happens with snapshots, whether
Btrfs or some other snapshotting mechanism.
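What trips find and du here can be illustrated with a small sketch (illustrative only, not code from the thread; all names and numbers are made up): these tools remember the (st_dev, st_ino) pair of each directory on the current path and report a loop when a pair repeats. Every btrfs subvolume root has inode 256; locally each subvolume reports its own st_dev, but if the NFS client collapses the export onto a single st_dev, two distinct directories collide:

```python
# Sketch of find-style loop detection keyed on (st_dev, st_ino).
# Hypothetical values: btrfs subvolume roots all have st_ino 256;
# locally each subvolume gets its own st_dev, but over NFS the
# client may report one st_dev for the whole export.

def walk(tree, seen=()):
    """Yield a warning for every (dev, ino) pair repeated on the path."""
    dev_ino = tree["stat"]
    if dev_ino in seen:
        yield f"loop detected at {tree['name']}"
        return
    for child in tree.get("children", []):
        yield from walk(child, seen + (dev_ino,))

# Local view: /data/fex and /data/fex/spool are separate subvolumes,
# each with its own st_dev -> no loop reported.
local = {"name": "/data/fex", "stat": (100, 256),
         "children": [{"name": "/data/fex/spool", "stat": (101, 256)}]}

# NFS view: one st_dev for the export, both roots keep ino 256
# -> a false loop is reported.
nfs = {"name": "/fex", "stat": (50, 256),
       "children": [{"name": "/fex/spool", "stat": (50, 256)}]}

print(list(walk(local)))  # []
print(list(walk(nfs)))    # ['loop detected at /fex/spool']
```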


> We cannot move the snapshots to a different directory. Our workflow
> depends on snaprotate:
>
> http://fex.belwue.de/linuxtools/snaprotate.html

OK does the problem happen if you have no nested snapshots (no nested
subvolumes of any kind) in the NFS export path? If the problem doesn't
happen, then either the tool you've chosen needs to be enhanced so it
will create snapshots somewhere else, which Btrfs supports, or you
need to find another tool that can.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-10  6:35                 ` Ulli Horlacher
@ 2021-07-11 11:41                   ` Forza
  2021-07-12  7:17                     ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Forza @ 2021-07-11 11:41 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2021-07-10 08:35, Ulli Horlacher wrote:
> On Fri 2021-07-09 (10:30), Chris Murphy wrote:
>> On Fri, Jul 9, 2021 at 1:34 AM Ulli Horlacher
>> <framstag@rus.uni-stuttgart.de> wrote:
>>
>>>
>>> root@tsmsrvj:/# du -s /nfs/localhost/fex
>>> du: WARNING: Circular directory structure.
>>> This almost certainly means that you have a corrupted file system.
>>> NOTIFY YOUR SYSTEM MANAGER.
>>> The following directory is part of the cycle:
>>>    /nfs/localhost/fex/spool
>>
>> What do you get for:
>>
>> btrfs subvolume list -to /nfs/localhost/fex
> 
> root@tsmsrvj:~# btrfs subvolume list -to /nfs/localhost/fex
> ERROR: not a btrfs filesystem: /nfs/localhost/fex
> ERROR: can't access '/nfs/localhost/fex'
> 
> 
> root@tsmsrvj:~# mount | grep localhost
> localhost:/data/fex on /nfs/localhost/fex type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)
> 
> 

I think you should have run that on the btrfs filesystem, not the nfs mount:

btrfs subvolume list -to /data/fex

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-11 11:41                   ` Forza
@ 2021-07-12  7:17                     ` Ulli Horlacher
  0 siblings, 0 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-12  7:17 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun 2021-07-11 (13:41), Forza wrote:

> btrfs subvolume list -to /data/fex

root@tsmsrvj:/# btrfs subvolume list -to /data/fex
ID      gen     top level       path
--      ---     ---------       ----
270     1471    257             fex/spool


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<2fd105cb-c097-63e8-0c43-049dceeb93c9@tnonline.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-10 22:17               ` Chris Murphy
@ 2021-07-12  7:25                 ` Ulli Horlacher
  2021-07-12 13:06                   ` Graham Cobb
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-12  7:25 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat 2021-07-10 (16:17), Chris Murphy wrote:
> On Sat, Jul 10, 2021 at 12:56 AM Ulli Horlacher
> <framstag@rus.uni-stuttgart.de> wrote:
> 
> > root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
> > $ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
> > Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'
> 
> I think this might be the source of the problem. Nested snapshots are
> not a good idea; they cause various kinds of confusion.

I do not have nested snapshots anywhere.
/data/fex/spool is not a snapshot.
/data/fex/spool/.snapshot/2021-07-10_0849.test is a simple snapshot of
the btrfs subvolume /data/fex/spool


> > We cannot move the snapshots to a different directory. Our workflow
> > depends on snaprotate:
> >
> > http://fex.belwue.de/linuxtools/snaprotate.html
> 
> OK does the problem happen if you have no nested snapshots (no nested
> subvolumes of any kind) in the NFS export path?
>
> If the problem doesn't happen, then either the tool you've chosen needs
> to be enhanced so it will create snapshots somewhere else, which Btrfs
> supports, or you need to find another tool that can.

Without snapshots there is no problem, but we need access to the snapshots
on the nfs clients for backup/recovery, like Netapp offers.
But Netapp is EXPENSIVE :-}

If we cannot handle it with btrfs, then we have to switch to ZFS.

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<CAJCQCtQn0=8KiB=2garN8k2NRd1PO3HBnrMNvmqssSfKT2-UXQ@mail.gmail.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-12  7:25                 ` Ulli Horlacher
@ 2021-07-12 13:06                   ` Graham Cobb
  2021-07-12 16:16                     ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: Graham Cobb @ 2021-07-12 13:06 UTC (permalink / raw)
  To: Btrfs BTRFS

On 12/07/2021 08:25, Ulli Horlacher wrote:
> On Sat 2021-07-10 (16:17), Chris Murphy wrote:
>> On Sat, Jul 10, 2021 at 12:56 AM Ulli Horlacher
>> <framstag@rus.uni-stuttgart.de> wrote:
>>
>>> root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
>>> $ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
>>> Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'
>>
>> I think this might be the source of the problem. Nested snapshots are
>> not a good idea; they cause various kinds of confusion.
> 
> I do not have nested snapshots anywhere.
> /data/fex/spool is not a snapshot.

But it is the subvolume which is being snapshotted. What happens if you
put the snapshots somewhere that is not part of that subvolume? For
example, create /data/fex/snapshots, snapshot /data/fex/spool into a
snapshot in /data/fex/snapshots/spool/2021-07-10_0849.test, export
/data/fex/snapshots using NFS and mount /data/fex/snapshots on the client?
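One possible layout for this (a sketch, untested; the client name and fsid values are placeholders, following the fsid= option already used earlier in the thread) is to export the data subvolume and the snapshot area as two separate entries, each with its own fsid:

```
# /etc/exports (sketch; each fsid must be unique on this server)
/data/fex        tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=101)
/data/snapshots  tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=102)
```

The client would then mount each export on its own mount point, e.g. /nfs/tsmsrvj/fex and /nfs/tsmsrvj/snapshots.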

> /data/fex/spool/.snapshot/2021-07-10_0849.test is a simple snapshot of
> the btrfs subvolume /data/fex/spool
> 
> 
>>> We cannot move the snapshots to a different directory. Our workflow
>>> depends on snaprotate:
>>>
>>> http://fex.belwue.de/linuxtools/snaprotate.html

Won't snaprotate follow softlinks? ln -s /data/fex/snapshots
/data/fex/spool/.snapshot

>>
>> OK does the problem happen if you have no nested snapshots (no nested
>> subvolumes of any kind) in the NFS export path?
>>
>> If the problem doesn't happen, then either the tool you've chosen needs
>> to be enhanced so it will create snapshots somewhere else, which Btrfs
>> supports, or you need to find another tool that can.
> 
> Without snapshots there is no problem, but we need access to the snapshots
> on the nfs clients for backup/recovery like Netapp offers it.
> But Netapp is EXPENSIVE :-}

My server snapshots data subvolumes into a different part of the tree
(in my case I use btrbk) and exports them to clients and the clients can
access all the snapshots over NFS perfectly well.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-12 13:06                   ` Graham Cobb
@ 2021-07-12 16:16                     ` Ulli Horlacher
  2021-07-12 22:56                       ` g.btrfs
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-12 16:16 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon 2021-07-12 (14:06), Graham Cobb wrote:

> >>> root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
> >>> $ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
> >>> Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'
> >>
> >> I think this might be the source of the problem. Nested snapshots are
> >> not a good idea; they cause various kinds of confusion.
> > 
> > I do not have nested snapshots anywhere.
> > /data/fex/spool is not a snapshot.
> 
> But it is the subvolume which is being snapshotted. What happens if you
> put the snapshots somewhere that is not part of that subvolume? For
> example, create /data/fex/snapshots, snapshot /data/fex/spool into a
> snapshot in /data/fex/snapshots/spool/2021-07-10_0849.test, export
> /data/fex/snapshots using NFS and mount /data/fex/snapshots on the client?

Same problem:

root@tsmsrvj:/etc# mount | grep data
/dev/sdb1 on /data type btrfs (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)

root@tsmsrvj:/etc# mkdir /data/snapshots /nfs/localhost/snapshots

root@tsmsrvj:/etc# btrfs subvolume snapshot -r /data/fex/spool /data/snapshots/fex_1
Create a readonly snapshot of '/data/fex/spool' in '/data/snapshots/fex_1'

root@tsmsrvj:/etc# btrfs subvolume snapshot -r /data/fex/spool /data/snapshots/fex_2
Create a readonly snapshot of '/data/fex/spool' in '/data/snapshots/fex_2'

root@tsmsrvj:/etc# btrfs subvolume list /data
ID 257 gen 1558 top level 5 path fex
ID 270 gen 1557 top level 257 path fex/spool
ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
ID 394 gen 1470 top level 270 path fex/spool/.snapshot/2021-07-10_0849.test
ID 399 gen 1554 top level 270 path fex/spool/.snapshot/2021-07-12_1747.test
ID 400 gen 1556 top level 5 path snapshots/fex_1
ID 401 gen 1557 top level 5 path snapshots/fex_2

root@tsmsrvj:/etc# grep localhost /etc/exports 
/data/fex       localhost(rw,async,no_subtree_check,no_root_squash,crossmnt)
/data/snapshots localhost(rw,async,no_subtree_check,no_root_squash,crossmnt)

## ==> no nested subvolumes! different nfs exports

root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
root@tsmsrvj:/etc# mount | grep localhost
localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)

root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots /nfs/localhost/snapshots
root@tsmsrvj:/etc# mount | grep localhost
localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
localhost:/data/fex on /nfs/localhost/snapshots type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)

## why localhost:/data/fex twice??

root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/snapshots/spool

51425792        /nfs/localhost/snapshots



> >>> We cannot move the snapshots to a different directory. Our workflow
> >>> depends on snaprotate:
> >>>
> >>> http://fex.belwue.de/linuxtools/snaprotate.html
> 
> Won't snaprotate follow softlinks? ln -s /data/fex/snapshots
> /data/fex/spool/.snapshot

Yes, it does; the snapshot storage place is just a simple directory, it
does not have to be a subvolume. So a symbolic link is ok, but it does
not help, see above.


> My server snapshots data subvolumes into a different part of the tree
> (in my case I use btrbk) and exports them to clients and the clients can
> access all the snapshots over NFS perfectly well.

It does not work in my test environment with Ubuntu 20.04 and btrfs 5.4.1 

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<294e8449-383f-1c90-62be-fb618332862e@cobb.uk.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-12 16:16                     ` Ulli Horlacher
@ 2021-07-12 22:56                       ` g.btrfs
  2021-07-13  7:37                         ` Ulli Horlacher
  0 siblings, 1 reply; 94+ messages in thread
From: g.btrfs @ 2021-07-12 22:56 UTC (permalink / raw)
  To: Btrfs BTRFS


On 12/07/2021 17:16, Ulli Horlacher wrote:
> On Mon 2021-07-12 (14:06), Graham Cobb wrote:
> 
>>>>> root@tsmsrvj:# snaprotate -v test 5 /data/fex/spool
>>>>> $ btrfs subvolume snapshot -r /data/fex/spool /data/fex/spool/.snapshot/2021-07-10_0849.test
>>>>> Create a readonly snapshot of '/data/fex/spool' in '/data/fex/spool/.snapshot/2021-07-10_0849.test'
>>>>
>>>> I think this might be the source of the problem. Nested snapshots are
>>>> not a good idea; they cause various kinds of confusion.
>>>
>>> I do not have nested snapshots anywhere.
>>> /data/fex/spool is not a snapshot.
>>
>> But it is the subvolume which is being snapshotted. What happens if you
>> put the snapshots somewhere that is not part of that subvolume? For
>> example, create /data/fex/snapshots, snapshot /data/fex/spool into a
>> snapshot in /data/fex/snapshots/spool/2021-07-10_0849.test, export
>> /data/fex/snapshots using NFS and mount /data/fex/snapshots on the client?
> 
> Same problem:
> 
> root@tsmsrvj:/etc# mount | grep data
> /dev/sdb1 on /data type btrfs (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
> 
> root@tsmsrvj:/etc# mkdir /data/snapshots /nfs/localhost/snapshots
> 
> root@tsmsrvj:/etc# btrfs subvolume snapshot -r /data/fex/spool /data/snapshots/fex_1
> Create a readonly snapshot of '/data/fex/spool' in '/data/snapshots/fex_1'
> 
> root@tsmsrvj:/etc# btrfs subvolume snapshot -r /data/fex/spool /data/snapshots/fex_2
> Create a readonly snapshot of '/data/fex/spool' in '/data/snapshots/fex_2'
> 
> root@tsmsrvj:/etc# btrfs subvolume list /data
> ID 257 gen 1558 top level 5 path fex
> ID 270 gen 1557 top level 257 path fex/spool
> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> ID 394 gen 1470 top level 270 path fex/spool/.snapshot/2021-07-10_0849.test
> ID 399 gen 1554 top level 270 path fex/spool/.snapshot/2021-07-12_1747.test
> ID 400 gen 1556 top level 5 path snapshots/fex_1
> ID 401 gen 1557 top level 5 path snapshots/fex_2
> 
> root@tsmsrvj:/etc# grep localhost /etc/exports 
> /data/fex       localhost(rw,async,no_subtree_check,no_root_squash,crossmnt)
> /data/snapshots localhost(rw,async,no_subtree_check,no_root_squash,crossmnt)
> 
> ## ==> no nested subvolumes! different nfs exports
> 
> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
> root@tsmsrvj:/etc# mount | grep localhost
> localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> 
> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots /nfs/localhost/snapshots
> root@tsmsrvj:/etc# mount | grep localhost
> localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> localhost:/data/fex on /nfs/localhost/snapshots type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> 
> ## why localhost:/data/fex twice??
> 
> root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
> du: WARNING: Circular directory structure.
> This almost certainly means that you have a corrupted file system.
> NOTIFY YOUR SYSTEM MANAGER.
> The following directory is part of the cycle:
>   /nfs/localhost/snapshots/spool

Sure. But it makes the useful operations work. du, find, ls -R, etc. all
work properly on /nfs/localhost/fex.

When I go looking in the snapshots I am generally looking for which
version of a particular file I need to restore. For example, maybe I
want to find an old version of /nfs/localhost/fex/spool/some/file. I
would then find the best snapshot to use with:

ls -l /nfs/localhost/fex_snapshots/spool_*/some/file

which might show something like:

-rw-r--r-- 1 cobb me 2.8K 2018-04-03
/nfs/localhost/fex_snapshots/spool_20210703/some/file
-rw-r--r-- 1 cobb me 7 2021-07-06
/nfs/localhost/fex_snapshots/spool_20210706/some/file
-rw-r--r-- 1 cobb me 25 2021-07-12
/nfs/localhost/fex_snapshots/spool_20210712/some/file

So I could tell I need to restore the version from spool_20210703 if I
need the one with the old data in it, which got lost a few days ago.

This is exactly how I use NFS to access my btrbk snapshots stored on the
backup server. Of course, if you need to restore a whole subvolume you
are better off using btrfs send/receive to bring the snapshot back,
instead of using NFS - that preserves the btrfs features like reflinks.
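The send/receive restore mentioned here can be sketched roughly as follows (pseudocode; paths and hostnames are hypothetical, and the incremental form needs a common parent snapshot on both sides):

```
# full restore: stream a read-only snapshot to the machine being restored
btrfs send /data/snapshots/fex_1 | ssh client 'btrfs receive /data/restore'

# incremental restore, when a shared parent snapshot (fex_0) exists on both sides
btrfs send -p /data/snapshots/fex_0 /data/snapshots/fex_1 \
    | ssh client 'btrfs receive /data/restore'
```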

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-12 22:56                       ` g.btrfs
@ 2021-07-13  7:37                         ` Ulli Horlacher
  2021-07-19 12:06                           ` Forza
  0 siblings, 1 reply; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-13  7:37 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon 2021-07-12 (23:56), g.btrfs@cobb.uk.net wrote:

> > root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
> > du: WARNING: Circular directory structure.
> > This almost certainly means that you have a corrupted file system.
> > NOTIFY YOUR SYSTEM MANAGER.
> > The following directory is part of the cycle:
> >   /nfs/localhost/snapshots/spool
> 
> Sure. But it makes the useful operations work. du, find, ls -R, etc all
> work properly on /nfs/localhost/fex.

Properly on /nfs/localhost/fex : yes
Properly on /nfs/localhost/snapshots : NO

And the error messages are annoying!

root@tsmsrvj:/etc# exportfs -v
/data/fex       localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
/data/snapshots localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots /nfs/localhost/snapshots
root@tsmsrvj:/etc# mount | grep localhost
localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
localhost:/data/snapshots on /nfs/localhost/snapshots type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)

root@tsmsrvj:/etc# ls -la /data/snapshots /nfs/localhost/snapshots
/data/snapshots:
total 16
drwxr-xr-x 1 root root     20 Jul 13 09:19 .
drwxr-xr-x 1 root root     24 Jul 12 17:42 ..
drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_1
drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_2

/nfs/localhost/snapshots:
total 4
drwxr-xr-x 1 root root     20 Jul 13 09:19 .
drwxr-xr-x 4 root root   4096 Jul 12 17:49 ..
drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_1
drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_2

root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com

du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com

25708064        /nfs/localhost/snapshots

root@tsmsrvj:/etc# du -Hs /data/snapshots
25712896        /data/snapshots

root@tsmsrvj:/etc# ls -R /nfs/localhost/snapshots | wc -l
ls: /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com: not listing already-listed directory
ls: /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com: not listing already-listed directory
128977

root@tsmsrvj:/etc# ls -R /data/snapshots | wc -l
129021

root@tsmsrvj:/etc# ls -aR /nfs/localhost/snapshots | wc -l
ls: /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com: not listing already-listed directory
ls: /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com: not listing already-listed directory
281357

root@tsmsrvj:/etc# ls -aR /data/snapshots | wc -l
281427



More debug info:

root@tsmsrvj:/data/snapshots# find . >/tmp/local.list

root@tsmsrvj:/nfs/localhost/snapshots# find . >/tmp/nfs.list
find: File system loop detected; './fex_1/XXXXXXXXXX@gmail.com' is part of the same file system loop as '.'.
find: File system loop detected; './fex_2/XXXXXXXXXX@gmail.com' is part of the same file system loop as '.'.

root@tsmsrvj:/nfs/localhost/snapshots# diff -u /tmp/local.list /tmp/nfs.list

--- /tmp/local.list	2021-07-13 09:25:36.388084331 +0200
+++ /tmp/nfs.list	2021-07-13 09:26:02.120793230 +0200
@@ -1,25 +1,5 @@
 .
 ./fex_1
-./fex_1/XXXXXXXXXX@gmail.com
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/alist
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/filename
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/size
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/autodelete
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/keep
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/ip
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/uurl
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/useragent
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/header
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/dkey
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/speed
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/md5sum
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/download
-./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/error
-./fex_1/XXXXXXXXXX@gmail.com/.log
-./fex_1/XXXXXXXXXX@gmail.com/.log/fup
-./fex_1/XXXXXXXXXX@gmail.com/.log/fop
 ./fex_1/XXXXXXXXXX@web.de
 ./fex_1/XXXXXXXXXX@web.de/@LOCALE
 ./fex_1/XXXXXXXXXX@web.de/.log
@@ -97976,26 +97956,6 @@
 ./fex_1/.xkeys
 ./fex_1/.snapshot
 ./fex_2
-./fex_2/XXXXXXXXXX@gmail.com
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/alist
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/filename
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/size
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/autodelete
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/keep
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/ip
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/uurl
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/useragent
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/header
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/dkey
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/speed
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/md5sum
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/download
-./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/error
-./fex_2/XXXXXXXXXX@gmail.com/.log
-./fex_2/XXXXXXXXXX@gmail.com/.log/fup
-./fex_2/XXXXXXXXXX@gmail.com/.log/fop
 ./fex_2/XXXXXXXXXX@web.de
 ./fex_2/XXXXXXXXXX@web.de/@LOCALE
 ./fex_2/XXXXXXXXXX@web.de/.log

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<8506b846-4c4d-6e8f-09ee-e0f2736aac4e@cobb.uk.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
       [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
@ 2021-07-15 14:09     ` Josef Bacik
  2021-07-15 16:45       ` Christoph Hellwig
  2021-07-15 23:02       ` NeilBrown
  2021-07-15 15:45     ` J. Bruce Fields
  1 sibling, 2 replies; 94+ messages in thread
From: Josef Bacik @ 2021-07-15 14:09 UTC (permalink / raw)
  To: NeilBrown, J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba
  Cc: linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On 7/15/21 12:37 AM, NeilBrown wrote:
> 
> Hi all,
>   the problem this patch address has been discuss on both the NFS list
>   and the BTRFS list, so I'm sending this to both.  I'd be very happy for
>   people experiencing the problem (NFS export of BTRFS subvols) who are
>   in a position to rebuild the kernel on their NFS server to test this
>   and report success (or otherwise).
> 
>   While I've tried to write this patch so that it *could* land upstream
>   (and could definitely land in a distro franken-kernel if needed), I'm
>   not completely sure it *should* land upstream.  It includes some deep
>   knowledge of BTRFS into NFSD code.  This could be removed later once
>   proper APIs are designed and provided.  I can see arguments either way
>   and wonder what others think.
> 
>   BTRFS developers:  please examine the various claims I have made about
>     BTRFS and correct any that are wrong.  The observation that
>     getdents can report the same inode number for unrelated files
>     (a file and a subvol in my case) is ... interesting.
> 
>   NFSD developers: please comment on anything else.
> 
>   Others: as I said: testing would be great! :-)
> 
> Subject: [PATCH] NFSD: handle BTRFS subvolumes better.
> 
> A single BTRFS mount can present as multiple "volumes".  i.e. multiple
> sets of objects with potentially overlapping inode number spaces.
> The st_dev presented to user-space via the stat(2) family of calls is
> different for each internal volume, as is the f_fsid reported by
> statfs().
> 
> However nfsd doesn't look at st_dev or the fsid (other than for the
> export point - typically the mount point), so it doesn't notice the
> different filesystems.  Importantly, it doesn't report a different fsid
> to the NFS client.
> 
> This leads to the NFS client reusing inode numbers, and applications
> like "find" and "du" complaining, particularly when they find a
> directory with the same st_ino and st_dev as an ancestor.  This
> typically happens with the root of a sub-volume as the root of every
> volume in BTRFS has the same inode number (256).
> 
> To fix this, we need to report a different fsid for each subvolume, but
> need to use the same fsid that we currently use for the top-level
> volume.  Changing this (by rebooting a server to new code), might
> confuse the client.  I don't think it would be a major problem (stale
> filehandles shouldn't happen), but it is best avoided.
> 
> Determining the fsid to use is a bit awkward....
> 
> There is limited space in the protocol (32 bits for NFSv3, 64 for NFSv4)
> so we cannot append the subvolume fsid.  The best option seems to be to
> hash it in.  This patch uses a simple 'xor', but possibly a Jenkins hash
> would be better.
> 
> For BTRFS (and other) filesystems the current fsid is a hash (xor) of
> the uuid provided from userspace by mountd.  This is derived from the
> statfs fsid.  If we use the statfs fsid for subvolumes and xor this in,
> we risk erasing useful unique information.  So I have chosen not to use
> the statfs fsid.
> 
> Ideally we should have an API for the filesystem to report if it uses
> multiple subvolumes, and to provide a unique identifier for each.  For
> now, this patch calls exportfs_encode_fh().  If the returned fsid type
> is NOT one of those used by BTRFS, then we assume the st_fsid cannot
> change, and use the current behaviour.
> 
> If the type IS one that BTRFS uses, we use intimate knowledge of BTRFS
> to extract the root_object_id from the filehandle and record that with
> the export information.  Then when exporting an fsid, we check if
> subvolumes are enabled and if the current dentry has a different
> root_object_id to the exported volume.  If it does, the root_object_id
> is hashed (xor) into the reported fsid.
> 
> When an NFSv4 client sees that the fsid has changed, it will ask for the
> MOUNTED_ON_FILEID.  With the Linux NFS client, this is visible to
> userspace as an automount point, until content within the directory is
> accessed and the automount is triggered.  Currently the MOUNTED_ON_FILEID
> for these subvolume roots is the same as that of the root - 256.  This will
> cause find et al. to complain until the automount actually gets mounted.
> 
> So this patch reports the MOUNTED_ON_FILEID in such cases to be a magic
> number that appears to be appropriate for BTRFS:
>      BTRFS_FIRST_FREE_OBJECTID - 1
> 
> Again, we really want an API to get this from the filesystem.  Changing
> it later has no cost, so we don't need any commitment from the btrfs team
> that this is what they will provide if/when we do get such an API.
> 
> This same problem (of an automount point with a duplicate inode number)
> also exists for NFSv3.  This problem cannot be resolved completely on
> the server as NFSv3 doesn't have a well defined "MOUNTED_ON_FILEID"
> concept, but we can come close.  The inode number returned by READDIR is
> likely to be the mounted-on-fileid.  With READDIR_PLUS, two fileids are
> returned, the one from the readdir, and (optionally) another from
> 'stat'.  Linux-NFS checks these match and if not, it treats the first as
> a mounted-on-fileid.
> 
> Interestingly BTRFS getdents() *DOES* report a different inode number
> for subvol roots than is returned by stat().  These aren't actually
> unique (!!!!) but in at least one case, they are different from
> ancestors, so this is sufficient.
> 
> NFSD currently SUPPRESSES the stat information if the inode number is
> different.  This is because there is room for a file to be renamed between
> the readdir call and the lookup_one_len() prior to getattr, and the
> results could be confusing.  However for the case of a BTRFS filesystem
> with an inode number of 256, the value of reporting the difference seems
> to exceed the cost of any confusion caused by a race (if that is even
> possible in this case).
> So this patch allows the two fileids to be different when 256 is found
> on BTRFS.
> 
> With this patch a 'du' or 'find' in an NFS-mounted btrfs filesystem
> which has snapshot subvols works correctly for both NFSv4 and NFSv3.
> Fortunately the problematic programs tend to trigger READDIR_PLUS and so
> benefit from the detection of the MOUNTED_ON_FILEID which it provides.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

I'm going to restate what I think the problem is you're having just so I'm sure 
we're on the same page.

1. We export a btrfs volume via nfsd that has multiple subvolumes.
2. We run find, and when we stat a file, nfsd doesn't send along our bogus 
st_dev, it sends its own thing (I assume?).  This confuses du/find because you 
get the same inode number with different parents.

Is this correct?  If that's the case then it'd be relatively straightforward to 
add another callback into export_operations to grab this fsid right?  Hell we 
could simply return the objectid of the root since that's unique across the 
entire file system.  We already do our magic FH encoding to make sure we keep 
all this straight for NFS, another callback to give that info isn't going to 
kill us.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
       [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
  2021-07-15 14:09     ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
@ 2021-07-15 15:45     ` J. Bruce Fields
  2021-07-15 23:08       ` NeilBrown
  1 sibling, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2021-07-15 15:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chuck Lever, Chris Mason, Josef Bacik, David Sterba, linux-nfs,
	Wang Yugui, Ulli Horlacher, linux-btrfs

On Thu, Jul 15, 2021 at 02:37:52PM +1000, NeilBrown wrote:
> To fix this, we need to report a different fsid for each subvolume, but
> need to use the same fsid that we currently use for the top-level
> volume.  Changing this (by rebooting a server to new code), might
> confuse the client.  I don't think it would be a major problem (stale
> filehandles shouldn't happen), but it is best avoided.
...
> Again, we really want an API to get this from the filesystem.  Changing
> it later has no cost, so we don't need any commitment from the btrfs team
> that this is what they will provide if/when we do get such an API.

"No cost" makes me a little nervous, are we sure nobody will notice the
mounted-on-fileid changing?

Fileid and fsid changes I'd worry about more, though I wouldn't rule it
out if that'd stand in the way of a bug fix.

Thanks for looking into this.

--b.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 14:09     ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
@ 2021-07-15 16:45       ` Christoph Hellwig
  2021-07-15 17:11         ` Josef Bacik
  2021-07-15 23:02       ` NeilBrown
  1 sibling, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2021-07-15 16:45 UTC (permalink / raw)
  To: Josef Bacik
  Cc: NeilBrown, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Thu, Jul 15, 2021 at 10:09:37AM -0400, Josef Bacik wrote:
> I'm going to restate what I think the problem is you're having just so I'm
> sure we're on the same page.
> 
> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus
> st_dev, it sends its own thing (I assume?).  This confuses du/find because
> you get the same inode number with different parents.
> 
> Is this correct?  If that's the case then it'd be relatively straightforward
> to add another callback into export_operations to grab this fsid right?
> Hell we could simply return the objectid of the root since that's unique
> across the entire file system.  We already do our magic FH encoding to make
> sure we keep all this straight for NFS, another callback to give that info
> isn't going to kill us.  Thanks,

Hell no.  btrfs is broken plain and simple, and we've been arguing about
this for years without progress.  btrfs needs to stop claiming different
st_dev inside the same mount, otherwise hell is going to break loose left
right and center, and this is just one of the many cases where it does.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 16:45       ` Christoph Hellwig
@ 2021-07-15 17:11         ` Josef Bacik
  2021-07-15 17:24           ` Christoph Hellwig
  0 siblings, 1 reply; 94+ messages in thread
From: Josef Bacik @ 2021-07-15 17:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On 7/15/21 12:45 PM, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 10:09:37AM -0400, Josef Bacik wrote:
>> I'm going to restate what I think the problem is you're having just so I'm
>> sure we're on the same page.
>>
>> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
>> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus
>> st_dev, it sends its own thing (I assume?).  This confuses du/find because
>> you get the same inode number with different parents.
>>
>> Is this correct?  If that's the case then it'd be relatively straightforward
>> to add another callback into export_operations to grab this fsid right?
>> Hell we could simply return the objectid of the root since that's unique
>> across the entire file system.  We already do our magic FH encoding to make
>> sure we keep all this straight for NFS, another callback to give that info
>> isn't going to kill us.  Thanks,
> 
> Hell no.  btrfs is broken plain and simple, and we've been arguing about
> this for years without progress.  btrfs needs to stop claiming different
> st_dev inside the same mount, otherwise hell is going to break loose left
> right and center, and this is just one of the many cases where it does.
> 

Because there's no alternative.  We need a way to tell userspace they've 
wandered into a different inode namespace.  There's no argument that what we're 
doing is ugly, but there's never been a clear "do X instead".  Just a lot of 
whinging that btrfs is broken.  This makes userspace happy and is simple and 
straightforward.  I'm open to alternatives, but there have been 0 workable 
alternatives proposed in the last decade of complaining about it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 17:11         ` Josef Bacik
@ 2021-07-15 17:24           ` Christoph Hellwig
  2021-07-15 18:01             ` Josef Bacik
  0 siblings, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2021-07-15 17:24 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, NeilBrown, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> Because there's no alternative.  We need a way to tell userspace they've
> wandered into a different inode namespace.  There's no argument that what
> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> lot of whinging that btrfs is broken.  This makes userspace happy and is
> simple and straightforward.  I'm open to alternatives, but there have been 0
> workable alternatives proposed in the last decade of complaining about it.

Make sure we cross a vfsmount when crossing the "st_dev" domain so
that it is properly reported.   Suggested many times and ignored all
the time because it requires a bit of work.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 17:24           ` Christoph Hellwig
@ 2021-07-15 18:01             ` Josef Bacik
  2021-07-15 22:37               ` NeilBrown
                                 ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Josef Bacik @ 2021-07-15 18:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>> Because there's no alternative.  We need a way to tell userspace they've
>> wandered into a different inode namespace.  There's no argument that what
>> we're doing is ugly, but there's never been a clear "do X instead".  Just a
>> lot of whinging that btrfs is broken.  This makes userspace happy and is
>> simple and straightforward.  I'm open to alternatives, but there have been 0
>> workable alternatives proposed in the last decade of complaining about it.
> 
> Make sure we cross a vfsmount when crossing the "st_dev" domain so
> that it is properly reported.   Suggested many times and ignored all
> the time because it requires a bit of work.
> 

You keep telling me this but forgetting that I did all this work when you 
originally suggested it.  The problem I ran into was the automount stuff 
requires that we have a completely different superblock for every vfsmount. 
This is fine for things like nfs or samba where the automount literally points 
to a completely different mount, but doesn't work for btrfs where it's on the 
same file system.  If you have 1000 subvolumes and run sync() you're going to 
write the superblock 1000 times for the same file system.  You are going to 
reclaim inodes on the same file system 1000 times.  You are going to reclaim 
dcache on the same filesystem 1000 times.  You are also going to pin 1000 
dentries/inodes into memory whenever you wander into these things because the 
super is going to hold them open.

This is not a workable solution.  It's not a matter of simply tying into 
existing infrastructure, we'd have to completely rework how the VFS deals with 
this stuff in order to be reasonable.  And when I brought this up to Al he told 
me I was insane and we absolutely had to have a different SB for every vfsmount, 
which means we can't use vfsmount for this, which means we don't have any other 
options.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 18:01             ` Josef Bacik
@ 2021-07-15 22:37               ` NeilBrown
  2021-07-19 15:40                 ` Josef Bacik
  2021-07-19 15:49                 ` J. Bruce Fields
  2021-07-19  9:16               ` Christoph Hellwig
  2021-07-20 22:10               ` J. Bruce Fields
  2 siblings, 2 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-15 22:37 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Fri, 16 Jul 2021, Josef Bacik wrote:
> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> >> Because there's no alternative.  We need a way to tell userspace they've
> >> wandered into a different inode namespace.  There's no argument that what
> >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> >> simple and straightforward.  I'm open to alternatives, but there have been 0
> >> workable alternatives proposed in the last decade of complaining about it.
> > 
> > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > that it is properly reported.   Suggested many times and ignored all
> > the time because it requires a bit of work.
> > 
> 
> You keep telling me this but forgetting that I did all this work when you 
> originally suggested it.  The problem I ran into was the automount stuff 
> requires that we have a completely different superblock for every vfsmount. 
> This is fine for things like nfs or samba where the automount literally points 
> to a completely different mount, but doesn't work for btrfs where it's on the 
> same file system.  If you have 1000 subvolumes and run sync() you're going to 
> write the superblock 1000 times for the same file system.  You are going to 
> reclaim inodes on the same file system 1000 times.  You are going to reclaim 
> dcache on the same filesystem 1000 times.  You are going to reclaim 
> dentries/inodes into memory whenever you wander into these things because the 
> super is going to hold them open.
> 
> This is not a workable solution.  It's not a matter of simply tying into 
> existing infrastructure, we'd have to completely rework how the VFS deals with 
> this stuff in order to be reasonable.  And when I brought this up to Al he told 
> me I was insane and we absolutely had to have a different SB for every vfsmount, 
> which means we can't use vfsmount for this, which means we don't have any other 
> options.  Thanks,

When I was first looking at this, I thought that separate vfsmnts
and auto-mounting was the way to go "just like NFS".  NFS still shares a
lot between the multiple superblocks - certainly it shares the same
connection to the server.

But I dropped the idea when Bruce pointed out that nfsd is not set up to
export auto-mounted filesystems.  It needs to be able to find a
filesystem given a UUID (extracted from a filehandle), and it does this
by walking through the mount table to find one that matches.  So unless
all btrfs subvols were mounted all the time (which I wouldn't propose),
it would need major work to fix.

NFSv4 describes the fsid as having a "major" and "minor" component.
We've never treated these as having an important meaning - just extra
bits to encode uniqueness in.  Maybe we should have used "major" for the
vfsmnt, and kept "minor" for the subvol.....

The idea for a single vfsmnt exposing multiple inode-name-spaces does
appeal to me.  The "st_dev" is just part of the name, and already a
fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
same st_dev.  I see no intrinsic reason that a single mount should not
have multiple fsids, provided that a coherent picture is provided to
userspace which doesn't contain too many surprises.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 14:09     ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
  2021-07-15 16:45       ` Christoph Hellwig
@ 2021-07-15 23:02       ` NeilBrown
  1 sibling, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-15 23:02 UTC (permalink / raw)
  To: Josef Bacik
  Cc: J. Bruce Fields, Chuck Lever, Chris Mason, David Sterba,
	linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Fri, 16 Jul 2021, Josef Bacik wrote:
> 
> I'm going to restate what I think the problem is you're having just so I'm sure 
> we're on the same page.
> 
> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus 
> st_dev, it sends its own thing (I assume?).  This confuses du/find because you 
> get the same inode number with different parents.
> 
> Is this correct?  If that's the case then it'd be relatively straightforward to 
> add another callback into export_operations to grab this fsid right?  Hell we 
> could simply return the objectid of the root since that's unique across the 
> entire file system.  We already do our magic FH encoding to make sure we keep 
> all this straight for NFS, another callback to give that info isn't going to 
> kill us.  Thanks,

Fairly close.
As well as the fsid I need a "mounted-on" inode number, so one callback
to provide both would do.
If zero was reported, that would be equivalent to not providing the
callback.
- Is "u64" always enough for the subvol-id?
- Should we make these details available to user-space with a new STATX
  flag?
- Should it be a new export_operations callback, or new fields in
  "struct kstat" ??

... though having asked those question, I begin to wonder if I took a
wrong turn.
I can already get some fsid information from statfs, though it is only
64 bits and for BTRFS it combines the filesystem uuid and the subvol
id.  For that reason I avoided it.

But I'm already caching the fsid for the export-point.  If, when I find
a different fsid lower down, I xor the result with the export-point
fsid, the result would be fairly clean (the xor difference between the
two subvol ids) and could be safely mixed into the fsid we currently
report.

So all I REALLY need from btrfs is a "mounted-on" inode number, matching
what readdir() reports.
I wouldn't argue AGAINST getting cleaner fsid information.  A 128-bit
uuid and a 64bit subvol id would be ideal.
I'd rather see them as new STATX flags than a new export_operations
callback.

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 15:45     ` J. Bruce Fields
@ 2021-07-15 23:08       ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-15 23:08 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Chuck Lever, Chris Mason, Josef Bacik, David Sterba, linux-nfs,
	Wang Yugui, Ulli Horlacher, linux-btrfs

On Fri, 16 Jul 2021, J. Bruce Fields wrote:
> On Thu, Jul 15, 2021 at 02:37:52PM +1000, NeilBrown wrote:
> > To fix this, we need to report a different fsid for each subvolume, but
> > need to use the same fsid that we currently use for the top-level
> > volume.  Changing this (by rebooting a server to new code), might
> > confuse the client.  I don't think it would be a major problem (stale
> > filehandles shouldn't happen), but it is best avoided.
> ...
> > Again, we really want an API to get this from the filesystem.  Changing
> > it later has no cost, so we don't need any commitment from the btrfs team
> > that this is what they will provide if/when we do get such an API.
> 
> "No cost" makes me a little nervous, are we sure nobody will notice the
> mounted-on-fileid changing?

One cannot be 100% sure, but I cannot see how anything would depend on
it being stable.  Certainly the kernel doesn't.
'ls -i' doesn't report it - even as "ls -if".  "find -inum xx" cannot see
it.
Obviously readdir() will see it but if any application put much weight
on the number, it could already get confused when btrfs returns
non-unique numbers as I mentioned.  
I certainly wouldn't lose sleep over changing it.

NeilBrown

> 
> Fileid and fsid changes I'd worry about more, though I wouldn't rule it
> out if that'd stand in the way of a bug fix.
> 
> Thanks for looking into this.
> 
> --b.
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 18:01             ` Josef Bacik
  2021-07-15 22:37               ` NeilBrown
@ 2021-07-19  9:16               ` Christoph Hellwig
  2021-07-19 23:54                 ` NeilBrown
  2021-07-20 22:10               ` J. Bruce Fields
  2 siblings, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2021-07-19  9:16 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, NeilBrown, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> This is not a workable solution.  It's not a matter of simply tying into
> existing infrastructure, we'd have to completely rework how the VFS deals
> with this stuff in order to be reasonable.  And when I brought this up to Al
> he told me I was insane and we absolutely had to have a different SB for
> every vfsmount, which means we can't use vfsmount for this, which means we
> don't have any other options.  Thanks,

Then fix the problem another way.  The problem is known, old and keeps
breaking stuff.  Don't paper over it, fix it. 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-13  7:37                         ` Ulli Horlacher
@ 2021-07-19 12:06                           ` Forza
  2021-07-19 13:07                             ` Forza
  2021-07-27 11:27                             ` Ulli Horlacher
  0 siblings, 2 replies; 94+ messages in thread
From: Forza @ 2021-07-19 12:06 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2021-07-13 09:37, Ulli Horlacher wrote:
> On Mon 2021-07-12 (23:56), g.btrfs@cobb.uk.net wrote:
> 
>>> root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
>>> du: WARNING: Circular directory structure.
>>> This almost certainly means that you have a corrupted file system.
>>> NOTIFY YOUR SYSTEM MANAGER.
>>> The following directory is part of the cycle:
>>>    /nfs/localhost/snapshots/spool
>>
>> Sure. But it makes the useful operations work. du, find, ls -R, etc all
>> work properly on /nfs/localhost/fex.
> 
> Properly on /nfs/localhost/fex : yes
> Properly on /nfs/localhost/snapshots : NO
> 
> And the error messages are annoying!
> 
> root@tsmsrvj:/etc# exportfs -v
> /data/fex       localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
> /data/snapshots localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
> 
> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots /nfs/localhost/snapshots
> root@tsmsrvj:/etc# mount | grep localhost
> localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> localhost:/data/snapshots on /nfs/localhost/snapshots type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> 

What kind of NFS server is this? Aren't UDP mounts legacy and not 
normally used by default?

Can you switch to an nfs4 server and try again? I also still think you 
should use the fsid export option.
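
The fsid export option takes a small unique integer per export (or
"root"/a UUID; see exports(5)). A hedged example for the two exports
shown earlier in this thread; the numbers 101/102 are arbitrary and
only need to be unique per export:

```
# /etc/exports -- sketch only; fsid=0 is reserved for the NFSv4 root
/data/fex        localhost.localdomain(rw,async,no_subtree_check,no_root_squash,fsid=101)
/data/snapshots  localhost.localdomain(rw,async,no_subtree_check,no_root_squash,fsid=102)
```

Run "exportfs -ra" after editing so the server picks up the new options.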



> root@tsmsrvj:/etc# ls -la /data/snapshots /nfs/localhost/snapshots
> /data/snapshots:
> total 16
> drwxr-xr-x 1 root root     20 Jul 13 09:19 .
> drwxr-xr-x 1 root root     24 Jul 12 17:42 ..
> drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_1
> drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_2
> 
> /nfs/localhost/snapshots:
> total 4
> drwxr-xr-x 1 root root     20 Jul 13 09:19 .
> drwxr-xr-x 4 root root   4096 Jul 12 17:49 ..
> drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_1
> drwxr-xr-x 1 fex  fex  261964 Mar  7 14:53 fex_2
> 
> root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
> du: WARNING: Circular directory structure.
> This almost certainly means that you have a corrupted file system.
> NOTIFY YOUR SYSTEM MANAGER.
> The following directory is part of the cycle:
>    /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com
> 
> du: WARNING: Circular directory structure.
> This almost certainly means that you have a corrupted file system.
> NOTIFY YOUR SYSTEM MANAGER.
> The following directory is part of the cycle:
>    /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com
> 
> 25708064        /nfs/localhost/snapshots
> 
> root@tsmsrvj:/etc# du -Hs /data/snapshots
> 25712896        /data/snapshots
> 
> root@tsmsrvj:/etc# ls -R /nfs/localhost/snapshots | wc -l
> ls: /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com: not listing already-listed directory
> ls: /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com: not listing already-listed directory
> 128977
> 
> root@tsmsrvj:/etc# ls -R /data/snapshots | wc -l
> 129021
> 
> root@tsmsrvj:/etc# ls -aR /nfs/localhost/snapshots | wc -l
> ls: /nfs/localhost/snapshots/fex_1/XXXXXXXXXX@gmail.com: not listing already-listed directory
> ls: /nfs/localhost/snapshots/fex_2/XXXXXXXXXX@gmail.com: not listing already-listed directory
> 281357
> 
> root@tsmsrvj:/etc# ls -aR /data/snapshots | wc -l
> 281427
> 
> 
> 
> More debug info:
> 
> root@tsmsrvj:/data/snapshots# find . >/tmp/local.list
> 
> root@tsmsrvj:/nfs/localhost/snapshots# find . >/tmp/nfs.list
> find: File system loop detected; './fex_1/XXXXXXXXXX@gmail.com' is part of the same file system loop as '.'.
> find: File system loop detected; './fex_2/XXXXXXXXXX@gmail.com' is part of the same file system loop as '.'.
> 
> root@tsmsrvj:/nfs/localhost/snapshots# diff -u /tmp/local.list /tmp/nfs.list
> 
> --- /tmp/local.list	2021-07-13 09:25:36.388084331 +0200
> +++ /tmp/nfs.list	2021-07-13 09:26:02.120793230 +0200
> @@ -1,25 +1,5 @@
>   .
>   ./fex_1
> -./fex_1/XXXXXXXXXX@gmail.com
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/alist
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/filename
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/size
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/autodelete
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/keep
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/ip
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/uurl
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/useragent
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/header
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/dkey
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/speed
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/md5sum
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/download
> -./fex_1/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/error
> -./fex_1/XXXXXXXXXX@gmail.com/.log
> -./fex_1/XXXXXXXXXX@gmail.com/.log/fup
> -./fex_1/XXXXXXXXXX@gmail.com/.log/fop
>   ./fex_1/XXXXXXXXXX@web.de
>   ./fex_1/XXXXXXXXXX@web.de/@LOCALE
>   ./fex_1/XXXXXXXXXX@web.de/.log
> @@ -97976,26 +97956,6 @@
>   ./fex_1/.xkeys
>   ./fex_1/.snapshot
>   ./fex_2
> -./fex_2/XXXXXXXXXX@gmail.com
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/alist
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/filename
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/size
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/autodelete
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/keep
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/ip
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/uurl
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/useragent
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/header
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/dkey
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/speed
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/md5sum
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/download
> -./fex_2/XXXXXXXXXX@gmail.com/XXXXXXXXXX@pi2.uni-stuttgart.de/origin-8.5.1-SR2.zip/error
> -./fex_2/XXXXXXXXXX@gmail.com/.log
> -./fex_2/XXXXXXXXXX@gmail.com/.log/fup
> -./fex_2/XXXXXXXXXX@gmail.com/.log/fop
>   ./fex_2/XXXXXXXXXX@web.de
>   ./fex_2/XXXXXXXXXX@web.de/@LOCALE
>   ./fex_2/XXXXXXXXXX@web.de/.log
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-19 12:06                           ` Forza
@ 2021-07-19 13:07                             ` Forza
  2021-07-19 13:35                               ` Forza
  2021-07-27 11:27                             ` Ulli Horlacher
  1 sibling, 1 reply; 94+ messages in thread
From: Forza @ 2021-07-19 13:07 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2021-07-19 14:06, Forza wrote:
> 
> 
> On 2021-07-13 09:37, Ulli Horlacher wrote:
>> On Mon 2021-07-12 (23:56), g.btrfs@cobb.uk.net wrote:
>>
>>>> root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
>>>> du: WARNING: Circular directory structure.
>>>> This almost certainly means that you have a corrupted file system.
>>>> NOTIFY YOUR SYSTEM MANAGER.
>>>> The following directory is part of the cycle:
>>>>    /nfs/localhost/snapshots/spool
>>>
>>> Sure. But it makes the useful operations work. du, find, ls -R, etc all
>>> work properly on /nfs/localhost/fex.
>>
>> Properly on /nfs/localhost/fex : yes
>> Properly on /nfs/localhost/snapshots : NO
>>
>> And the error messages are annoying!
>>
>> root@tsmsrvj:/etc# exportfs -v
>> /data/fex       
>> localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) 
>>
>> /data/snapshots 
>> localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) 
>>
>>
>> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
>> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots 
>> /nfs/localhost/snapshots
>> root@tsmsrvj:/etc# mount | grep localhost
>> localhost:/data/fex on /nfs/localhost/fex type nfs 
>> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1) 
>>
>> localhost:/data/snapshots on /nfs/localhost/snapshots type nfs 
>> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1) 
>>
>>
> 
> What kind of NFS server is this? Aren't UDP mounts legacy and not 
> normally used by default?
> 
> Can you switch to an nfs4 server and try again? I also still think you 
> should use fsid export option.
> 
> 
> 
I'm replying to myself here because I booted up a VM with Fedora 34 and 
tested a setup similar to Mr Horlacher's and can reproduce the errors.

Setup:
1) create a subvolume /mnt/rootvol/nfs
2) create some snapshots:
btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-1
btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-2

3) export as:
/mnt/rootvol/nfs/ *(fsid=1234,no_root_squash)

4) mount -o vers=4 localhost:/mnt/rootvol/nfs /media/nfs-mnt/
5) "du -sh /media/nfs-mnt" fails with
"WARNING: Circular directory structure."

6) "ls -alR /media/nfs-mnt" fails with
"not listing already-listed directory"
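
The six steps above, collected into one root shell session (the loop-device
scratch filesystem and the exact mount points are my own additions, to make
the reproduction self-contained):

```
# truncate -s 2G /tmp/btrfs.img && mkfs.btrfs -q /tmp/btrfs.img
# mkdir -p /mnt/rootvol /media/nfs-mnt
# mount -o loop /tmp/btrfs.img /mnt/rootvol
# btrfs subvolume create /mnt/rootvol/nfs
# mkdir /mnt/rootvol/nfs/.snapshots
# btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-1
# btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-2
# echo '/mnt/rootvol/nfs/ *(fsid=1234,no_root_squash)' >> /etc/exports
# exportfs -ra
# mount -o vers=4 localhost:/mnt/rootvol/nfs /media/nfs-mnt
# du -sh /media/nfs-mnt    # -> "WARNING: Circular directory structure."
# ls -alR /media/nfs-mnt   # -> "not listing already-listed directory"
```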

In addition I have tried with various export options such as crossmnt, 
nohide and subtree_check. They do not improve the situation.

Also the behaviour is the same with nfs3 as with nfs4.

Full outputs are available at https://paste.ee/p/pkHLh


* Re: cannot use btrfs for nfs server
  2021-07-19 13:07                             ` Forza
@ 2021-07-19 13:35                               ` Forza
  0 siblings, 0 replies; 94+ messages in thread
From: Forza @ 2021-07-19 13:35 UTC (permalink / raw)
  To: Btrfs BTRFS



On 2021-07-19 15:07, Forza wrote:
> 
> 
> On 2021-07-19 14:06, Forza wrote:
>>
>>
>> On 2021-07-13 09:37, Ulli Horlacher wrote:
>>> On Mon 2021-07-12 (23:56), g.btrfs@cobb.uk.net wrote:
>>>
>>>>> root@tsmsrvj:/etc# du -Hs /nfs/localhost/snapshots
>>>>> du: WARNING: Circular directory structure.
>>>>> This almost certainly means that you have a corrupted file system.
>>>>> NOTIFY YOUR SYSTEM MANAGER.
>>>>> The following directory is part of the cycle:
>>>>>    /nfs/localhost/snapshots/spool
>>>>
>>>> Sure. But it makes the useful operations work. du, find, ls -R, etc all
>>>> work properly on /nfs/localhost/fex.
>>>
>>> Properly on /nfs/localhost/fex : yes
>>> Properly on /nfs/localhost/snapshots : NO
>>>
>>> And the error messages are annoying!
>>>
>>> root@tsmsrvj:/etc# exportfs -v
>>> /data/fex 
>>> localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) 
>>>
>>> /data/snapshots 
>>> localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash) 
>>>
>>>
>>> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex 
>>> /nfs/localhost/fex
>>> root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots 
>>> /nfs/localhost/snapshots
>>> root@tsmsrvj:/etc# mount | grep localhost
>>> localhost:/data/fex on /nfs/localhost/fex type nfs 
>>> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1) 
>>>
>>> localhost:/data/snapshots on /nfs/localhost/snapshots type nfs 
>>> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1) 
>>>
>>>
>>
>> What kind of NFS server is this? Aren't UDP mounts legacy and not 
>> normally used by default?
>>
>> Can you switch to an nfs4 server and try again? I also still think you 
>> should use fsid export option.
>>
>>
>>
> I'm replying to myself here because I booted up a VM with Fedora 34 and 
> tested a setup similar to Mr Horlacher's and can reproduce the errors.
> 
> Setup:
> 1) create a subvolume /mnt/rootvol/nfs
> 2) create some snapshots:
> btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-1
> btrfs sub snap /mnt/rootvol/nfs /mnt/rootvol/nfs/.snapshots/nfs-2
> 
> 3) export as:
> /mnt/rootvol/nfs/ *(fsid=1234,no_root_squash)
> 
> 4) mount -o vers=4 localhost:/mnt/rootvol/nfs /media/nfs-mnt/
> 5) "du -sh /media/nfs-mnt" fails with
> "WARNING: Circular directory structure."
> 
> 6) "ls -alR /media/nfs-mnt" fails with
> "not listing already-listed directory"
> 
> In addition I have tried with various export options such as crossmnt, 
> nohide and subtree_check. They do not improve the situation.
> 
> Also the behaviour is the same with nfs3 as with nfs4.
> 
> Full outputs are available at https://paste.ee/p/pkHLh

Perhaps the problem is that inode numbers are re-used inside snapshots 
and that nfsd doesn't understand how to handle this properly?

# ls -ila /media/nfs-mnt/
total 0
256 drwxr-xr-x. 1 root root 80 Jul 19 14:17 .
270 drwxr-xr-x. 1 root root 14 Jul 19 14:21 ..
259 -rw-r--r--. 1 root root  0 Jul 19 14:17 bar
261 -rw-r--r--. 1 root root  0 Jul 19 14:17 file1
262 -rw-r--r--. 1 root root  0 Jul 19 14:17 file2
263 -rw-r--r--. 1 root root  0 Jul 19 14:17 file3
258 -rw-r--r--. 1 root root  0 Jul 19 14:17 foo
257 drwxr-xr-x. 1 root root 30 Jul 19 15:02 .snapshots
260 -rw-r--r--. 1 root root  0 Jul 19 14:17 somefiles

# ls -ila /media/nfs-mnt/.snapshots/nfs-2/
total 0
256 drwxr-xr-x. 1 root root 80 Jul 19 14:17 .
257 drwxr-xr-x. 1 root root 30 Jul 19 15:02 ..
259 -rw-r--r--. 1 root root  0 Jul 19 14:17 bar
261 -rw-r--r--. 1 root root  0 Jul 19 14:17 file1
262 -rw-r--r--. 1 root root  0 Jul 19 14:17 file2
263 -rw-r--r--. 1 root root  0 Jul 19 14:17 file3
258 -rw-r--r--. 1 root root  0 Jul 19 14:17 foo
257 drwxr-xr-x. 1 root root 10 Jul 19 14:17 .snapshots
260 -rw-r--r--. 1 root root  0 Jul 19 14:17 somefiles
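
For comparison, locally on the server btrfs does report distinct st_dev
values for the two trees, because it allocates an anonymous device number
per subvolume; it is only over NFS that both collapse onto the export's
single fsid. A rough sketch (paths as in the setup above, all numbers
invented):

```
# stat -c 'dev=%d ino=%i %n' /mnt/rootvol/nfs/bar /mnt/rootvol/nfs/.snapshots/nfs-2/bar
dev=49 ino=259 /mnt/rootvol/nfs/bar
dev=51 ino=259 /mnt/rootvol/nfs/.snapshots/nfs-2/bar

# stat -c 'dev=%d ino=%i %n' /media/nfs-mnt/bar /media/nfs-mnt/.snapshots/nfs-2/bar
dev=54 ino=259 /media/nfs-mnt/bar
dev=54 ino=259 /media/nfs-mnt/.snapshots/nfs-2/bar
```

That identical (st_dev, st_ino) pair on the NFS side is exactly what du
and ls interpret as a directory cycle.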


Using nfs4 exports and specifying each snapshot as its own fsid does not 
work either.

### /etc/exports
/mnt/rootvol/nfs/ *(fsid=root,no_root_squash,no_subtree_check)
/mnt/rootvol/nfs/.snapshots/nfs-1 
*(fsid=1000,no_root_squash,no_subtree_check)
/mnt/rootvol/nfs/.snapshots/nfs-2 
*(fsid=2000,no_root_squash,no_subtree_check)
/mnt/rootvol/nfs/.snapshots/nfs-3 
*(fsid=3000,no_root_squash,no_subtree_check)

# ls -laRi nfs-mnt/
nfs-mnt/:
total 0
256 drwxr-xr-x. 1 root root 80 Jul 19 14:17 .
270 drwxr-xr-x. 1 root root 14 Jul 19 14:21 ..
259 -rw-r--r--. 1 root root  0 Jul 19 14:17 bar
261 -rw-r--r--. 1 root root  0 Jul 19 14:17 file1
262 -rw-r--r--. 1 root root  0 Jul 19 14:17 file2
263 -rw-r--r--. 1 root root  0 Jul 19 14:17 file3
258 -rw-r--r--. 1 root root  0 Jul 19 14:17 foo
257 drwxr-xr-x. 1 root root 30 Jul 19 15:02 .snapshots
260 -rw-r--r--. 1 root root  0 Jul 19 14:17 somefiles

nfs-mnt/.snapshots:
total 0
257 drwxr-xr-x. 1 root root 30 Jul 19 15:02 .
256 drwxr-xr-x. 1 root root 80 Jul 19 14:17 ..
256 drwxr-xr-x. 1 root root 56 Jul 19 15:02 nfs-1
256 drwxr-xr-x. 1 root root 80 Jul 19 14:17 nfs-2
256 drwxr-xr-x. 1 root root 86 Jul 19 15:03 nfs-3
ls: nfs-mnt/.snapshots/nfs-1: not listing already-listed directory
ls: nfs-mnt/.snapshots/nfs-2: not listing already-listed directory
ls: nfs-mnt/.snapshots/nfs-3: not listing already-listed directory


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 22:37               ` NeilBrown
@ 2021-07-19 15:40                 ` Josef Bacik
  2021-07-19 20:00                   ` J. Bruce Fields
  2021-07-19 15:49                 ` J. Bruce Fields
  1 sibling, 1 reply; 94+ messages in thread
From: Josef Bacik @ 2021-07-19 15:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On 7/15/21 6:37 PM, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
>> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
>>> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>>>> Because there's no alternative.  We need a way to tell userspace they've
>>>> wandered into a different inode namespace.  There's no argument that what
>>>> we're doing is ugly, but there's never been a clear "do X instead".  Just a
>>>> lot of whinging that btrfs is broken.  This makes userspace happy and is
>>>> simple and straightforward.  I'm open to alternatives, but there have been 0
>>>> workable alternatives proposed in the last decade of complaining about it.
>>>
>>> Make sure we cross a vfsmount when crossing the "st_dev" domain so
>>> that it is properly reported.   Suggested many times and ignored all
>>> the time because it requires a bit of work.
>>>
>>
>> You keep telling me this but forgetting that I did all this work when you
>> originally suggested it.  The problem I ran into was the automount stuff
>> requires that we have a completely different superblock for every vfsmount.
>> This is fine for things like nfs or samba where the automount literally points
>> to a completely different mount, but doesn't work for btrfs where it's on the
>> same file system.  If you have 1000 subvolumes and run sync() you're going to
>> write the superblock 1000 times for the same file system.  You are going to
>> reclaim inodes on the same file system 1000 times.  You are going to reclaim
>> dcache on the same filesystem 1000 times.  You are also going to pin 1000
>> dentries/inodes into memory whenever you wander into these things because the
>> super is going to hold them open.
>>
>> This is not a workable solution.  It's not a matter of simply tying into
>> existing infrastructure, we'd have to completely rework how the VFS deals with
>> this stuff in order to be reasonable.  And when I brought this up to Al he told
>> me I was insane and we absolutely had to have a different SB for every vfsmount,
>> which means we can't use vfsmount for this, which means we don't have any other
>> options.  Thanks,
> 
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS".  NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
> 
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.  It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches.  So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
> 
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in.  Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....
> 
> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me.  The "st_dev" is just part of the name, and already a
> fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
> same st_dev.  I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
> 

Ok so setting aside btrfs for the moment, how does NFS deal with exporting a 
directory that has multiple other file systems under that tree?  I assume the 
same sort of problem doesn't occur, but why is that?  Is it because it's a 
different vfsmount/sb or is there some other magic making this work?  Thanks,

Josef


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 22:37               ` NeilBrown
  2021-07-19 15:40                 ` Josef Bacik
@ 2021-07-19 15:49                 ` J. Bruce Fields
  2021-07-20  0:02                   ` NeilBrown
  1 sibling, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2021-07-19 15:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Josef Bacik, Christoph Hellwig, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
> > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > >> Because there's no alternative.  We need a way to tell userspace they've
> > >> wandered into a different inode namespace.  There's no argument that what
> > >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> > >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> > >> simple and straightforward.  I'm open to alternatives, but there have been 0
> > >> workable alternatives proposed in the last decade of complaining about it.
> > > 
> > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > that it is properly reported.   Suggested many times and ignored all
> > > the time because it requires a bit of work.
> > > 
> > 
> > You keep telling me this but forgetting that I did all this work when you 
> > originally suggested it.  The problem I ran into was the automount stuff 
> > requires that we have a completely different superblock for every vfsmount. 
> > This is fine for things like nfs or samba where the automount literally points 
> > to a completely different mount, but doesn't work for btrfs where it's on the 
> > same file system.  If you have 1000 subvolumes and run sync() you're going to 
> > write the superblock 1000 times for the same file system.  You are going to 
> > reclaim inodes on the same file system 1000 times.  You are going to reclaim 
> > dcache on the same filesystem 1000 times.  You are also going to pin 1000 
> > dentries/inodes into memory whenever you wander into these things because the 
> > super is going to hold them open.
> > 
> > This is not a workable solution.  It's not a matter of simply tying into 
> > existing infrastructure, we'd have to completely rework how the VFS deals with 
> > this stuff in order to be reasonable.  And when I brought this up to Al he told 
> > me I was insane and we absolutely had to have a different SB for every vfsmount, 
> > which means we can't use vfsmount for this, which means we don't have any other 
> > options.  Thanks,
> 
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS".  NFS still shares a
> lot between the multiple superblock - certainly it shares the same
> connection to the server.
> 
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.

Yes.  I wish it was....  But we'd need some way to look up a
not-currently-mounted filesystem by filehandle:

> It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches.  So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
> 
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in.  Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....

So nfsd would use the "major" ID to find the parent export, and then
btrfs would use the "minor" ID to identify the subvolume?

--b.

> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me.  The "st_dev" is just part of the name, and already a
> fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
> same st_dev.  I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19 15:40                 ` Josef Bacik
@ 2021-07-19 20:00                   ` J. Bruce Fields
  2021-07-19 20:44                     ` Josef Bacik
  0 siblings, 1 reply; 94+ messages in thread
From: J. Bruce Fields @ 2021-07-19 20:00 UTC (permalink / raw)
  To: Josef Bacik
  Cc: NeilBrown, Christoph Hellwig, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
> Ok so setting aside btrfs for the moment, how does NFS deal with
> exporting a directory that has multiple other file systems under
> that tree?  I assume the same sort of problem doesn't occur, but why
> is that?  Is it because it's a different vfsmount/sb or is there
> some other magic making this work?  Thanks,

There are two main ways an NFS client can look up a file: by name or by
filehandle.  The former's the normal filesystem directory lookup that
we're used to.  If the name refers to a mountpoint, the server can cross
into the mounted filesystem like anyone else.

It's the lookup by filehandle that's interesting.  Typically the
filehandle includes a UUID and an inode number.  The server looks up the
UUID with some help from mountd, and that gives a superblock that nfsd
can use for the inode lookup.

As Neil says, mountd does that basically by searching among mounted
filesystems for one with that uuid.

So if you wanted to be able to handle a uuid for a filesystem that's not
even mounted yet, you'd need some new mechanism to look up such uuids.

That's something we don't currently support but that we'd need to
support if BTRFS subvolumes were automounted.  (And it might have other
uses as well.)
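
For a plain single-device export, the uuid lookup described above can be
approximated by hand (export path, device name, and uuid are all invented
here):

```
# findmnt -no SOURCE /srv/export
/dev/sda2
# blkid -o value -s UUID /dev/sda2
2f1a3d44-9c0b-4e8e-9a57-6b2f4c7d8e10
```

mountd in effect performs that walk over /proc/self/mounts for the fsid in
each filehandle, so a subvolume that never appears in the mount table can
never be matched.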

But I'm not entirely sure if that answers your question....

--b.


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19 20:00                   ` J. Bruce Fields
@ 2021-07-19 20:44                     ` Josef Bacik
  2021-07-19 23:53                       ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Josef Bacik @ 2021-07-19 20:44 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: NeilBrown, Christoph Hellwig, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
>> Ok so setting aside btrfs for the moment, how does NFS deal with
>> exporting a directory that has multiple other file systems under
>> that tree?  I assume the same sort of problem doesn't occur, but why
>> is that?  Is it because it's a different vfsmount/sb or is there
>> some other magic making this work?  Thanks,
> 
> There are two main ways an NFS client can look up a file: by name or by
> filehandle.  The former's the normal filesystem directory lookup that
> we're used to.  If the name refers to a mountpoint, the server can cross
> into the mounted filesystem like anyone else.
> 
> It's the lookup by filehandle that's interesting.  Typically the
> filehandle includes a UUID and an inode number.  The server looks up the
> UUID with some help from mountd, and that gives a superblock that nfsd
> can use for the inode lookup.
> 
> As Neil says, mountd does that basically by searching among mounted
> filesystems for one with that uuid.
> 
> So if you wanted to be able to handle a uuid for a filesystem that's not
> even mounted yet, you'd need some new mechanism to look up such uuids.
> 
> That's something we don't currently support but that we'd need to
> support if BTRFS subvolumes were automounted.  (And it might have other
> uses as well.)
> 
> But I'm not entirely sure if that answers your question....
> 

Right, because in btrfs we handle the filehandles ourselves via the 
export_operations and encode the subvolume IDs into them to make sure we 
can always do the proper lookup.

I suppose the real problem is that NFS is exposing the inode->i_ino to the 
client without understanding that it's on a different subvolume.

Our trick of simply allocating an anonymous bdev every time you wander into a 
subvolume to get a unique st_dev doesn't help you guys because you are looking 
for mounted file systems.

I'm not concerned about the FH case, because for that it's already been crafted 
by btrfs and we know what to do with it, so it's always going to be correct.

The actual problem is that we can do

getattr(/file1)
getattr(/snap/file1)

on the client and the NFS server just blindly sends i_ino with the same fsid 
because / and /snap are the same fsid.
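
Seen from the client, something like this is the result (mount point and
numbers invented for illustration):

```
client# stat -c 'dev=%d ino=%i %n' /mnt/file1 /mnt/snap/file1
dev=54 ino=258 /mnt/file1
dev=54 ino=258 /mnt/snap/file1
```

Two distinct files reporting the same (st_dev, st_ino) is what trips the
loop/hardlink detection in find and du.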

Which brings us back to what HCH is complaining about.  In his view if we had a 
vfsmount for /snap then you would know that it was a different fs.  However that 
would only actually work if we generated a completely different superblock and 
thus gave /snap a unique fsid, right?

If we did the automount thing, and the NFS server went down and came back up and 
got a getattr(/snap/file1) from a previously generated FH it would still work 
right, because it would come into the export_operations with the format that 
btrfs is expecting and it would be able to do the lookup.  This FH lookup would 
do the automount magic it needs to and then NFS would have the fsid it needs, 
correct?  Thanks,

Josef


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19 20:44                     ` Josef Bacik
@ 2021-07-19 23:53                       ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-19 23:53 UTC (permalink / raw)
  To: Josef Bacik
  Cc: J. Bruce Fields, Christoph Hellwig, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Tue, 20 Jul 2021, Josef Bacik wrote:
> On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> > On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
> >> Ok so setting aside btrfs for the moment, how does NFS deal with
> >> exporting a directory that has multiple other file systems under
> >> that tree?  I assume the same sort of problem doesn't occur, but why
> >> is that?  Is it because it's a different vfsmount/sb or is there
> >> some other magic making this work?  Thanks,
> > 
> > There are two main ways an NFS client can look up a file: by name or by
> > filehandle.  The former's the normal filesystem directory lookup that
> > we're used to.  If the name refers to a mountpoint, the server can cross
> > into the mounted filesystem like anyone else.
> > 
> > It's the lookup by filehandle that's interesting.  Typically the
> > filehandle includes a UUID and an inode number.  The server looks up the
> > UUID with some help from mountd, and that gives a superblock that nfsd
> > can use for the inode lookup.
> > 
> > As Neil says, mountd does that basically by searching among mounted
> > filesystems for one with that uuid.
> > 
> > So if you wanted to be able to handle a uuid for a filesystem that's not
> > even mounted yet, you'd need some new mechanism to look up such uuids.
> > 
> > That's something we don't currently support but that we'd need to
> > support if BTRFS subvolumes were automounted.  (And it might have other
> > uses as well.)
> > 
> > But I'm not entirely sure if that answers your question....
> > 
> 
> Right, because in btrfs we handle the filehandles ourselves via the 
> export_operations and encode the subvolume IDs into them to make sure we 
> can always do the proper lookup.
> 
> I suppose the real problem is that NFS is exposing the inode->i_ino to the 
> client without understanding that it's on a different subvolume.
> 
> Our trick of simply allocating an anonymous bdev every time you wander into a 
> subvolume to get a unique st_dev doesn't help you guys because you are looking 
> for mounted file systems.
> 
> I'm not concerned about the FH case, because for that it's already been crafted 
> by btrfs and we know what to do with it, so it's always going to be correct.
> 
> The actual problem is that we can do
> 
> getattr(/file1)
> getattr(/snap/file1)
> 
> on the client and the NFS server just blindly sends i_ino with the same fsid 
> because / and /snap are the same fsid.
> 
> Which brings us back to what HCH is complaining about.  In his view if we had a 
> vfsmount for /snap then you would know that it was a different fs.  However that 
> would only actually work if we generated a completely different superblock and 
> thus gave /snap a unique fsid, right?

No, I don't think it needs to be a different superblock to have a
vfsmount.  (I don't know whether it would need to be, to keep HCH happy.)

If I "mount --bind /snap /snap" then I've created a vfsmnt with the
upper and lower directories identical - same inode, same superblock.
This is an existence-proof that you don't need a separate super-block.
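
Easy to check (paths invented; the device numbers shown are illustrative):

```
# stat -c 'dev=%d %n' /snap
dev=54 /snap
# mount --bind /snap /snap
# stat -c 'dev=%d %n' /snap
dev=54 /snap
# findmnt -no SOURCE,TARGET /snap
/dev/sda2[/snap] /snap
```

A new vfsmount now exists for /snap, yet st_dev and the superblock are
unchanged.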

> 
> If we did the automount thing, and the NFS server went down and came back up and 
> got a getattr(/snap/file1) from a previously generated FH it would still work 
> right, because it would come into the export_operations with the format that 
> btrfs is expecting and it would be able to do the lookup.  This FH lookup would 
> do the automount magic it needs to and then NFS would have the fsid it needs, 
> correct?  Thanks,

Not quite.
An NFS filehandle (as generated by linux-nfsd) has two components (plus
a header).  The filesystem-part and the file-part.
The filesystem-part is managed by userspace (/usr/sbin/mountd).  The
code relies on every filesystem appearing in /proc/self/mounts.
The bytes chosen are based either on the uuid reported by 'libblkid' or on the
fsid reported by statfs(), according to a black-list of filesystems for
which libblkid is not useful.  This list includes btrfs.
The file-part is managed in the kernel using export_operations.

For any given 'struct path' in the kernel, a filehandle is generated
(conceptually) by finding the closest vfsmnt (close to inode, far from
root) and asking user-space to map that.  Then passing the inode to the
filesystem and asking it to map that.

So, in your example, if /snap were a mount point, the kernel would ask
mountd to determine the filesystem-part of /snap, and the fact that the
file-part from btrfs contained the objectid for snap would just be redundant
information.  If /snap couldn't be found in /proc/self/mounts after a
server restart, the filehandle would be stale.

If btrfs were to use automounts and create the vfsmnts that one might
normally expect, then nfsd would need there to be two different sorts of
mount points, ideally visible in /proc/mounts (maybe a new flag that
appears in the list of mount options? "internal" ??).

- there needs to be the current mountpoint which is expected to be
  present after a reboot, and is likely to introduce a new filesystem,
  and
- there are these "new" mountpoints which are on-demand and expose
  something that is (in some sense) part of the same filesystem.
  The key property that NFSd would depend on is that these mount points
  do NOT introduce a new name-space for file-handles (in the sense of
  export_operations).

To expand on that last point:
- If a filehandle is requested for an inode above the "new" mountpoint
  and another "below" the new mountpoint, they are guaranteed to be
  different.
- If a filehandle that was "below" the new mountpoint is passed to
  exportfs_decode_fh() together with the vfsmnt that was *above* the
  mountpoint, then it somehow does "the right thing".  Probably
  that would require changing exportfs_decode_fh() to return a
  'struct path' rather than just a 'struct dentry *'.

When nfsd detected one of these "internal" mountpoints during a lookup,
it would *not* call-out to user-space to create a new export, but it
*would* ensure that a new fsid was reported for all inodes in the new
vfsmnt.

NeilBrown


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19  9:16               ` Christoph Hellwig
@ 2021-07-19 23:54                 ` NeilBrown
  2021-07-20  6:23                   ` Christoph Hellwig
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-07-19 23:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, Christoph Hellwig, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Mon, 19 Jul 2021, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> > This is not a workable solution.  It's not a matter of simply tying into
> > existing infrastructure, we'd have to completely rework how the VFS deals
> > with this stuff in order to be reasonable.  And when I brought this up to Al
> > he told me I was insane and we absolutely had to have a different SB for
> > every vfsmount, which means we can't use vfsmount for this, which means we
> > don't have any other options.  Thanks,
> 
> Then fix the problem another way.  The problem is known, old and keeps
> breaking stuff.  Don't paper over it, fix it. 

Do you have any pointers to other breakage caused by this particular
behaviour of btrfs? It would help to have all requirements clearly on
the table while designing a solution.

Thanks,
NeilBrown


* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19 15:49                 ` J. Bruce Fields
@ 2021-07-20  0:02                   ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-20  0:02 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Josef Bacik, Christoph Hellwig, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Tue, 20 Jul 2021, J. Bruce Fields wrote:
> On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> > On Fri, 16 Jul 2021, Josef Bacik wrote:
> > > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > > >> Because there's no alternative.  We need a way to tell userspace they've
> > > >> wandered into a different inode namespace.  There's no argument that what
> > > >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> > > >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> > > >> simple and straightforward.  I'm open to alternatives, but there have been 0
> > > >> workable alternatives proposed in the last decade of complaining about it.
> > > > 
> > > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > > that it is properly reported.   Suggested many times and ignored all
> > > > the time beause it requires a bit of work.
> > > > 
> > > 
> > > You keep telling me this but forgetting that I did all this work when you 
> > > originally suggested it.  The problem I ran into was the automount stuff 
> > > requires that we have a completely different superblock for every vfsmount. 
> > > This is fine for things like nfs or samba where the automount literally points 
> > > to a completely different mount, but doesn't work for btrfs where it's on the 
> > > same file system.  If you have 1000 subvolumes and run sync() you're going to 
> > > write the superblock 1000 times for the same file system.  You are going to 
> > > reclaim inodes on the same file system 1000 times.  You are going to reclaim 
> > > dcache on the same filesystem 1000 times.  You are also going to pin 1000 
> > > dentries/inodes into memory whenever you wander into these things because the 
> > > super is going to hold them open.
> > > 
> > > This is not a workable solution.  It's not a matter of simply tying into 
> > > existing infrastructure, we'd have to completely rework how the VFS deals with 
> > > this stuff in order to be reasonable.  And when I brought this up to Al he told 
> > > me I was insane and we absolutely had to have a different SB for every vfsmount, 
> > > which means we can't use vfsmount for this, which means we don't have any other 
> > > options.  Thanks,
> > 
> > When I was first looking at this, I thought that separate vfsmnts
> > and auto-mounting was the way to go "just like NFS".  NFS still shares a
> > lot between the multiple superblock - certainly it shares the same
> > connection to the server.
> > 
> > But I dropped the idea when Bruce pointed out that nfsd is not set up to
> > export auto-mounted filesystems.
> 
> Yes.  I wish it was....  But we'd need some way to look up a
> not-currently-mounted filesystem by filehandle:
> 
> > It needs to be able to find a
> > filesystem given a UUID (extracted from a filehandle), and it does this
> > by walking through the mount table to find one that matches.  So unless
> > all btrfs subvols were mounted all the time (which I wouldn't propose),
> > it would need major work to fix.
> > 
> > NFSv4 describes the fsid as having a "major" and "minor" component.
> > We've never treated these as having an important meaning - just extra
> > bits to encode uniqueness in.  Maybe we should have used "major" for the
> > vfsmnt, and kept "minor" for the subvol.....
> 
> So nfsd would use the "major" ID to find the parent export, and then
> btrfs would use the "minor" ID to identify the subvolume?

Maybe, though I don't think it would be really useful - just a
thought-bubble.

As the spec doesn't define any behaviour of these two numbers, there is
no point trying to impose any.
But (as described in another email) I think we do need to clearly
differentiate between "volume" and "subvolume" in the Linux API.
We cannot really use "different mount point" to mean "different volume"
as bind mounts broke that model long ago.

I think that "different st_dev" means "different subvolume" is a core
requirement as many applications assume that.  So the question is "how
to determine if two objects in different subvolumes are still in the
same volume".  This is something that nfsd needs to know.
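The assumption in that last paragraph is worth spelling out: many tools (find, du, tar, backup software) treat a change in st_dev during a tree walk as a filesystem boundary. A minimal Python sketch of that heuristic (the helper name is mine, purely illustrative):

```python
import os
import tempfile

def same_volume(a: str, b: str) -> bool:
    """The st_dev heuristic that find/du/tar rely on: equal st_dev is
    taken to mean "same filesystem".  On btrfs each subvolume reports
    its own anonymous st_dev, so a walk that never crosses a real
    mountpoint can still see the device number change."""
    return os.stat(a).st_dev == os.stat(b).st_dev

# A plain subdirectory shares its parent's st_dev ...
d = tempfile.mkdtemp()
sub = os.path.join(d, "sub")
os.mkdir(sub)
print(same_volume(d, sub))  # True
# ... but if "sub" were a btrfs subvolume, this would print False,
# and find would report a filesystem boundary (or, over NFS, a loop).
```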

NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-19 23:54                 ` NeilBrown
@ 2021-07-20  6:23                   ` Christoph Hellwig
  2021-07-20  7:17                     ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2021-07-20  6:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Josef Bacik, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Tue, Jul 20, 2021 at 09:54:44AM +1000, NeilBrown wrote:
> Do you have any pointers to other breakage caused by this particular
> behaviour of btrfs? It would help to have all requirements clearly on
> the table while designing a solution.

A quick google find:

https://lore.kernel.org/linux-btrfs/b5e7e64a-741c-baee-bc4d-cd51ca9b3a38@gmail.com/T/
https://savannah.gnu.org/bugs/?50859
https://github.com/coreos/bugs/issues/301
https://bugs.kde.org/show_bug.cgi?id=317127
https://github.com/borgbackup/borg/issues/4009
https://bugs.python.org/issue37339
http://mail.openjdk.java.net/pipermail/nio-dev/2017-June/004292.html

and that is just the first two or three pages of trivial search results.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-20  6:23                   ` Christoph Hellwig
@ 2021-07-20  7:17                     ` NeilBrown
  2021-07-20  8:00                       ` Christoph Hellwig
  0 siblings, 1 reply; 94+ messages in thread
From: NeilBrown @ 2021-07-20  7:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, Josef Bacik, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Tue, 20 Jul 2021, Christoph Hellwig wrote:
> On Tue, Jul 20, 2021 at 09:54:44AM +1000, NeilBrown wrote:
> > Do you have any pointers to other breakage caused by this particular
> > behaviour of btrfs? It would help to have all requirements clearly on
> > the table while designing a solution.
> 
> A quick google find:
> 
> https://lore.kernel.org/linux-btrfs/b5e7e64a-741c-baee-bc4d-cd51ca9b3a38@gmail.com/T/
> https://savannah.gnu.org/bugs/?50859
> https://github.com/coreos/bugs/issues/301
> https://bugs.kde.org/show_bug.cgi?id=317127
> https://github.com/borgbackup/borg/issues/4009
> https://bugs.python.org/issue37339
> http://mail.openjdk.java.net/pipermail/nio-dev/2017-June/004292.html
> 
> and that is just the first two or three pages of trivial search results.
> 


Thanks a lot for these!  Very helpful.

The details vary, but the core problem seems to be that the device
number found in /proc/self/mountinfo is the same for all mounts from a
given btrfs filesystem, no matter which subvol happens to be found at or
beneath that mountpoint.  So it can even be that 'stat' on a mountpoint
returns different numbers to what is found for that mountpoint in
/proc/self/mountinfo.
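The mismatch can be observed directly by comparing stat(2)'s st_dev with the third field of /proc/self/mountinfo for the same mountpoint. A rough Python sketch (Linux-only; on a conventional filesystem the two numbers agree, whereas for a btrfs subvolume mount they can differ):

```python
import os

def mountinfo_dev(mountpoint: str):
    """major:minor recorded for `mountpoint` in /proc/self/mountinfo
    (third field), or None if it is not a mountpoint."""
    with open("/proc/self/mountinfo") as f:
        for line in f:
            fields = line.split()
            # fields: mount-id parent-id major:minor root mount-point ...
            if fields[4] == mountpoint:
                return fields[2]
    return None

def stat_dev(path: str) -> str:
    """major:minor as reported by stat(2) for `path`."""
    st = os.stat(path)
    return "%d:%d" % (os.major(st.st_dev), os.minor(st.st_dev))

# On a btrfs subvolume mount, stat() reports the subvolume's anonymous
# device while mountinfo shows the filesystem's device.
print(mountinfo_dev("/"), stat_dev("/"))
```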

To address these issues we would need to:
1/ make every btrfs subvol which is not already a mountpoint into an
   automount point which mounts the subvol (similar to the use of
   automount in NFS).
2/ either give each subvol a separate 'struct super_block' (which is
   apparently a bad idea) or change show_mountinfo() to allow an
   alternate dev_t to be used. e.g. some new s_op which is given
   mnt->mnt_root and returns a dev_t.  If the new s_op is not
   available, sb->s_dev is used.

For nfsd to be able to work with this, those automount points need to
have an inode in the parent filesystem with a distinct inode number, and
the mount must be marked in some way that nfsd can tell that it is
"internal".  Possibly a helper function that tests if mnt_parent has the
same mnt.mnt_sb would be sufficient, though it might be nice to export
this fact to user-space somehow.

Also exportfs_decode_fh() needs to be enhanced, probably to return a
'struct path'.

Does anything there seem unreasonable to you?

Thanks,
NeilBrown

 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-20  7:17                     ` NeilBrown
@ 2021-07-20  8:00                       ` Christoph Hellwig
  2021-07-20 23:11                         ` NeilBrown
  0 siblings, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2021-07-20  8:00 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Josef Bacik, J. Bruce Fields, Chuck Lever,
	Chris Mason, David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher,
	linux-btrfs

On Tue, Jul 20, 2021 at 05:17:12PM +1000, NeilBrown wrote:
> Does anything there seem unreasonable to you?

This is what I've been asking for for years.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-15 18:01             ` Josef Bacik
  2021-07-15 22:37               ` NeilBrown
  2021-07-19  9:16               ` Christoph Hellwig
@ 2021-07-20 22:10               ` J. Bruce Fields
  2 siblings, 0 replies; 94+ messages in thread
From: J. Bruce Fields @ 2021-07-20 22:10 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christoph Hellwig, NeilBrown, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> The problem I ran into was the automount stuff requires that we have a
> completely different superblock for every vfsmount. This is fine for
> things like nfs or samba where the automount literally points to a
> completely different mount, but doesn't work for btrfs where it's on
> the same file system.  If you have 1000 subvolumes and run sync()
> you're going to write the superblock 1000 times for the same file
> system.

Dumb question: why do you have to write the superblock 1000 times, and
why is that slower than writing to 1000 different filesystems?

> You are
> going to reclaim inodes on the same file system 1000 times.  You are
> going to reclaim dcache on the same filesystem 1000 times.  You are
> also going to pin 1000 dentries/inodes into memory whenever you
> wander into these things because the super is going to hold them
> open.

That last part at least is the same for the 1000-different-filesystems
case, isn't it?

--b.

> This is not a workable solution.  It's not a matter of simply tying
> into existing infrastructure, we'd have to completely rework how the
> VFS deals with this stuff in order to be reasonable.  And when I
> brought this up to Al he told me I was insane and we absolutely had
> to have a different SB for every vfsmount, which means we can't use
> vfsmount for this, which means we don't have any other options.
> Thanks,
> 
> Josef

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
  2021-07-20  8:00                       ` Christoph Hellwig
@ 2021-07-20 23:11                         ` NeilBrown
  0 siblings, 0 replies; 94+ messages in thread
From: NeilBrown @ 2021-07-20 23:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Josef Bacik, J. Bruce Fields, Chuck Lever, Chris Mason,
	David Sterba, linux-nfs, Wang Yugui, Ulli Horlacher, linux-btrfs

On Tue, 20 Jul 2021, Christoph Hellwig wrote:
> On Tue, Jul 20, 2021 at 05:17:12PM +1000, NeilBrown wrote:
> > Does anything there seem unreasonable to you?
> 
> This is what I've been asking for for years.
> 
> 
Excellent - we seem to be on the same page.
I'll aim to have some preliminary patches for review within a week.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: cannot use btrfs for nfs server
  2021-07-19 12:06                           ` Forza
  2021-07-19 13:07                             ` Forza
@ 2021-07-27 11:27                             ` Ulli Horlacher
  1 sibling, 0 replies; 94+ messages in thread
From: Ulli Horlacher @ 2021-07-27 11:27 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon 2021-07-19 (14:06), Forza wrote:

> > And the error messages are annoying!
> > 
> > root@tsmsrvj:/etc# exportfs -v
> > /data/fex       localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
> > /data/snapshots localhost.localdomain(rw,async,wdelay,crossmnt,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
> > 
> > root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/fex /nfs/localhost/fex
> > root@tsmsrvj:/etc# mount -o vers=3 localhost:/data/snapshots /nfs/localhost/snapshots
> > root@tsmsrvj:/etc# mount | grep localhost
> > localhost:/data/fex on /nfs/localhost/fex type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> > localhost:/data/snapshots on /nfs/localhost/snapshots type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=37961,mountproto=udp,local_lock=none,addr=127.0.0.1)
> > 
> 
> What kind of NFS server is this? 

Default Ubuntu kernel NFS-server.


> Isn't UDP mounts legacy and not normally used by default?

See above, I am using tcp!


> Can you switch to an nfs4 server and try again? I also still think you 
> should use fsid export option.

No change. The error is still there.
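
(For reference, an fsid export is set per line in /etc/exports; a sketch
of what such entries look like -- the values 101/102 are arbitrary
examples, they only need to be distinct per export and stable:)

```
/data/fex       tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=101)
/data/snapshots tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=102)
```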



-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<2b53b9dd-4353-a73e-59b3-c87b6419ebf4@tnonline.net>

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2021-07-27 11:35 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-13  3:53 any idea about auto export multiple btrfs snapshots? Wang Yugui
2021-03-10  7:46 ` nfs subvolume access? Ulli Horlacher
2021-03-10  7:59   ` Hugo Mills
2021-03-10  8:09     ` Ulli Horlacher
2021-03-10  9:35       ` Graham Cobb
2021-03-10 15:55         ` Ulli Horlacher
2021-03-10 17:29           ` Forza
2021-03-10 17:46             ` Ulli Horlacher
2021-03-10  8:17   ` Ulli Horlacher
2021-03-11  7:46   ` Ulli Horlacher
2021-07-08 22:17     ` cannot use btrfs for nfs server Ulli Horlacher
2021-07-09  0:05       ` Graham Cobb
2021-07-09  4:05         ` NeilBrown
2021-07-09  6:53         ` Ulli Horlacher
2021-07-09  7:23           ` Forza
2021-07-09  7:24             ` Hugo Mills
2021-07-09  7:34             ` Ulli Horlacher
2021-07-09 16:30               ` Chris Murphy
2021-07-10  6:35                 ` Ulli Horlacher
2021-07-11 11:41                   ` Forza
2021-07-12  7:17                     ` Ulli Horlacher
2021-07-09 16:35           ` Chris Murphy
2021-07-10  6:56             ` Ulli Horlacher
2021-07-10 22:17               ` Chris Murphy
2021-07-12  7:25                 ` Ulli Horlacher
2021-07-12 13:06                   ` Graham Cobb
2021-07-12 16:16                     ` Ulli Horlacher
2021-07-12 22:56                       ` g.btrfs
2021-07-13  7:37                         ` Ulli Horlacher
2021-07-19 12:06                           ` Forza
2021-07-19 13:07                             ` Forza
2021-07-19 13:35                               ` Forza
2021-07-27 11:27                             ` Ulli Horlacher
2021-07-09 16:06       ` Lord Vader
2021-07-10  7:03         ` Ulli Horlacher
     [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
2021-07-15 14:09     ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
2021-07-15 16:45       ` Christoph Hellwig
2021-07-15 17:11         ` Josef Bacik
2021-07-15 17:24           ` Christoph Hellwig
2021-07-15 18:01             ` Josef Bacik
2021-07-15 22:37               ` NeilBrown
2021-07-19 15:40                 ` Josef Bacik
2021-07-19 20:00                   ` J. Bruce Fields
2021-07-19 20:44                     ` Josef Bacik
2021-07-19 23:53                       ` NeilBrown
2021-07-19 15:49                 ` J. Bruce Fields
2021-07-20  0:02                   ` NeilBrown
2021-07-19  9:16               ` Christoph Hellwig
2021-07-19 23:54                 ` NeilBrown
2021-07-20  6:23                   ` Christoph Hellwig
2021-07-20  7:17                     ` NeilBrown
2021-07-20  8:00                       ` Christoph Hellwig
2021-07-20 23:11                         ` NeilBrown
2021-07-20 22:10               ` J. Bruce Fields
2021-07-15 23:02       ` NeilBrown
2021-07-15 15:45     ` J. Bruce Fields
2021-07-15 23:08       ` NeilBrown
2021-06-14 22:50 ` any idea about auto export multiple btrfs snapshots? NeilBrown
2021-06-15 15:13   ` Wang Yugui
2021-06-15 15:41     ` Wang Yugui
2021-06-16  5:47     ` Wang Yugui
2021-06-17  3:02     ` NeilBrown
2021-06-17  4:28       ` Wang Yugui
2021-06-18  0:32         ` NeilBrown
2021-06-18  7:26           ` Wang Yugui
2021-06-18 13:34             ` Wang Yugui
2021-06-19  6:47               ` Wang Yugui
2021-06-20 12:27             ` Wang Yugui
2021-06-21  4:52             ` NeilBrown
2021-06-21  5:13               ` NeilBrown
2021-06-21  8:34                 ` Wang Yugui
2021-06-22  1:28                   ` NeilBrown
2021-06-22  3:22                     ` Wang Yugui
2021-06-22  7:14                       ` Wang Yugui
2021-06-23  0:59                         ` NeilBrown
2021-06-23  6:14                           ` Wang Yugui
2021-06-23  6:29                             ` NeilBrown
2021-06-23  9:34                               ` Wang Yugui
2021-06-23 23:38                                 ` NeilBrown
2021-06-23 15:35                           ` J. Bruce Fields
2021-06-23 22:04                             ` NeilBrown
2021-06-23 22:25                               ` J. Bruce Fields
2021-06-23 23:29                                 ` NeilBrown
2021-06-23 23:41                                   ` Frank Filz
2021-06-24  0:01                                   ` J. Bruce Fields
2021-06-24 21:58                               ` Patrick Goetz
2021-06-24 23:27                                 ` NeilBrown
2021-06-21 14:35               ` Frank Filz
2021-06-21 14:55                 ` Wang Yugui
2021-06-21 17:49                   ` Frank Filz
2021-06-21 22:41                     ` Wang Yugui
2021-06-22 17:34                       ` Frank Filz
2021-06-22 22:48                         ` Wang Yugui
2021-06-17  2:15   ` Wang Yugui
