linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression with initramfs and nfsroot (appears to be in the dcache)
@ 2012-11-29 19:16 Patrick McLean
  2012-11-29 21:33 ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-29 19:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-fsdevel, linux-kernel

With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
nfs root, all accesses to /proc. /sys and /dev return EBUSY.

Bisecting finds this commit as where this was introduced:

> ee3efa91e240f513898050ef305a49a653c8ed90 is the first bad commit
> commit ee3efa91e240f513898050ef305a49a653c8ed90
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Fri Jun 8 15:59:33 2012 -0400
>
>     __d_unalias() should refuse to move mountpoints
>
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> :040000 040000 1d6ecde959d3f8252b33f4adff3c4bf1e67f2b92 992ec34563b90fb349957418f76d4673c1af4ab6 M fs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 19:16 Regression with initramfs and nfsroot (appears to be in the dcache) Patrick McLean
@ 2012-11-29 21:33 ` Al Viro
  2012-11-29 22:06   ` Patrick McLean
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2012-11-29 21:33 UTC (permalink / raw)
  To: Patrick McLean; +Cc: linux-fsdevel, linux-kernel

On Thu, Nov 29, 2012 at 11:16:59AM -0800, Patrick McLean wrote:
> With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
> nfs root, all accesses to /proc. /sys and /dev return EBUSY.

See "[PATCH] Revert "__d_unalias() should refuse to move mountpoints"
thread.  If you have a convenient reproducer, could you check if
the fixes the breakage?  If so, we'll need to look into false negatives
from nfs_same_file() in there...

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ce8cb92..55436f5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -450,7 +450,10 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
 			nfs_refresh_inode(dentry->d_inode, entry->fattr);
 			goto out;
 		} else {
-			d_drop(dentry);
+			if (d_invalidate(dentry) != 0) {
+				WARN_ON(1);
+				goto out;
+			}
 			dput(dentry);
 		}
 	}

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 21:33 ` Al Viro
@ 2012-11-29 22:06   ` Patrick McLean
  2012-11-29 22:21     ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-29 22:06 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Thu, Nov 29, 2012 at 1:33 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Nov 29, 2012 at 11:16:59AM -0800, Patrick McLean wrote:
>> With 3.6-rc1 and up, when using a (dracut) initramfs with a read-only
>> nfs root, all accesses to /proc. /sys and /dev return EBUSY.
>
> See "[PATCH] Revert "__d_unalias() should refuse to move mountpoints"
> thread.  If you have a convenient reproducer, could you check if
> the fixes the breakage?  If so, we'll need to look into false negatives
> from nfs_same_file() in there...
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index ce8cb92..55436f5 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,10 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>                         nfs_refresh_inode(dentry->d_inode, entry->fattr);
>                         goto out;
>                 } else {
> -                       d_drop(dentry);
> +                       if (d_invalidate(dentry) != 0) {
> +                               WARN_ON(1);
> +                               goto out;
> +                       }
>                         dput(dentry);
>                 }
>         }

I have a trivial reproducer and am happy to help debug in any way that
I can. That patch seems to fix the problem, and produces these
warnings in dmesg:

[    3.306483] dracut: Switching root
[    4.324378] systemd-udevd[552]: starting version 195
[    9.254972] ------------[ cut here ]------------
[    9.254981] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[    9.254983] Hardware name: Bochs
[    9.254984] Modules linked in:
[    9.254989] Pid: 676, comm: ls Not tainted 3.7.0-rc7+ #35
[    9.254990] Call Trace:
[    9.254999]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.255002]  [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[    9.255005]  [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.255009]  [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[    9.255014]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.255017]  [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.255020]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.255025]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255028]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.255031]  [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[    9.255036]  [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.255039]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255042]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255045]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.255049]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.255053]  [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[    9.255055] ---[ end trace 5e8b5f37fe752ab1 ]---
[    9.255062] ------------[ cut here ]------------
[    9.255065] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[    9.255066] Hardware name: Bochs
[    9.255067] Modules linked in:
[    9.255070] Pid: 676, comm: ls Tainted: G        W    3.7.0-rc7+ #35
[    9.255071] Call Trace:
[    9.255075]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.255077]  [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[    9.255080]  [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.255083]  [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[    9.255087]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.255089]  [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.255093]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.255096]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255099]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.255102]  [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[    9.255105]  [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.255109]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255112]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255115]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.255118]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.255121]  [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[    9.255122] ---[ end trace 5e8b5f37fe752ab2 ]---
[    9.255133] ------------[ cut here ]------------
[    9.255135] WARNING: at fs/nfs/dir.c:454
nfs_readdir_page_filler+0x1cc/0x3a2()
[    9.255136] Hardware name: Bochs
[    9.255137] Modules linked in:
[    9.255140] Pid: 676, comm: ls Tainted: G        W    3.7.0-rc7+ #35
[    9.255141] Call Trace:
[    9.255144]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.255147]  [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
[    9.255150]  [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.255153]  [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
[    9.255157]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.255159]  [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.255162]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.255166]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255169]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.255171]  [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
[    9.255175]  [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.255178]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255181]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.255184]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.255188]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.255190]  [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
[    9.255192] ---[ end trace 5e8b5f37fe752ab3 ]---

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 22:06   ` Patrick McLean
@ 2012-11-29 22:21     ` Al Viro
  2012-11-29 22:53       ` Patrick McLean
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2012-11-29 22:21 UTC (permalink / raw)
  To: Patrick McLean; +Cc: linux-fsdevel, linux-kernel

On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:

> I have a trivial reproducer and am happy to help debug in any way that
> I can. That patch seems to fix the problem, and produces these
> warnings in dmesg:
> 
> [    3.306483] dracut: Switching root
> [    4.324378] systemd-udevd[552]: starting version 195
> [    9.254972] ------------[ cut here ]------------
> [    9.254981] WARNING: at fs/nfs/dir.c:454
> nfs_readdir_page_filler+0x1cc/0x3a2()
> [    9.254983] Hardware name: Bochs
> [    9.254984] Modules linked in:
> [    9.254989] Pid: 676, comm: ls Not tainted 3.7.0-rc7+ #35
> [    9.254990] Call Trace:
> [    9.254999]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
> [    9.255002]  [<ffffffff8117de91>] ? nfs_readdir_page_filler+0x1cc/0x3a2
> [    9.255005]  [<ffffffff8117e683>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
> [    9.255009]  [<ffffffff8117e70c>] ? nfs_readdir_filler+0x1c/0x6b
> [    9.255014]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
> [    9.255017]  [<ffffffff8117e6f0>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
> [    9.255020]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
> [    9.255025]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [    9.255028]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
> [    9.255031]  [<ffffffff8117e888>] ? nfs_readdir+0x12d/0x435
> [    9.255036]  [<ffffffff8118e653>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
> [    9.255039]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [    9.255042]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
> [    9.255045]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
> [    9.255049]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
> [    9.255053]  [<ffffffff814ac769>] ? system_call_fastpath+0x16/0x1b
> [    9.255055] ---[ end trace 5e8b5f37fe752ab1 ]---

OK...  So we have differing entry->fh and NFS_FH(dentry->d_inode).  Something
like
static void dump_fh(const struct nfs_fh *fh)
{
	int i;
	printk(KERN_INFO "FH(%d)", fh->size);
	for (i = 0; i < fh->size; i++)
		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
	printk(KERN_CONT "]\n");
}
with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
that WARN_ON(1) would probably be interesting.  And probably would make
sense to print filename->name as well, to see which files it is about.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 22:21     ` Al Viro
@ 2012-11-29 22:53       ` Patrick McLean
  2012-11-29 23:43         ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-29 22:53 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On 29/11/12 02:21 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
> 
>> I have a trivial reproducer and am happy to help debug in any way that
>> I can. That patch seems to fix the problem, and produces these
>> warnings in dmesg:
> 
> OK...  So we have differing entry->fh and NFS_FH(dentry->d_inode).  Something
> like
> static void dump_fh(const struct nfs_fh *fh)
> {
> 	int i;
> 	printk(KERN_INFO "FH(%d)", fh->size);
> 	for (i = 0; i < fh->size; i++)
> 		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
> 	printk(KERN_CONT "]\n");
> }
> with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
> that WARN_ON(1) would probably be interesting.  And probably would make
> sense to print filename->name as well, to see which files it is about.
> 

Here is the output of the first of the 3 times it hits the WARN_ON (I can include all 3 if desired), with the filename.name at the end:

[    8.821503] ------------[ cut here ]------------
[    8.821512] WARNING: at fs/nfs/dir.c:463 nfs_readdir_page_filler+0x1d0/0x3d2()
[    8.821513] Hardware name: Bochs
[    8.821515] Modules linked in:
[    8.821519] Pid: 630, comm: bash Not tainted 3.7.0-rc7+ #36
[    8.821520] Call Trace:
[    8.821528]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    8.821531]  [<ffffffff8117de95>] ? nfs_readdir_page_filler+0x1d0/0x3d2
[    8.821535]  [<ffffffff8117e6b3>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    8.821538]  [<ffffffff8117e73c>] ? nfs_readdir_filler+0x1c/0x6b
[    8.821543]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    8.821546]  [<ffffffff8117e720>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    8.821549]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    8.821554]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    8.821557]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    8.821560]  [<ffffffff8117e8b8>] ? nfs_readdir+0x12d/0x435
[    8.821564]  [<ffffffff8118e683>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    8.821568]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    8.821571]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    8.821574]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    8.821577]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    8.821581]  [<ffffffff814ac7e9>] ? system_call_fastpath+0x16/0x1b
[    8.821583] ---[ end trace 89263124889205c1 ]---
[    8.821584] FH(0)]
[    8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
[    8.821601] filename: proc

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 22:53       ` Patrick McLean
@ 2012-11-29 23:43         ` Al Viro
  2012-11-30  0:19           ` Patrick McLean
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2012-11-29 23:43 UTC (permalink / raw)
  To: Patrick McLean; +Cc: linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
> On 29/11/12 02:21 PM, Al Viro wrote:
> > On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
> > 
> >> I have a trivial reproducer and am happy to help debug in any way that
> >> I can. That patch seems to fix the problem, and produces these
> >> warnings in dmesg:
> > 
> > OK...  So we have differing entry->fh and NFS_FH(dentry->d_inode).  Something
> > like
> > static void dump_fh(const struct nfs_fh *fh)
> > {
> > 	int i;
> > 	printk(KERN_INFO "FH(%d)", fh->size);
> > 	for (i = 0; i < fh->size; i++)
> > 		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
> > 	printk(KERN_CONT "]\n");
> > }
> > with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
> > that WARN_ON(1) would probably be interesting.  And probably would make
> > sense to print filename->name as well, to see which files it is about.

> [    8.821584] FH(0)]
> [    8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> [    8.821601] filename: proc

*whoa*

So we have zero entry->fh->size?  No wonder it doesn't match...  Which NFS
version it is?  entry->fh->size is set by nfs[34]_decode_dirent().

NFS folks: any ideas on best way to debug it?  The brute-force way would be
to capture all NFS traffic with tcpdump and see what's going on, but that
would be a lot of work...

Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
given.  Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
valid dentries, no matter how busy.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-29 23:43         ` Al Viro
@ 2012-11-30  0:19           ` Patrick McLean
  2012-11-30  0:35             ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-30  0:19 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On 29/11/12 03:43 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 02:53:13PM -0800, Patrick McLean wrote:
>> On 29/11/12 02:21 PM, Al Viro wrote:
>>> On Thu, Nov 29, 2012 at 02:06:22PM -0800, Patrick McLean wrote:
>>>
>>>> I have a trivial reproducer and am happy to help debug in any way that
>>>> I can. That patch seems to fix the problem, and produces these
>>>> warnings in dmesg:
>>>
>>> OK...  So we have differing entry->fh and NFS_FH(dentry->d_inode).  Something
>>> like
>>> static void dump_fh(const struct nfs_fh *fh)
>>> {
>>> 	int i;
>>> 	printk(KERN_INFO "FH(%d)", fh->size);
>>> 	for (i = 0; i < fh->size; i++)
>>> 		printk(KERN_CONT "%c%02x", i ? ' ' : '[', fh->data[i]);
>>> 	printk(KERN_CONT "]\n");
>>> }
>>> with dump_fh(entry->fh); dump_fh(NFS_FH(dentry->d_inode)); added next to
>>> that WARN_ON(1) would probably be interesting.  And probably would make
>>> sense to print filename->name as well, to see which files it is about.
> 
>> [    8.821584] FH(0)]
>> [    8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>> [    8.821601] filename: proc
> 
> *whoa*
> 
> So we have zero entry->fh->size?  No wonder it doesn't match...  Which NFS
> version it is?  entry->fh->size is set by nfs[34]_decode_dirent().

This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

> NFS folks: any ideas on best way to debug it?  The brute-force way would be
> to capture all NFS traffic with tcpdump and see what's going on, but that
> would be a lot of work...
> 
> Looks like we have READDIRPLUS attempted and succeeded, but fhandle was not
> given.  Result: nfs_prime_dcache() is doing blind d_drop() on perfectly
> valid dentries, no matter how busy.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  0:19           ` Patrick McLean
@ 2012-11-30  0:35             ` Al Viro
  2012-11-30  0:57               ` Patrick McLean
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2012-11-30  0:35 UTC (permalink / raw)
  To: Patrick McLean; +Cc: linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
> >> [    8.821584] FH(0)]
> >> [    8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
> >> [    8.821601] filename: proc
> > 
> > *whoa*
> > 
> > So we have zero entry->fh->size?  No wonder it doesn't match...  Which NFS
> > version it is?  entry->fh->size is set by nfs[34]_decode_dirent().
> 
> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.

So we have nfs3_decode_dirent(), stepping into
                /* In fact, a post_op_fh3: */
                p = xdr_inline_decode(xdr, 4);
                if (unlikely(p == NULL))
                        goto out_overflow;
                if (*p != xdr_zero) {
                        error = decode_nfs_fh3(xdr, entry->fh);
                        if (unlikely(error)) {
                                if (error == -E2BIG)
                                        goto out_truncated;
                                return error;
                        }
                } else
                        zero_nfs_fh3(entry->fh);
Interesting...  Server-side that should've been produced by
encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
which has explicit
        if (d_mountpoint(dchild))
                goto out;
resulting in ENOENT on everything that's overmounted on server.

Do you, by any chance, have the server really exporting its own root
filesystem?  Another thing to check: have nfs_prime_dcache() print
filename.name of everything that fails nfs_same_entry() and has
zero entry->fh->size, regardless of d_invalidate() results.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  0:35             ` Al Viro
@ 2012-11-30  0:57               ` Patrick McLean
  2012-11-30  1:36                 ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-30  0:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Patrick McLean, linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On 29/11/12 04:35 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:19:51PM -0800, Patrick McLean wrote:
>>>> [    8.821584] FH(0)]
>>>> [    8.821586] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
>>>> [    8.821601] filename: proc
>>>
>>> *whoa*
>>>
>>> So we have zero entry->fh->size?  No wonder it doesn't match...  Which NFS
>>> version it is?  entry->fh->size is set by nfs[34]_decode_dirent().
>>
>> This is nfs v3 over TCP on Linus git at commit e9296e89b85604862bd9ec2d54dc43edad775c0d with nfs-utils-1.2.6 userspace.
> 
> So we have nfs3_decode_dirent(), stepping into
>                 /* In fact, a post_op_fh3: */
>                 p = xdr_inline_decode(xdr, 4);
>                 if (unlikely(p == NULL))
>                         goto out_overflow;
>                 if (*p != xdr_zero) {
>                         error = decode_nfs_fh3(xdr, entry->fh);
>                         if (unlikely(error)) {
>                                 if (error == -E2BIG)
>                                         goto out_truncated;
>                                 return error;
>                         }
>                 } else
>                         zero_nfs_fh3(entry->fh);
> Interesting...  Server-side that should've been produced by
> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> which has explicit
>         if (d_mountpoint(dchild))
>                 goto out;
> resulting in ENOENT on everything that's overmounted on server.
> 
> Do you, by any chance, have the server really exporting its own root
> filesystem?  Another thing to check: have nfs_prime_dcache() print
> filename.name of everything that fails nfs_same_entry() and has
> zero entry->fh->size, regardless of d_invalidate() results.

The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).

The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
"ls /" at in single user mode.

I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
are triggering the WARN_ON, the relevant dmesg is below.

[    9.495217] entry->fh->size is 0 on: proc
[    9.495222] ------------[ cut here ]------------
[    9.495230] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[    9.495232] Hardware name: Bochs
[    9.495233] Modules linked in:
[    9.495237] Pid: 655, comm: ls Not tainted 3.7.0-rc7+ #40
[    9.495239] Call Trace:
[    9.495247]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.495250]  [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[    9.495254]  [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.495257]  [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[    9.495263]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.495266]  [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.495269]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.495274]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495277]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.495280]  [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[    9.495285]  [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.495288]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495291]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495294]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.495298]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.495302]  [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[    9.495304] ---[ end trace e502c5d56c594e85 ]---
[    9.495306] FH(0)]
[    9.495308] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a7 3c 1b 80 33 38 e3 62]
[    9.495323] filename: proc
[    9.495330] entry->fh->size is 0 on: dev
[    9.495332] ------------[ cut here ]------------
[    9.495335] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[    9.495336] Hardware name: Bochs
[    9.495337] Modules linked in:
[    9.495340] Pid: 655, comm: ls Tainted: G        W    3.7.0-rc7+ #40
[    9.495341] Call Trace:
[    9.495345]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.495348]  [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[    9.495351]  [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.495354]  [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[    9.495358]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.495361]  [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.495364]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.495368]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495371]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.495373]  [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[    9.495377]  [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.495380]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495383]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495387]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.495390]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.495393]  [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[    9.495395] ---[ end trace e502c5d56c594e86 ]---
[    9.495396] FH(0)]
[    9.495397] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 85 00 00 40 d6 39 e0 7d]
[    9.495412] filename: dev
[    9.495422] entry->fh->size is 0 on: sys
[    9.495423] ------------[ cut here ]------------
[    9.495426] WARNING: at fs/nfs/dir.c:464 nfs_readdir_page_filler+0x1ef/0x3eb()
[    9.495427] Hardware name: Bochs
[    9.495428] Modules linked in:
[    9.495430] Pid: 655, comm: ls Tainted: G        W    3.7.0-rc7+ #40
[    9.495431] Call Trace:
[    9.495435]  [<ffffffff8108534c>] ? warn_slowpath_common+0x76/0x8a
[    9.495438]  [<ffffffff8117deb4>] ? nfs_readdir_page_filler+0x1ef/0x3eb
[    9.495441]  [<ffffffff8117e6cc>] ? nfs_readdir_xdr_to_array+0x1c0/0x22d
[    9.495444]  [<ffffffff8117e755>] ? nfs_readdir_filler+0x1c/0x6b
[    9.495448]  [<ffffffff810dca9a>] ? add_to_page_cache_lru+0x2c/0x36
[    9.495450]  [<ffffffff8117e739>] ? nfs_readdir_xdr_to_array+0x22d/0x22d
[    9.495454]  [<ffffffff810dcbe3>] ? do_read_cache_page+0x7d/0x12b
[    9.495457]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495460]  [<ffffffff810dccc6>] ? read_cache_page+0x7/0x10
[    9.495463]  [<ffffffff8117e8d1>] ? nfs_readdir+0x12d/0x435
[    9.495466]  [<ffffffff8118e69b>] ? nfs3_xdr_dec_create3res+0xc5/0xc5
[    9.495470]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495473]  [<ffffffff811274f8>] ? sys_ioctl+0x7a/0x7a
[    9.495476]  [<ffffffff811277b3>] ? vfs_readdir+0x6c/0xa7
[    9.495479]  [<ffffffff811278da>] ? sys_getdents+0x7e/0xdc
[    9.495482]  [<ffffffff814ac829>] ? system_call_fastpath+0x16/0x1b
[    9.495484] ---[ end trace e502c5d56c594e87 ]---
[    9.495485] FH(0)]
[    9.495486] FH(36)[01 00 07 01 89 00 00 00 00 00 00 00 e1 21 fe c4 9e 38 44 dc bf 1b d5 95 d6 76 d6 d9 a3 0e 00 40 42 57 d3 90]
[    9.495511] filename: sys

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  0:57               ` Patrick McLean
@ 2012-11-30  1:36                 ` Al Viro
  2012-11-30  1:54                   ` Patrick McLean
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2012-11-30  1:36 UTC (permalink / raw)
  To: Patrick McLean
  Cc: Patrick McLean, linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
> > Interesting...  Server-side that should've been produced by
> > encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
> > which has explicit
> >         if (d_mountpoint(dchild))
> >                 goto out;
> > resulting in ENOENT on everything that's overmounted on server.
> > 
> > Do you, by any chance, have the server really exporting its own root
> > filesystem?  Another thing to check: have nfs_prime_dcache() print
> > filename.name of everything that fails nfs_same_entry() and has
> > zero entry->fh->size, regardless of d_invalidate() results.
> 
> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
> 
> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
> "ls /" at in single user mode.
> 
> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
> are triggering the WARN_ON, the relevant dmesg is below.

[the same /dev, /proc and /sys]

	Very interesting.  Do you have anything mounted on the corresponding
directories on server?  The picture looks like you are getting empty
fhandles in readdir+ respons for exactly the same directories that happen
to be mountpoints on client.  In any case, we shouldn't do that blind
d_drop() - empty fhandles can happen.  The only remaining question is
why do they happen on that set of entries.  From my reading of
encode_entryplus_baggage() it looks like we have compose_entry_fh()
failing for those entries and those entries alone.  One possible cause
would be d_mountpoint(dchild) being true on server.  If it is true, we
can declare the case closed; if not, I really wonder what's going on.

Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
that something is mounted on top of that directory in _some_ instance;
not necessary the exported one.  Can you slap printks on fs/nfsd/nfs3xdr.c
compose_entry_fh() failure exits and see which one triggers server-side?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  1:36                 ` Al Viro
@ 2012-11-30  1:54                   ` Patrick McLean
  2012-11-30  2:00                     ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-30  1:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Patrick McLean, linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On 29/11/12 05:36 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 04:57:19PM -0800, Patrick McLean wrote:
>>> Interesting...  Server-side that should've been produced by
>>> encode_entryplus_baggage(), which looks like failing compose_entry_fh()...
>>> which has explicit
>>>         if (d_mountpoint(dchild))
>>>                 goto out;
>>> resulting in ENOENT on everything that's overmounted on server.
>>>
>>> Do you, by any chance, have the server really exporting its own root
>>> filesystem?  Another thing to check: have nfs_prime_dcache() print
>>> filename.name of everything that fails nfs_same_entry() and has
>>> zero entry->fh->size, regardless of d_invalidate() results.
>>
>> The server is running 3.6.6 and is just exporting a subdir of an xfs filesystem (which does not happen to be the root filesystem).
>>
>> The client is running as a KVM guest on the machine that is serving the NFS. I am reproducing this by booting the guest via an initramfs, and doing
>> "ls /" at in single user mode.
>>
>> I added a check that prints the filename.name of everything that fails nfs_same_file, and it appears to just be triggered by the same filesystems that
>> are triggering the WARN_ON, the relevant dmesg is below.
> 
> [the same /dev, /proc and /sys]
> 
> 	Very interesting.  Do you have anything mounted on the corresponding
> directories on server?  The picture looks like you are getting empty
> fhandles in readdir+ respons for exactly the same directories that happen
> to be mountpoints on client.  In any case, we shouldn't do that blind
> d_drop() - empty fhandles can happen.  The only remaining question is
> why do they happen on that set of entries.  From my reading of
> encode_entryplus_baggage() it looks like we have compose_entry_fh()
> failing for those entries and those entries alone.  One possible cause
> would be d_mountpoint(dchild) being true on server.  If it is true, we
> can declare the case closed; if not, I really wonder what's going on.

Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.

Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

> Note that if the same fs is mounted elsewhere, d_mountpoint() would mean
> that something is mounted on top of that directory in _some_ instance;
> not necessary the exported one.  Can you slap printks on fs/nfsd/nfs3xdr.c
> compose_entry_fh() failure exits and see which one triggers server-side?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  1:54                   ` Patrick McLean
@ 2012-11-30  2:00                     ` Al Viro
  2012-11-30  2:33                       ` Patrick McLean
                                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Al Viro @ 2012-11-30  2:00 UTC (permalink / raw)
  To: Patrick McLean
  Cc: Patrick McLean, linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > 	Very interesting.  Do you have anything mounted on the corresponding
> > directories on server?  The picture looks like you are getting empty
> > fhandles in readdir+ respons for exactly the same directories that happen
> > to be mountpoints on client.  In any case, we shouldn't do that blind
> > d_drop() - empty fhandles can happen.  The only remaining question is
> > why do they happen on that set of entries.  From my reading of
> > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > failing for those entries and those entries alone.  One possible cause
> > would be d_mountpoint(dchild) being true on server.  If it is true, we
> > can declare the case closed; if not, I really wonder what's going on.
> 
> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
> 
> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.

OK, that settles it.  WARN_ON() and printks in the area can be dropped;
the right fix is below.  However, there's a similar place in cifs that
also needs to be dealt with and I really, really wonder why the hell do
we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
bug, but I would like to understand what's wrong with simply returning
0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
of unhashing, etc. itself.  Would make have_submounts() in there pointless
as well - we could just return 0 and let d_invalidate() take care of the
checks...  Trond?

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
 			nfs_refresh_inode(dentry->d_inode, entry->fattr);
 			goto out;
 		} else {
-			d_drop(dentry);
+			if (d_invalidate(dentry) != 0)
+				goto out;
 			dput(dentry);
 		}
 	}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  2:00                     ` Al Viro
@ 2012-11-30  2:33                       ` Patrick McLean
  2012-11-30  4:11                         ` Al Viro
  2012-11-30 13:58                       ` Myklebust, Trond
  2012-12-01  2:18                       ` Simon Kirby
  2 siblings, 1 reply; 17+ messages in thread
From: Patrick McLean @ 2012-11-30  2:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Patrick McLean, linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On 29/11/12 06:00 PM, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
>>> 	Very interesting.  Do you have anything mounted on the corresponding
>>> directories on server?  The picture looks like you are getting empty
>>> fhandles in readdir+ respons for exactly the same directories that happen
>>> to be mountpoints on client.  In any case, we shouldn't do that blind
>>> d_drop() - empty fhandles can happen.  The only remaining question is
>>> why do they happen on that set of entries.  From my reading of
>>> encode_entryplus_baggage() it looks like we have compose_entry_fh()
>>> failing for those entries and those entries alone.  One possible cause
>>> would be d_mountpoint(dchild) being true on server.  If it is true, we
>>> can declare the case closed; if not, I really wonder what's going on.
>>
>> Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
>>
>> Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
> 
> OK, that settles it.  WARN_ON() and printks in the area can be dropped;
> the right fix is below.  However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself.  Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks...  Trond?
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>  			nfs_refresh_inode(dentry->d_inode, entry->fattr);
>  			goto out;
>  		} else {
> -			d_drop(dentry);
> +			if (d_invalidate(dentry) != 0)
> +				goto out;
>  			dput(dentry);
>  		}
>  	}

Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this as well since it is a regression in 3.6.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  2:33                       ` Patrick McLean
@ 2012-11-30  4:11                         ` Al Viro
  0 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2012-11-30  4:11 UTC (permalink / raw)
  To: Patrick McLean; +Cc: linux-fsdevel, linux-kernel, Trond Myklebust, linux-nfs

On Thu, Nov 29, 2012 at 06:33:53PM -0800, Patrick McLean wrote:

> Excellent, thanks. Is there any chance this will make it to 3.7? Also we might want to cc stable@ on this as well since it is a regression in 3.6.

Definitely.  I've dropped that into vfs.git#for-linus and vfs.git#for-next
and tomorrow to Linus it goes...

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  2:00                     ` Al Viro
  2012-11-30  2:33                       ` Patrick McLean
@ 2012-11-30 13:58                       ` Myklebust, Trond
  2012-12-01 21:40                         ` Al Viro
  2012-12-01  2:18                       ` Simon Kirby
  2 siblings, 1 reply; 17+ messages in thread
From: Myklebust, Trond @ 2012-11-30 13:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Patrick McLean, Patrick McLean, linux-fsdevel, linux-kernel, linux-nfs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2377 bytes --]

On Fri, 2012-11-30 at 02:00 +0000, Al Viro wrote:
> On Thu, Nov 29, 2012 at 05:54:02PM -0800, Patrick McLean wrote:
> > > 	Very interesting.  Do you have anything mounted on the corresponding
> > > directories on server?  The picture looks like you are getting empty
> > > fhandles in readdir+ respons for exactly the same directories that happen
> > > to be mountpoints on client.  In any case, we shouldn't do that blind
> > > d_drop() - empty fhandles can happen.  The only remaining question is
> > > why do they happen on that set of entries.  From my reading of
> > > encode_entryplus_baggage() it looks like we have compose_entry_fh()
> > > failing for those entries and those entries alone.  One possible cause
> > > would be d_mountpoint(dchild) being true on server.  If it is true, we
> > > can declare the case closed; if not, I really wonder what's going on.
> > 
> > Those directories do have the server's own copies of the said directories bind mounted at the moment in a separate mount namespace.
> > 
> > Unmounting those directories on the server does appear to stop the WARN_ON from triggering.
> 
> OK, that settles it.  WARN_ON() and printks in the area can be dropped;
> the right fix is below.  However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself.  Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks...  Trond?

The reason for the choice of d_drop over d_invalidate() is the d_count
checks. It really doesn't matter whether or not the client thinks it has
users for a directory if the server is telling you that it is ESTALE. So
we force a d_drop to prevent further lookups from finding it.

IOW: It is there in order to fix the case where the user does
'rmdir("foo"); mkdir("foo")' on the server.


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30  2:00                     ` Al Viro
  2012-11-30  2:33                       ` Patrick McLean
  2012-11-30 13:58                       ` Myklebust, Trond
@ 2012-12-01  2:18                       ` Simon Kirby
  2 siblings, 0 replies; 17+ messages in thread
From: Simon Kirby @ 2012-12-01  2:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Patrick McLean, Patrick McLean, linux-fsdevel, linux-kernel,
	Trond Myklebust, linux-nfs

On Fri, Nov 30, 2012 at 02:00:48AM +0000, Al Viro wrote:

> OK, that settles it.  WARN_ON() and printks in the area can be dropped;
> the right fix is below.  However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself.  Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks...  Trond?
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>  			nfs_refresh_inode(dentry->d_inode, entry->fattr);
>  			goto out;
>  		} else {
> -			d_drop(dentry);
> +			if (d_invalidate(dentry) != 0)
> +				goto out;
>  			dput(dentry);
>  		}
>  	}

Hello,

With your previous patch (with the WARN_ON), I hit the WARN_ON() in the
test case described here: https://patchwork.kernel.org/patch/1446851/ .
The __d_move()ing mountpoint case no longer hits, and there is no longer
an EBUSY, so this seems to work for me (in 3.6, where it broke).

Simon-

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Regression with initramfs and nfsroot (appears to be in the dcache)
  2012-11-30 13:58                       ` Myklebust, Trond
@ 2012-12-01 21:40                         ` Al Viro
  0 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2012-12-01 21:40 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Patrick McLean, Patrick McLean, linux-fsdevel, linux-kernel, linux-nfs

On Fri, Nov 30, 2012 at 01:58:18PM +0000, Myklebust, Trond wrote:

> The reason for the choice of d_drop over d_invalidate() is the d_count
> checks. It really doesn't matter whether or not the client thinks it has
> users for a directory if the server is telling you that it is ESTALE. So
> we force a d_drop to prevent further lookups from finding it.
> 
> IOW: It is there in order to fix the case where the user does
> 'rmdir("foo"); mkdir("foo")' on the server.

You do realize that your have_submounts() check in there is inherently
racy, right?

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-12-01 21:40 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-29 19:16 Regression with initramfs and nfsroot (appears to be in the dcache) Patrick McLean
2012-11-29 21:33 ` Al Viro
2012-11-29 22:06   ` Patrick McLean
2012-11-29 22:21     ` Al Viro
2012-11-29 22:53       ` Patrick McLean
2012-11-29 23:43         ` Al Viro
2012-11-30  0:19           ` Patrick McLean
2012-11-30  0:35             ` Al Viro
2012-11-30  0:57               ` Patrick McLean
2012-11-30  1:36                 ` Al Viro
2012-11-30  1:54                   ` Patrick McLean
2012-11-30  2:00                     ` Al Viro
2012-11-30  2:33                       ` Patrick McLean
2012-11-30  4:11                         ` Al Viro
2012-11-30 13:58                       ` Myklebust, Trond
2012-12-01 21:40                         ` Al Viro
2012-12-01  2:18                       ` Simon Kirby

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).