[PATCH] Should we expect close-to-open consistency on directories?

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Should we expect close-to-open consistency on directories?
@ 2010-04-20  7:22 Neil Brown
  2010-04-20 13:02 ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2010-04-20  7:22 UTC (permalink / raw)
  To: Trond Myklebust, linux-nfs

Hi Trond et al,

It has come to my attention that NFS directories don't behave consistently
in terms of cache consistency.

If, on the client, you have a loop like:

 while true; do sleep 1; ls -l $dirname ; done

and then on the server you make changes to the named directory, there are
some cases where you will see changes promptly and some where you wont.

In particular, if $dirname is '.' or the name of an NFS mountpoint, then
changes can be delayed by up to acdirmax.  If it is any other path, i.e. with
a non-trivial path component that is in the NFS filesystem, then changes
are seen promptly.

This seems to me to relate to "close to open" consistency.  Of course with
directories the 'close' side isn't relevant, but I still think it should be
that when you open a directory it validates the 'change' attribute on that
directory over the wire.

However the Linux VFS never tells NFS when a directory is opened.  The
current correct behaviour for most directories is achieved through
d_revalidate == nfs_lookup_revalidate.

For '.' and mountpoints we need a different approach.  Possibly the VFS could
be changed to tell the filesystem when such a directory is opened.  However I
don't feel up to that at the moment.

An alternative is to do a revalidation in nfs_readdir as below.  i.e. when
readdir see f_pos == 0, it requests a revalidation of the page cache.
This has two problems:
1/ a seek before the first read would cause the revalidation to be skipped.
   This can be fixed by putting a similar test in nfs_llseek_dir, or maybe
   triggering off 'dir_cookie == NULL' rather than 'f_pos == 0'.
2/ A normal open/readdir sequence will validate a directory twice, once in the
   lookup and once in the readdir.  This is probably undesirable, but it is
   not clear to me how to fix it.

So: is it reasonable to view the current behaviour as 'wrong'?
    any suggestions on how to craft a less problematic fix?

Thanks,
NeilBrown

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a1f6b44..df4f0a6 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -560,6 +560,9 @@ static int nfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
 	desc->entry = &my_entry;

 	nfs_block_sillyrename(dentry);
+	if (filp->f_pos == 0)
+		/* Force attribute validity at open */
+		NFS_I(inode)->cache_validity |= NFS_INO_REVAL_PAGECACHE;
 	res = nfs_revalidate_mapping(inode, filp->f_mapping);
 	if (res < 0)
 		goto out;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-04-20  7:22 [PATCH] Should we expect close-to-open consistency on directories? Neil Brown
@ 2010-04-20 13:02 ` Trond Myklebust
  2010-04-21  7:03   ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Trond Myklebust @ 2010-04-20 13:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-nfs

On Tue, 2010-04-20 at 17:22 +1000, Neil Brown wrote: 
> Hi Trond et al,
> 
> It has come to my attention that NFS directories don't behave consistently
> in terms of cache consistency.
> 
> If, on the client, you have a loop like:
> 
>  while true; do sleep 1; ls -l $dirname ; done
> 
> and then on the server you make changes to the named directory, there are
> some cases where you will see changes promptly and some where you wont.
> 
> In particular, if $dirname is '.' or the name of an NFS mountpoint, then
> changes can be delayed by up to acdirmax.  If it is any other path, i.e. with
> a non-trivial path component that is in the NFS filesystem, then changes
> are seen promptly.
> 
> This seems to me to relate to "close to open" consistency.  Of course with
> directories the 'close' side isn't relevant, but I still think it should be
> that when you open a directory it validates the 'change' attribute on that
> directory over the wire.
> 
> However the Linux VFS never tells NFS when a directory is opened.  The
> current correct behaviour for most directories is achieved through
> d_revalidate == nfs_lookup_revalidate.
> 
> For '.' and mountpoints we need a different approach.  Possibly the VFS could
> be changed to tell the filesystem when such a directory is opened.  However I
> don't feel up to that at the moment.

I agree that mountpoints are problematic in this case, however why isn't
'.' working correctly? Is the FS_REVAL_DOT mechanism broken?

The other thing is that we should definitely expect the VFS to call
nfs_opendir() once it has opened the file.

> An alternative is to do a revalidation in nfs_readdir as below.  i.e. when
> readdir see f_pos == 0, it requests a revalidation of the page cache.
> This has two problems:
> 1/ a seek before the first read would cause the revalidation to be skipped.
>    This can be fixed by putting a similar test in nfs_llseek_dir, or maybe
>    triggering off 'dir_cookie == NULL' rather than 'f_pos == 0'.
> 2/ A normal open/readdir sequence will validate a directory twice, once in the
>    lookup and once in the readdir.  This is probably undesirable, but it is
>    not clear to me how to fix it.
> 
> 
> So: is it reasonable to view the current behaviour as 'wrong'?
>     any suggestions on how to craft a less problematic fix?

nfs_opendir() should fix case 1/, but still has the issue with case 2/.
How about just having it force a revalidation if we see that this is a
mountpoint?

Cheers
  Trond


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-04-20 13:02 ` Trond Myklebust
@ 2010-04-21  7:03   ` Neil Brown
  2010-05-06  4:13     ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2010-04-21  7:03 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, Alexander Viro

On Tue, 20 Apr 2010 09:02:01 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Tue, 2010-04-20 at 17:22 +1000, Neil Brown wrote: 
> > Hi Trond et al,
> > 
> > It has come to my attention that NFS directories don't behave consistently
> > in terms of cache consistency.
> > 
> > If, on the client, you have a loop like:
> > 
> >  while true; do sleep 1; ls -l $dirname ; done
> > 
> > and then on the server you make changes to the named directory, there are
> > some cases where you will see changes promptly and some where you wont.
> > 
> > In particular, if $dirname is '.' or the name of an NFS mountpoint, then
> > changes can be delayed by up to acdirmax.  If it is any other path, i.e. with
> > a non-trivial path component that is in the NFS filesystem, then changes
> > are seen promptly.
> > 
> > This seems to me to relate to "close to open" consistency.  Of course with
> > directories the 'close' side isn't relevant, but I still think it should be
> > that when you open a directory it validates the 'change' attribute on that
> > directory over the wire.
> > 
> > However the Linux VFS never tells NFS when a directory is opened.  The
> > current correct behaviour for most directories is achieved through
> > d_revalidate == nfs_lookup_revalidate.
> > 
> > For '.' and mountpoints we need a different approach.  Possibly the VFS could
> > be changed to tell the filesystem when such a directory is opened.  However I
> > don't feel up to that at the moment.
> 
> I agree that mountpoints are problematic in this case, however why isn't
> '.' working correctly? Is the FS_REVAL_DOT mechanism broken?

Yes, the FS_REVAL_DOT mechanism is broken.
Specifically, when you open ".",  ->d_revalidate is called by link_path_walk,
but LOOKUP_PARENT is set, and LOOKUP_OPEN is not set, so
nfs_lookup_verify_inode doesn't force a revalidate.

Then in do_last(), LOOKUP_PARENT is no longer set, and LOOKUP_OPEN is, but
do_last doesn't bother calling ->d_revalidate for LAST_DOT.

I verified this understanding with the following patch which causes 
"ls ." to reliably get current (rather than cached) contents of the directory.


diff --git a/fs/namei.c b/fs/namei.c
index 48e60a1..f9204af 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1620,6 +1620,8 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	switch (nd->last_type) {
 	case LAST_DOTDOT:
 		follow_dotdot(nd);
+		/* fallthrough */
+	case LAST_DOT:
 		dir = nd->path.dentry;
 		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
 			if (!dir->d_op->d_revalidate(dir, nd)) {
@@ -1627,8 +1629,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 				goto exit;
 			}
 		}
-		/* fallthrough */
-	case LAST_DOT:
 	case LAST_ROOT:
 		if (open_flag & O_CREAT)
 			goto exit;

> 
> The other thing is that we should definitely expect the VFS to call
> nfs_opendir() once it has opened the file.

Oh yes, I see that now.  So we could force a cache revalidation there.
But I'm not sure how to test if this is a mountpoint as you suggest below.

Maybe something like the following.  I'm pretty sure this is wrong as it
ignores the return value of d_revalidate, but I didn't know what to do with
the value.
Al ??


--- a/fs/namei.c
+++ b/fs/namei.c
@@ -720,6 +720,11 @@ done:
 	path->mnt = mnt;
 	path->dentry = dentry;
 	__follow_mount(path);
+	if (path->dentry != dentry)
+		if (path->dentry && path->dentry->d_sb &&
+		    (path->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
+			path->dentry->d_op->d_revalidate(
+				path->dentry, nd);
 	return 0;
 
 need_lookup:

Thanks,
NeilBrown


> 
> > An alternative is to do a revalidation in nfs_readdir as below.  i.e. when
> > readdir see f_pos == 0, it requests a revalidation of the page cache.
> > This has two problems:
> > 1/ a seek before the first read would cause the revalidation to be skipped.
> >    This can be fixed by putting a similar test in nfs_llseek_dir, or maybe
> >    triggering off 'dir_cookie == NULL' rather than 'f_pos == 0'.
> > 2/ A normal open/readdir sequence will validate a directory twice, once in the
> >    lookup and once in the readdir.  This is probably undesirable, but it is
> >    not clear to me how to fix it.
> > 
> > 
> > So: is it reasonable to view the current behaviour as 'wrong'?
> >     any suggestions on how to craft a less problematic fix?
> 
> nfs_opendir() should fix case 1/, but still has the issue with case 2/.
> How about just having it force a revalidation if we see that this is a
> mountpoint?
> 
> Cheers
>   Trond


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-04-21  7:03   ` Neil Brown
@ 2010-05-06  4:13     ` Neil Brown
       [not found]       ` <20100506141347.06451f56-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2010-05-06  4:13 UTC (permalink / raw)
  Cc: Trond Myklebust, linux-nfs, Alexander Viro

On Wed, 21 Apr 2010 17:03:21 +1000
Neil Brown <neilb@suse.de> wrote:

> On Tue, 20 Apr 2010 09:02:01 -0400
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > On Tue, 2010-04-20 at 17:22 +1000, Neil Brown wrote: 
> > > Hi Trond et al,
> > > 
> > > It has come to my attention that NFS directories don't behave consistently
> > > in terms of cache consistency.
> > > 
> > > If, on the client, you have a loop like:
> > > 
> > >  while true; do sleep 1; ls -l $dirname ; done
> > > 
> > > and then on the server you make changes to the named directory, there are
> > > some cases where you will see changes promptly and some where you wont.
> > > 
> > > In particular, if $dirname is '.' or the name of an NFS mountpoint, then
> > > changes can be delayed by up to acdirmax.  If it is any other path, i.e. with
> > > a non-trivial path component that is in the NFS filesystem, then changes
> > > are seen promptly.
> > > 
> > > This seems to me to relate to "close to open" consistency.  Of course with
> > > directories the 'close' side isn't relevant, but I still think it should be
> > > that when you open a directory it validates the 'change' attribute on that
> > > directory over the wire.
> > > 
> > > However the Linux VFS never tells NFS when a directory is opened.  The
> > > current correct behaviour for most directories is achieved through
> > > d_revalidate == nfs_lookup_revalidate.
> > > 
> > > For '.' and mountpoints we need a different approach.  Possibly the VFS could
> > > be changed to tell the filesystem when such a directory is opened.  However I
> > > don't feel up to that at the moment.
> > 
> > I agree that mountpoints are problematic in this case, however why isn't
> > '.' working correctly? Is the FS_REVAL_DOT mechanism broken?
> 
> Yes, the FS_REVAL_DOT mechanism is broken.
> Specifically, when you open ".",  ->d_revalidate is called by link_path_walk,
> but LOOKUP_PARENT is set, and LOOKUP_OPEN is not set, so
> nfs_lookup_verify_inode doesn't force a revalidate.
> 
> Then in do_last(), LOOKUP_PARENT is no longer set, and LOOKUP_OPEN is, but
> do_last doesn't bother calling ->d_revalidate for LAST_DOT.
> 
> I verified this understanding with the following patch which causes 
> "ls ." to reliably get current (rather than cached) contents of the directory.


No replies ... Maybe Al is busy.

I looked at this again, created a patch that I thought looked good and tested
it to ensure it addressed both sides of the problem.

Does it look OK to you Trond?
Thanks.

NFS - ensure directory at end of path is always revalidated.

The FS_REVAL_DOT fs_type flag is meant to ensure that the final component of
a path is always revalidated, even if it isn't a normal (LAST_NORM) path
component (which is always revalidated).
There are two cases where this doesn't happen for NFS
One is where the last component is '.' as the revalidation happens while
 LOOKUP_PARENT is set, so NFS ignores it (see nfs_lookup_check_intent).
The other  is where the directory is a mountpoint, so it is LAST_NORM,
but that directory is different from the mounted directory.

This patches fixes these two issues by 
 1/ extending do_last() to revalidate DOT as well as DOTDOT and
 2/ extending do_lookup() to revalidate after a successful __follow_mount
    if FS_REVAL_DOT is set.

Signed-off-by: NeilBrown <neilb@suse.de>


diff --git a/fs/namei.c b/fs/namei.c
index a7dce91..256ae13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 done:
 	path->mnt = mnt;
 	path->dentry = dentry;
-	__follow_mount(path);
+	if (__follow_mount(path) &&
+	    (path->mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT)) {
+		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
+			return -ESTALE;
+	}
 	return 0;
 
 need_lookup:
@@ -1619,6 +1623,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	switch (nd->last_type) {
 	case LAST_DOTDOT:
 		follow_dotdot(nd);
+	case LAST_DOT:
 		dir = nd->path.dentry;
 		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
 			if (!dir->d_op->d_revalidate(dir, nd)) {
@@ -1627,7 +1632,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 			}
 		}
 		/* fallthrough */
-	case LAST_DOT:
 	case LAST_ROOT:
 		if (open_flag & O_CREAT)
 			goto exit;

> 
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 48e60a1..f9204af 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1620,6 +1620,8 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
>  	switch (nd->last_type) {
>  	case LAST_DOTDOT:
>  		follow_dotdot(nd);
> +		/* fallthrough */
> +	case LAST_DOT:
>  		dir = nd->path.dentry;
>  		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
>  			if (!dir->d_op->d_revalidate(dir, nd)) {
> @@ -1627,8 +1629,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
>  				goto exit;
>  			}
>  		}
> -		/* fallthrough */
> -	case LAST_DOT:
>  	case LAST_ROOT:
>  		if (open_flag & O_CREAT)
>  			goto exit;
> 
> > 
> > The other thing is that we should definitely expect the VFS to call
> > nfs_opendir() once it has opened the file.
> 
> Oh yes, I see that now.  So we could force a cache revalidation there.
> But I'm not sure how to test if this is a mountpoint as you suggest below.
> 
> Maybe something like the following.  I'm pretty sure this is wrong as it
> ignores the return value of d_revalidate, but I didn't know what to do with
> the value.
> Al ??
> 
> 
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -720,6 +720,11 @@ done:
>  	path->mnt = mnt;
>  	path->dentry = dentry;
>  	__follow_mount(path);
> +	if (path->dentry != dentry)
> +		if (path->dentry && path->dentry->d_sb &&
> +		    (path->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
> +			path->dentry->d_op->d_revalidate(
> +				path->dentry, nd);
>  	return 0;
>  
>  need_lookup:
> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > > An alternative is to do a revalidation in nfs_readdir as below.  i.e. when
> > > readdir see f_pos == 0, it requests a revalidation of the page cache.
> > > This has two problems:
> > > 1/ a seek before the first read would cause the revalidation to be skipped.
> > >    This can be fixed by putting a similar test in nfs_llseek_dir, or maybe
> > >    triggering off 'dir_cookie == NULL' rather than 'f_pos == 0'.
> > > 2/ A normal open/readdir sequence will validate a directory twice, once in the
> > >    lookup and once in the readdir.  This is probably undesirable, but it is
> > >    not clear to me how to fix it.
> > > 
> > > 
> > > So: is it reasonable to view the current behaviour as 'wrong'?
> > >     any suggestions on how to craft a less problematic fix?
> > 
> > nfs_opendir() should fix case 1/, but still has the issue with case 2/.
> > How about just having it force a revalidation if we see that this is a
> > mountpoint?
> > 
> > Cheers
> >   Trond
> 


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
       [not found]       ` <20100506141347.06451f56-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2010-05-06 13:58         ` Trond Myklebust
  2010-05-07 22:34           ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Trond Myklebust @ 2010-05-06 13:58 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-nfs, Alexander Viro

On Thu, 2010-05-06 at 14:13 +1000, Neil Brown wrote: 
> On Wed, 21 Apr 2010 17:03:21 +1000
> Neil Brown <neilb@suse.de> wrote:
> 
> > On Tue, 20 Apr 2010 09:02:01 -0400
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > 
> > > On Tue, 2010-04-20 at 17:22 +1000, Neil Brown wrote: 
> > > > Hi Trond et al,
> > > > 
> > > > It has come to my attention that NFS directories don't behave consistently
> > > > in terms of cache consistency.
> > > > 
> > > > If, on the client, you have a loop like:
> > > > 
> > > >  while true; do sleep 1; ls -l $dirname ; done
> > > > 
> > > > and then on the server you make changes to the named directory, there are
> > > > some cases where you will see changes promptly and some where you wont.
> > > > 
> > > > In particular, if $dirname is '.' or the name of an NFS mountpoint, then
> > > > changes can be delayed by up to acdirmax.  If it is any other path, i.e. with
> > > > a non-trivial path component that is in the NFS filesystem, then changes
> > > > are seen promptly.
> > > > 
> > > > This seems to me to relate to "close to open" consistency.  Of course with
> > > > directories the 'close' side isn't relevant, but I still think it should be
> > > > that when you open a directory it validates the 'change' attribute on that
> > > > directory over the wire.
> > > > 
> > > > However the Linux VFS never tells NFS when a directory is opened.  The
> > > > current correct behaviour for most directories is achieved through
> > > > d_revalidate == nfs_lookup_revalidate.
> > > > 
> > > > For '.' and mountpoints we need a different approach.  Possibly the VFS could
> > > > be changed to tell the filesystem when such a directory is opened.  However I
> > > > don't feel up to that at the moment.
> > > 
> > > I agree that mountpoints are problematic in this case, however why isn't
> > > '.' working correctly? Is the FS_REVAL_DOT mechanism broken?
> > 
> > Yes, the FS_REVAL_DOT mechanism is broken.
> > Specifically, when you open ".",  ->d_revalidate is called by link_path_walk,
> > but LOOKUP_PARENT is set, and LOOKUP_OPEN is not set, so
> > nfs_lookup_verify_inode doesn't force a revalidate.
> > 
> > Then in do_last(), LOOKUP_PARENT is no longer set, and LOOKUP_OPEN is, but
> > do_last doesn't bother calling ->d_revalidate for LAST_DOT.
> > 
> > I verified this understanding with the following patch which causes 
> > "ls ." to reliably get current (rather than cached) contents of the directory.
> 
> 
> No replies ... Maybe Al is busy.
> 
> I looked at this again, created a patch that I thought looked good and tested
> it to ensure it addressed both sides of the problem.
> 
> Does it look OK to you Trond?
> Thanks.
> 
> NFS - ensure directory at end of path is always revalidated.
> 
> The FS_REVAL_DOT fs_type flag is meant to ensure that the final component of
> a path is always revalidated, even if it isn't a normal (LAST_NORM) path
> component (which is always revalidated).
> There are two cases where this doesn't happen for NFS
> One is where the last component is '.' as the revalidation happens while
>  LOOKUP_PARENT is set, so NFS ignores it (see nfs_lookup_check_intent).
> The other  is where the directory is a mountpoint, so it is LAST_NORM,
> but that directory is different from the mounted directory.
> 
> This patches fixes these two issues by 
>  1/ extending do_last() to revalidate DOT as well as DOTDOT and
>  2/ extending do_lookup() to revalidate after a successful __follow_mount
>     if FS_REVAL_DOT is set.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index a7dce91..256ae13 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
>  done:
>  	path->mnt = mnt;
>  	path->dentry = dentry;
> -	__follow_mount(path);
> +	if (__follow_mount(path) &&
> +	    (path->mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT)) {
> +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
> +			return -ESTALE;

Won't this prevent you from ever being able to unmount the stale
filesystem?

> +	}
>  	return 0;
>  
>  need_lookup:
> @@ -1619,6 +1623,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
>  	switch (nd->last_type) {
>  	case LAST_DOTDOT:
>  		follow_dotdot(nd);
> +	case LAST_DOT:
>  		dir = nd->path.dentry;
>  		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
>  			if (!dir->d_op->d_revalidate(dir, nd)) {
> @@ -1627,7 +1632,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
>  			}
>  		}
>  		/* fallthrough */
> -	case LAST_DOT:
>  	case LAST_ROOT:
>  		if (open_flag & O_CREAT)
>  			goto exit;
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-05-06 13:58         ` Trond Myklebust
@ 2010-05-07 22:34           ` Neil Brown
  2010-05-08 13:05             ` Chuck Lever
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2010-05-07 22:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, Alexander Viro

On Thu, 06 May 2010 09:58:31 -0400 Trond Myklebust<trond.myklebust@fys.uio.no> wrote:
> > 
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index a7dce91..256ae13 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> >  done:
> >  	path->mnt = mnt;
> >  	path->dentry = dentry;
> > -	__follow_mount(path);
> > +	if (__follow_mount(path) &&
> > +	    (path->mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT)) {
> > +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
> > +			return -ESTALE;
> 
> Won't this prevent you from ever being able to unmount the stale
> filesystem?
> 

Good point - I think you are right.

It seems to me that ->d_revalidate is being used for two distinct, though
related, tasks.
One is to revalidate the dentry - make sure the name still refers to the same
inode.  The other is to revalidate the inode - make sure the cached
attributes are still valid.  In NFS (v3 and v4 at least) these are both
performed by one call so it makes some sense to combine them.  But from the
VFS perspective I would have thought they were quite separate.
nfs_lookup_revalidate sometimes does a GETATTR, and sometimes does a LOOKUP,
depending to some extent on the 'intent' in the nameidata.  I find this makes
it a bit hard to follow what is really happening, or how d_revalidate should
really be used.

Maybe we should just ignore the return value above.  Or maybe d_revalidate
should never do a GETATTR - that should be done by ->open ??

Confused :-(

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-05-07 22:34           ` Neil Brown
@ 2010-05-08 13:05             ` Chuck Lever
  2010-05-08 22:08               ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2010-05-08 13:05 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs, Alexander Viro

On 05/07/2010 06:34 PM, Neil Brown wrote:
> On Thu, 06 May 2010 09:58:31 -0400 Trond Myklebust<trond.myklebust@fys.uio.no>  wrote:
>>>
>>>
>>> diff --git a/fs/namei.c b/fs/namei.c
>>> index a7dce91..256ae13 100644
>>> --- a/fs/namei.c
>>> +++ b/fs/namei.c
>>> @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
>>>   done:
>>>   	path->mnt = mnt;
>>>   	path->dentry = dentry;
>>> -	__follow_mount(path);
>>> +	if (__follow_mount(path)&&
>>> +	    (path->mnt->mnt_sb->s_type->fs_flags&  FS_REVAL_DOT)) {
>>> +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
>>> +			return -ESTALE;
>>
>> Won't this prevent you from ever being able to unmount the stale
>> filesystem?
>>
>
> Good point - I think you are right.
>
> It seems to me that ->d_revalidate is being used for two distinct, though
> related, tasks.
> One is to revalidate the dentry - make sure the name still refers to the same
> inode.  The other is to revalidate the inode - make sure the cached
> attributes are still valid.  In NFS (v3 and v4 at least) these are both
> performed by one call so it makes some sense to combine them.  But from the
> VFS perspective I would have thought they were quite separate.
> nfs_lookup_revalidate sometimes does a GETATTR, and sometimes does a LOOKUP,
> depending to some extent on the 'intent' in the nameidata.  I find this makes
> it a bit hard to follow what is really happening, or how d_revalidate should
> really be used.
>
> Maybe we should just ignore the return value above.  Or maybe d_revalidate
> should never do a GETATTR - that should be done by ->open ??
>
> Confused :-(

The reason for this arrangement is that by the time ->open is called, 
it's too late to return ESTALE.  During the path walk, there is still an 
opportunity to renew the mapping between dentry and inode, should the 
cached value of that mapping be stale.

-- 
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-05-08 13:05             ` Chuck Lever
@ 2010-05-08 22:08               ` Neil Brown
  2010-05-10  2:29                 ` Chuck Lever
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2010-05-08 22:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, linux-nfs, Alexander Viro

On Sat, 08 May 2010 09:05:04 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> On 05/07/2010 06:34 PM, Neil Brown wrote:
> > On Thu, 06 May 2010 09:58:31 -0400 Trond Myklebust<trond.myklebust@fys.uio.no>  wrote:
> >>>
> >>>
> >>> diff --git a/fs/namei.c b/fs/namei.c
> >>> index a7dce91..256ae13 100644
> >>> --- a/fs/namei.c
> >>> +++ b/fs/namei.c
> >>> @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> >>>   done:
> >>>   	path->mnt = mnt;
> >>>   	path->dentry = dentry;
> >>> -	__follow_mount(path);
> >>> +	if (__follow_mount(path)&&
> >>> +	    (path->mnt->mnt_sb->s_type->fs_flags&  FS_REVAL_DOT)) {
> >>> +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
> >>> +			return -ESTALE;
> >>
> >> Won't this prevent you from ever being able to unmount the stale
> >> filesystem?
> >>
> >
> > Good point - I think you are right.
> >
> > It seems to me that ->d_revalidate is being used for two distinct, though
> > related, tasks.
> > One is to revalidate the dentry - make sure the name still refers to the same
> > inode.  The other is to revalidate the inode - make sure the cached
> > attributes are still valid.  In NFS (v3 and v4 at least) these are both
> > performed by one call so it makes some sense to combine them.  But from the
> > VFS perspective I would have thought they were quite separate.
> > nfs_lookup_revalidate sometimes does a GETATTR, and sometimes does a LOOKUP,
> > depending to some extent on the 'intent' in the nameidata.  I find this makes
> > it a bit hard to follow what is really happening, or how d_revalidate should
> > really be used.
> >
> > Maybe we should just ignore the return value above.  Or maybe d_revalidate
> > should never do a GETATTR - that should be done by ->open ??
> >
> > Confused :-(
> 
> The reason for this arrangement is that by the time ->open is called, 
> it's too late to return ESTALE.  During the path walk, there is still an 
> opportunity to renew the mapping between dentry and inode, should the 
> cached value of that mapping be stale.
> 

Is it?
->open seems to be able to return an error code, and that error code seems to
be propagated up.  What am I missing?

I appreciate that a previous lookup might have already done the equivalent of
'open' (e.g. for O_EXCL) but it seems wrong to assume that the last lookup
will always do the GETATTR required for open (because there isn't always a
'last lookup').
I would rather see the lookup (or d_revalidate) remember how much of an open
it has done, and then ->open does the rest.  i.e. it does a GETATTR if that
hasn't already been done.
I don't think there is currently an easy way for 'open' to know how much work
'd_revalidate' or 'lookup' has done, though I suspect that could be fixed.
Maybe pass (as pointer to) the whole intent structure into open rather than
just the intent.open.file ??

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-05-08 22:08               ` Neil Brown
@ 2010-05-10  2:29                 ` Chuck Lever
  2010-05-10  3:01                   ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2010-05-10  2:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs, Alexander Viro

On 05/ 8/10 06:08 PM, Neil Brown wrote:
> On Sat, 08 May 2010 09:05:04 -0400
> Chuck Lever<chuck.lever@oracle.com>  wrote:
>
>> On 05/07/2010 06:34 PM, Neil Brown wrote:
>>> On Thu, 06 May 2010 09:58:31 -0400 Trond Myklebust<trond.myklebust@fys.uio.no>   wrote:
>>>>>
>>>>>
>>>>> diff --git a/fs/namei.c b/fs/namei.c
>>>>> index a7dce91..256ae13 100644
>>>>> --- a/fs/namei.c
>>>>> +++ b/fs/namei.c
>>>>> @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
>>>>>    done:
>>>>>    	path->mnt = mnt;
>>>>>    	path->dentry = dentry;
>>>>> -	__follow_mount(path);
>>>>> +	if (__follow_mount(path)&&
>>>>> +	    (path->mnt->mnt_sb->s_type->fs_flags&   FS_REVAL_DOT)) {
>>>>> +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
>>>>> +			return -ESTALE;
>>>>
>>>> Won't this prevent you from ever being able to unmount the stale
>>>> filesystem?
>>>>
>>>
>>> Good point - I think you are right.
>>>
>>> It seems to me that ->d_revalidate is being used for two distinct, though
>>> related, tasks.
>>> One is to revalidate the dentry - make sure the name still refers to the same
>>> inode.  The other is to revalidate the inode - make sure the cached
>>> attributes are still valid.  In NFS (v3 and v4 at least) these are both
>>> performed by one call so it makes some sense to combine them.  But from the
>>> VFS perspective I would have thought they were quite separate.
>>> nfs_lookup_revalidate sometimes does a GETATTR, and sometimes does a LOOKUP,
>>> depending to some extent on the 'intent' in the nameidata.  I find this makes
>>> it a bit hard to follow what is really happening, or how d_revalidate should
>>> really be used.
>>>
>>> Maybe we should just ignore the return value above.  Or maybe d_revalidate
>>> should never do a GETATTR - that should be done by ->open ??
>>>
>>> Confused :-(
>>
>> The reason for this arrangement is that by the time ->open is called,
>> it's too late to return ESTALE.  During the path walk, there is still an
>> opportunity to renew the mapping between dentry and inode, should the
>> cached value of that mapping be stale.
>>
>
> Is it?
> ->open seems to be able to return an error code, and that error code seems to
> be propagated up.  What am I missing?

"too late to return ESTALE" ... to the VFS layer.  When d_revalidate 
returns ESTALE while a pathname is being resolved, our VFS layer 
re-walks the pathname from the root, passing in a flag that says "please 
don't use our cache, but do each LOOKUP again."

That way, we can rely on the dentry cache most of the time, and have 
some way to recover when file system objects are renamed on the server.

> I appreciate that a previous lookup might have already done the equivalent of
> 'open' (e.g. for O_EXCL) but it seems wrong to assume that the last lookup
> will always do the GETATTR required for open (because there isn't always a
> 'last lookup').

open(".") has always been funky.

I had an argument years ago with someone who pointed out that the POSIX 
standards are written in such a way that it is entirely optional to 
revalidate ".", and so we don't.  I don't recall the specifics.

> I would rather see the lookup (or d_revalidate) remember how much of an open
> it has done, and then ->open does the rest.  i.e. it does a GETATTR if that
> hasn't already been done.
> I don't think there is currently an easy way for 'open' to know how much work
> 'd_revalidate' or 'lookup' has done, though I suspect that could be fixed.
> Maybe pass (as pointer to) the whole intent structure into open rather than
> just the intent.open.file ??

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Should we expect close-to-open consistency on directories?
  2010-05-10  2:29                 ` Chuck Lever
@ 2010-05-10  3:01                   ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2010-05-10  3:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, linux-nfs, Alexander Viro

On Sun, 09 May 2010 22:29:27 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> On 05/ 8/10 06:08 PM, Neil Brown wrote:
> > On Sat, 08 May 2010 09:05:04 -0400
> > Chuck Lever<chuck.lever@oracle.com>  wrote:
> >
> >> On 05/07/2010 06:34 PM, Neil Brown wrote:
> >>> On Thu, 06 May 2010 09:58:31 -0400 Trond Myklebust<trond.myklebust@fys.uio.no>   wrote:
> >>>>>
> >>>>>
> >>>>> diff --git a/fs/namei.c b/fs/namei.c
> >>>>> index a7dce91..256ae13 100644
> >>>>> --- a/fs/namei.c
> >>>>> +++ b/fs/namei.c
> >>>>> @@ -719,7 +719,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
> >>>>>    done:
> >>>>>    	path->mnt = mnt;
> >>>>>    	path->dentry = dentry;
> >>>>> -	__follow_mount(path);
> >>>>> +	if (__follow_mount(path)&&
> >>>>> +	    (path->mnt->mnt_sb->s_type->fs_flags&   FS_REVAL_DOT)) {
> >>>>> +		if (!path->dentry->d_op->d_revalidate(path->dentry, nd))
> >>>>> +			return -ESTALE;
> >>>>
> >>>> Won't this prevent you from ever being able to unmount the stale
> >>>> filesystem?
> >>>>
> >>>
> >>> Good point - I think you are right.
> >>>
> >>> It seems to me that ->d_revalidate is being used for two distinct, though
> >>> related, tasks.
> >>> One is to revalidate the dentry - make sure the name still refers to the same
> >>> inode.  The other is to revalidate the inode - make sure the cached
> >>> attributes are still valid.  In NFS (v3 and v4 at least) these are both
> >>> performed by one call so it makes some sense to combine them.  But from the
> >>> VFS perspective I would have thought they were quite separate.
> >>> nfs_lookup_revalidate sometimes does a GETATTR, and sometimes does a LOOKUP,
> >>> depending to some extent on the 'intent' in the nameidata.  I find this makes
> >>> it a bit hard to follow what is really happening, or how d_revalidate should
> >>> really be used.
> >>>
> >>> Maybe we should just ignore the return value above.  Or maybe d_revalidate
> >>> should never do a GETATTR - that should be done by ->open ??
> >>>
> >>> Confused :-(
> >>
> >> The reason for this arrangement is that by the time ->open is called,
> >> it's too late to return ESTALE.  During the path walk, there is still an
> >> opportunity to renew the mapping between dentry and inode, should the
> >> cached value of that mapping be stale.
> >>
> >
> > Is it?
> > ->open seems to be able to return an error code, and that error code seems to
> > be propagated up.  What am I missing?
> 
> "too late to return ESTALE" ... to the VFS layer.  When d_revalidate 
> returns ESTALE while a pathname is being resolved, our VFS layer 
> re-walks the pathname from the root, passing in a flag that says "please 
> don't use our cache, but do each LOOKUP again."

I see. path_walk does that, the flag being "LOOKUP_REVAL".
But this is just saying that 'open' is too late to validate a name, and I
agree with that.  I want 'open' to revalidate the cached attributes -
particularly in the case where there is no name to revalidated.
i.e. a mount point or '.'.  In those cases we would want to return ESTALE,
not try to redo the lookup (because there is no lookup to do).

> 
> That way, we can rely on the dentry cache most of the time, and have 
> some way to recover when file system objects are renamed on the server.
> 
> > I appreciate that a previous lookup might have already done the equivalent of
> > 'open' (e.g. for O_EXCL) but it seems wrong to assume that the last lookup
> > will always do the GETATTR required for open (because there isn't always a
> > 'last lookup').
> 
> open(".") has always been funky.
> 
> I had an argument years ago with someone who pointed out that the POSIX 
> standards are written in such a way that it is entirely optional to 
> revalidate ".", and so we don't.  I don't recall the specifics.

The rationale I fall back on is that to preserve "close-to-open-coherency"
you need to get the change attribute (aka mtime) when you open something - a
directory in this case.
This isn't so much "revalidating" as "refreshing".
So prima-facie 'open' should always issue a GETATTR and check if the file has
changed.  In practice that isn't necessary in most cases as a d_revalidate
has issued a LOOKUP or a GETATTR and the cached attributes are current.
But in the cases where there is no dentry to revalidate (mountpoint and "."),
something more is needed.

NeilBrown


> 
> > I would rather see the lookup (or d_revalidate) remember how much of an open
> > it has done, and then ->open does the rest.  i.e. it does a GETATTR if that
> > hasn't already been done.
> > I don't think there is currently an easy way for 'open' to know how much work
> > 'd_revalidate' or 'lookup' has done, though I suspect that could be fixed.
> > Maybe pass (as pointer to) the whole intent structure into open rather than
> > just the intent.open.file ??
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-05-10  3:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-20  7:22 [PATCH] Should we expect close-to-open consistency on directories? Neil Brown
2010-04-20 13:02 ` Trond Myklebust
2010-04-21  7:03   ` Neil Brown
2010-05-06  4:13     ` Neil Brown
     [not found]       ` <20100506141347.06451f56-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2010-05-06 13:58         ` Trond Myklebust
2010-05-07 22:34           ` Neil Brown
2010-05-08 13:05             ` Chuck Lever
2010-05-08 22:08               ` Neil Brown
2010-05-10  2:29                 ` Chuck Lever
2010-05-10  3:01                   ` Neil Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.