All of lore.kernel.org
 help / color / mirror / Atom feed
* Thoughts about cache consistency and directories in particular.
@ 2009-02-20  2:18 Neil Brown
  2009-02-20 18:23 ` J. Bruce Fields
       [not found] ` <18846.4842.625445.980681-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 2 replies; 15+ messages in thread
From: Neil Brown @ 2009-02-20  2:18 UTC (permalink / raw)
  To: J. Bruce Fields, Trond Myklebust; +Cc: linux-nfs


Hi,
 I've been thinking about cache consistency, particularly of
 directories, in response to a customer who's NFS was getting confused
 by their usage of "rsync -a" without the "--omit-dir-times" flag:
 A client would see an old copy of a directory and never get a more
 up-to-date copy because the mtime appeared not to change.

 This results in a situation where a directory has wrong data cache
 and there is no way to force that cache to be flushed.

 This contrasts with files where you can always flush the file
 contents by taking a read lock on the file.

 I also came up with a simple way to demonstrate a related caching
 anomaly:

  - Create a localhost mount
  - create a directory
  - "ls -l" the directory via NFS
  - create a file directly
  - look again via NFS.

  The directory will appear empty via NFS but it is not.  And this
  cache anomily does not time out (though memory pressure could
  eventually remove it).

  There is a script below which reproduces both anomalies (providing
  /export is exported and /mnt is available).
 

 Can anything be done about this?


 1/ The client could flush the cache for a directory when ctime
    changes as well as when mtime or size change.
    This would help solve the "rsync -a without --omit-dir-times"
    problem (and also another weird problem I had reported that
    involved strange behaviour from an NetApp filer).
    It might increase the number of READDIR requests in some cases.
    Would that be enough of an increase to be a real problem?
    It would be no worse than NFSv4 which - as the Linux NFS server
    uses ctime to produce the changeattr - refreshes both directories
    and files when the ctime changes.


 2/ The server could lie about the mtime.
    In particular, if the mtime for a file was the same as the current
    time - to the granularity of the filesystem storing the file -
    then reduce the mtime that is reported by the smallest difference that
    can be reported by the protocol.
    That would be one microsecond for v2, and one nanosecond for v3
    and v4.

    This is something I've thought about (and probably muttered about)
    in various forms at various times over the years, but this time I
    think I am actually happy with the formulation of the solution and
    want to push forward with it.



 Option 1, by itself, would mostly resolve the rsync issue and have
 no effect on my little test case.
 Option 2 by itself would have no effect on the rsync issue but would
 nicely resolve my little test cache.
 Together they should significantly reduce the number of caching
 anomalies.

 Below is the test script, then two patches, one for NFS and one for
 NFSD.  These can also be pulled from 
     git://neil.brown.name/linux-2.6 nfs-devel


 Thoughts?


Thanks,
NeilBrown


------------------------------------------------
mount localhost:/export /mnt
rm -rf /export/adir
mkdir /export/adir

for i in `seq 1 100`
do mkdir /export/adir/$i 
     ls -l /mnt/adir/$i  > /dev/null 2>&1
     echo hello > /export/adir/$i/afile ;
done
echo -n "Files in directory: "
find /export/adir | wc -l
echo -n "Files visible to NFS: "
find /mnt/adir | wc -l

echo ; echo 
rm -rf /export/bdir
mkdir /export/bdir
> /export/bfile
touch -r /export/bfile /export/bdir
sleep 2
ls -l /mnt/bdir > /dev/null
sleep 2
> /export/bdir/bfile
touch -r /export/bfile /export/bdir
echo "Files in directory:"
ls -l /export/bdir
echo "Files visible to NFS:"
ls -l /mnt/bdir
umount /mnt
------------------------------------------------


From: NeilBrown <neilb@suse.de>
Subject: [PATCH 1/2] nfs: flush cached directory information slightly more
	readily.
Date: Fri, 20 Feb 2009 12:57:07 +1100
Message-ID: <20090220015706.12400.98909.stgit@notabene.brown>
User-Agent: StGIT/0.14.2
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

If cached directory contents becomes incorrect, there is no way to
flush the contents.  This contrasts with files where file locking is
the recommended way to ensure cache consistency between multiple
applications (a read-lock always flushes the cache).

Also while changes to files often change the size of the file (thus
triggering a cache flush), changes to directories often do not change
the apparent size (as the size is often rounded to a block size).

So it is particularly important with directories to avoid the
possibility of an incorrect cache wherever possible.

When the link count on a directory changes it implies (or should
imply) a change in the number of child directories, and so a change
the contents of this directory.  So use that as a trigger to flush
cached contents.

When the ctime changes but the mtime does not, there are two possible
reasons.
 1/ The owner/mode information has been changed.
 2/ utimes has been used to set the mtime backwards.

In the first case, a data-cache flush is not required.
In the second case it is.

So on th basis that correctness trumps performance, flush the
directory contents cache in this case also.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 fs/nfs/inode.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 0c38168..09c1300 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1116,8 +1116,14 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 				nfs_force_lookup_revalidate(inode);
 		}
 		/* If ctime has changed we should definitely clear access+acl caches */
-		if (!timespec_equal(&inode->i_ctime, &fattr->ctime))
+		if (!timespec_equal(&inode->i_ctime, &fattr->ctime)) {
 			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
+			/* and probably clear data for a directory too as utimes can cause
+			 * havoc with our cache.
+			 */
+			if (S_ISDIR(inode->i_mode))
+				invalid |= NFS_INO_INVALID_DATA;
+		}
 	} else if (nfsi->change_attr != fattr->change_attr) {
 		dprintk("NFS: change_attr change on server for file %s/%ld\n",
 				inode->i_sb->s_id, inode->i_ino);
@@ -1151,8 +1157,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 	    inode->i_gid != fattr->gid)
 		invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
 
-	if (inode->i_nlink != fattr->nlink)
+	if (inode->i_nlink != fattr->nlink) {
 		invalid |= NFS_INO_INVALID_ATTR;
+		if (S_ISDIR(inode->i_mode))
+			invalid |= NFS_INO_INVALID_DATA;
+	}
 
 	inode->i_mode = fattr->mode;
 	inode->i_nlink = fattr->nlink;


From: NeilBrown <neilb@suse.de>
Subject: [PATCH 2/2] nfsd: be creative with 'mtime' so that NFS client can
	reliably discover changes
Date: Fri, 20 Feb 2009 12:57:07 +1100
Message-ID: <20090220015707.12400.80500.stgit@notabene.brown>
In-Reply-To: <20090220015706.12400.98909.stgit@notabene.brown>
References: <20090220015706.12400.98909.stgit@notabene.brown>
User-Agent: StGIT/0.14.2
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

For NFSv2 and v3, mtime is used as the main mechanism for cache
flushing - when mtime changes, we flush the cache.

However with many filesystems the granularity of mtime is one second,
and multiple changes can occur in one second.

This make it unwise to ever return "now" for the mtime of a file, as
quite some time later there may have been a change since that mtime
was returned, but the mtime will still be the same.

So if the mtime is "now" - to the granularity of the filesystem,
adjust it backwards the smallest possible amount.   This when a
subsequent GETATTR checks the mtime, it will no longer be 'now', so
the correct mtime will be returned, and the client will notice that
the file could have changed.

For NFSv4 we do a similar thing, but rather change the 'changeid'
attribute which is based on 'ctime'.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 fs/nfsd/nfs3xdr.c         |    2 +-
 fs/nfsd/nfs4xdr.c         |    1 +
 fs/nfsd/nfsxdr.c          |    2 +-
 include/linux/nfsd/xdr.h  |   25 +++++++++++++++++++++++++
 include/linux/nfsd/xdr4.h |   26 ++++++++++++++++++++++++++
 5 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 17d0dd9..139500b 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -221,7 +221,7 @@ encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
 		err = vfs_getattr(fhp->fh_export->ex_path.mnt, dentry, &stat);
 		if (!err) {
 			*p++ = xdr_one;		/* attributes follow */
-			lease_get_mtime(dentry->d_inode, &stat.mtime);
+			nfsd_get_mtime(dentry->d_inode, &stat.mtime, 1);
 			return encode_fattr3(rqstp, p, fhp, &stat);
 		}
 	}
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index f65953b..6bd7339 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1518,6 +1518,7 @@ nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
 		 */
 		if ((buflen -= 8) < 0)
 			goto out_resource;
+		nfs4svc_adjust_mtime(fhp, &stat.ctime);
 		WRITE32(stat.ctime.tv_sec);
 		WRITE32(stat.ctime.tv_nsec);
 	}
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index afd08e2..e266d42 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -194,7 +194,7 @@ encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp,
 	*p++ = htonl((u32) stat->ino);
 	*p++ = htonl((u32) stat->atime.tv_sec);
 	*p++ = htonl(stat->atime.tv_nsec ? stat->atime.tv_nsec / 1000 : 0);
-	lease_get_mtime(dentry->d_inode, &time); 
+	nfsd_get_mtime(dentry->d_inode, &time, 1000); 
 	*p++ = htonl((u32) time.tv_sec);
 	*p++ = htonl(time.tv_nsec ? time.tv_nsec / 1000 : 0); 
 	*p++ = htonl((u32) stat->ctime.tv_sec);
diff --git a/include/linux/nfsd/xdr.h b/include/linux/nfsd/xdr.h
index a0132ef..b1a0f5d 100644
--- a/include/linux/nfsd/xdr.h
+++ b/include/linux/nfsd/xdr.h
@@ -174,4 +174,29 @@ int nfssvc_release_fhandle(struct svc_rqst *, __be32 *, struct nfsd_fhandle *);
 __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp);
 __be32 *nfs2svc_decode_fh(__be32 *p, struct svc_fh *fhp);
 
+/* NFSv3 and v2 use mtime to guide caching of file/directory contents.
+ * If a client asks for the mtime of a file and the answer is "now",
+ * then subsequent changes to the file could happen but never be reflected
+ * in a changed mtime.
+ * So in that case, reduce mtime by the smallest value representable in
+ * the protocol.  A subsequent GETATTR will show the mtime having changed
+ * slightly so the client will refresh its cache and be sure to get any
+ * changes.
+ */
+static inline void nfsd_get_mtime(struct inode *inode, struct timespec *time, int nanoseconds)
+{
+	struct timespec now;
+	lease_get_mtime(inode, time);
+	now = current_fs_time(inode->i_sb);
+	if (time->tv_sec == now.tv_sec &&
+	    time->tv_nsec == now.tv_nsec) {
+		if (time->tv_nsec >= nanoseconds)
+			time->tv_nsec -= nanoseconds;
+		else {
+			time->tv_nsec += 1000000000 - nanoseconds;
+			time->tv_sec -= 1;
+		}
+	}
+}
+	
 #endif /* LINUX_NFSD_H */
diff --git a/include/linux/nfsd/xdr4.h b/include/linux/nfsd/xdr4.h
index 27bd3e3..7d2be49 100644
--- a/include/linux/nfsd/xdr4.h
+++ b/include/linux/nfsd/xdr4.h
@@ -422,10 +422,36 @@ struct nfsd4_compoundres {
 #define NFS4_SVC_XDRSIZE		sizeof(struct nfsd4_compoundargs)
 
 static inline void
+nfs4svc_adjust_mtime(struct svc_fh *fhp, struct timespec *time)
+{
+	/* because we use ctime for caching, and because the granularity
+	 * of ctime might be fairly large (e.g. 1 second) it is unsafe
+	 * to ever report 'now' as the changeid - as then we might report
+	 * two change ids that are the same despite there being a change
+	 * to the file between them, and then continue to report that
+	 * same change-id for a long time.
+	 * So if the changeid would correspond to 'now', change it to
+	 * one nanosecond ago so that after 'now' as moved on, the client
+	 * will be able to see a new changeid and will update its cache.
+	 */
+	struct timespec now;
+	now = current_fs_time(fhp->fh_dentry->d_inode->i_sb);
+	if (time->tv_sec == now.tv_sec &&
+	    time->tv_nsec == now.tv_nsec) {
+		if (time->tv_nsec)
+			time->tv_nsec--;
+		else {
+			time->tv_nsec = 999999999;
+			time->tv_sec --;
+		}
+	}
+}
+static inline void
 set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
 	BUG_ON(!fhp->fh_pre_saved || !fhp->fh_post_saved);
 	cinfo->atomic = 1;
+	nfs4svc_adjust_mtime(fhp, &fhp->fh_pre_ctime);
 	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
 	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
 	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-02-20  2:18 Thoughts about cache consistency and directories in particular Neil Brown
@ 2009-02-20 18:23 ` J. Bruce Fields
  2009-02-20 19:47   ` Neil Brown
       [not found] ` <18846.4842.625445.980681-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2009-02-20 18:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs

On Fri, Feb 20, 2009 at 01:18:18PM +1100, Neil Brown wrote:
> 
> Hi,
>  I've been thinking about cache consistency, particularly of
>  directories, in response to a customer who's NFS was getting confused
>  by their usage of "rsync -a" without the "--omit-dir-times" flag:
>  A client would see an old copy of a directory and never get a more
>  up-to-date copy because the mtime appeared not to change.
> 
>  This results in a situation where a directory has wrong data cache
>  and there is no way to force that cache to be flushed.
> 
>  This contrasts with files where you can always flush the file
>  contents by taking a read lock on the file.
> 
>  I also came up with a simple way to demonstrate a related caching
>  anomaly:
> 
>   - Create a localhost mount
>   - create a directory
>   - "ls -l" the directory via NFS
>   - create a file directly

(And do this within a second (or a jiffy, depending on filesystem) of
the directory creation?)

>   - look again via NFS.

> 
>   The directory will appear empty via NFS but it is not.  And this
>   cache anomily does not time out (though memory pressure could
>   eventually remove it).
> 
>   There is a script below which reproduces both anomalies (providing
>   /export is exported and /mnt is available).
>  
> 
>  Can anything be done about this?
> 
> 
>  1/ The client could flush the cache for a directory when ctime
>     changes as well as when mtime or size change.
>     This would help solve the "rsync -a without --omit-dir-times"
>     problem (and also another weird problem I had reported that
>     involved strange behaviour from an NetApp filer).
>     It might increase the number of READDIR requests in some cases.
>     Would that be enough of an increase to be a real problem?
>     It would be no worse than NFSv4 which - as the Linux NFS server
>     uses ctime to produce the changeattr - refreshes both directories
>     and files when the ctime changes.

This has come up before, but I can't remember what the argument was
against it....

>  2/ The server could lie about the mtime.
>     In particular, if the mtime for a file was the same as the current
>     time - to the granularity of the filesystem storing the file -
>     then reduce the mtime that is reported by the smallest difference that
>     can be reported by the protocol.
>     That would be one microsecond for v2, and one nanosecond for v3
>     and v4.

Assume for simplicity's sake the time granularity is a second, and
measure time in seconds in the following examples:

Your proposal offers an improvement in this example (currently,
subsequent getattrs will not reflect the final modification):

	t=0.1 modify
	t=0.2 getattr
	t=0.3 modify

Your proposal causes a regression in the following example:

	t=0.1 modify
	t=0.2 getattr
	t=1.1 modify 
	t=1.2 modify

I think the argument you'd like to make is that the regressions are
rarer than the improvements.  One way to make that argument would be to
completely characterize the two sets of cases (cases where you get an
improvement, and cases where you get a regression), and then somehow
convince us of their relative probabilities.  I think that will be
difficult, and the result may not be what you expect.

Also, the proposed behavior is more complicated (hence harder to
explain to users seeing the bug, and perhaps harder to work around?), so
the bar for the improvement should be set pretty high....

By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
changeattribute, which needs to be hooked up to the nfsd code.

--b.

>     This is something I've thought about (and probably muttered about)
>     in various forms at various times over the years, but this time I
>     think I am actually happy with the formulation of the solution and
>     want to push forward with it.
> 
> 
> 
>  Option 1, by itself, would mostly resolve the rsync issue and have
>  no effect on my little test case.
>  Option 2 by itself would have no effect on the rsync issue but would
>  nicely resolve my little test cache.
>  Together they should significantly reduce the number of caching
>  anomalies.
> 
>  Below is the test script, then two patches, one for NFS and one for
>  NFSD.  These can also be pulled from 
>      git://neil.brown.name/linux-2.6 nfs-devel
> 
> 
>  Thoughts?
> 
> 
> Thanks,
> NeilBrown
> 
> 
> ------------------------------------------------
> mount localhost:/export /mnt
> rm -rf /export/adir
> mkdir /export/adir
> 
> for i in `seq 1 100`
> do mkdir /export/adir/$i 
>      ls -l /mnt/adir/$i  > /dev/null 2>&1
>      echo hello > /export/adir/$i/afile ;
> done
> echo -n "Files in directory: "
> find /export/adir | wc -l
> echo -n "Files visible to NFS: "
> find /mnt/adir | wc -l
> 
> echo ; echo 
> rm -rf /export/bdir
> mkdir /export/bdir
> > /export/bfile
> touch -r /export/bfile /export/bdir
> sleep 2
> ls -l /mnt/bdir > /dev/null
> sleep 2
> > /export/bdir/bfile
> touch -r /export/bfile /export/bdir
> echo "Files in directory:"
> ls -l /export/bdir
> echo "Files visible to NFS:"
> ls -l /mnt/bdir
> umount /mnt
> ------------------------------------------------
> 
> 
> From: NeilBrown <neilb@suse.de>
> Subject: [PATCH 1/2] nfs: flush cached directory information slightly more
> 	readily.
> Date: Fri, 20 Feb 2009 12:57:07 +1100
> Message-ID: <20090220015706.12400.98909.stgit@notabene.brown>
> User-Agent: StGIT/0.14.2
> MIME-Version: 1.0
> Content-Type: text/plain; charset="utf-8"
> Content-Transfer-Encoding: 7bit
> 
> If cached directory contents becomes incorrect, there is no way to
> flush the contents.  This contrasts with files where file locking is
> the recommended way to ensure cache consistency between multiple
> applications (a read-lock always flushes the cache).
> 
> Also while changes to files often change the size of the file (thus
> triggering a cache flush), changes to directories often do not change
> the apparent size (as the size is often rounded to a block size).
> 
> So it is particularly important with directories to avoid the
> possibility of an incorrect cache wherever possible.
> 
> When the link count on a directory changes it implies (or should
> imply) a change in the number of child directories, and so a change
> the contents of this directory.  So use that as a trigger to flush
> cached contents.
> 
> When the ctime changes but the mtime does not, there are two possible
> reasons.
>  1/ The owner/mode information has been changed.
>  2/ utimes has been used to set the mtime backwards.
> 
> In the first case, a data-cache flush is not required.
> In the second case it is.
> 
> So on th basis that correctness trumps performance, flush the
> directory contents cache in this case also.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
>  fs/nfs/inode.c |   13 +++++++++++--
>  1 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 0c38168..09c1300 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1116,8 +1116,14 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>  				nfs_force_lookup_revalidate(inode);
>  		}
>  		/* If ctime has changed we should definitely clear access+acl caches */
> -		if (!timespec_equal(&inode->i_ctime, &fattr->ctime))
> +		if (!timespec_equal(&inode->i_ctime, &fattr->ctime)) {
>  			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
> +			/* and probably clear data for a directory too as utimes can cause
> +			 * havoc with our cache.
> +			 */
> +			if (S_ISDIR(inode->i_mode))
> +				invalid |= NFS_INO_INVALID_DATA;
> +		}
>  	} else if (nfsi->change_attr != fattr->change_attr) {
>  		dprintk("NFS: change_attr change on server for file %s/%ld\n",
>  				inode->i_sb->s_id, inode->i_ino);
> @@ -1151,8 +1157,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>  	    inode->i_gid != fattr->gid)
>  		invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
>  
> -	if (inode->i_nlink != fattr->nlink)
> +	if (inode->i_nlink != fattr->nlink) {
>  		invalid |= NFS_INO_INVALID_ATTR;
> +		if (S_ISDIR(inode->i_mode))
> +			invalid |= NFS_INO_INVALID_DATA;
> +	}
>  
>  	inode->i_mode = fattr->mode;
>  	inode->i_nlink = fattr->nlink;
> 
> 
> From: NeilBrown <neilb@suse.de>
> Subject: [PATCH 2/2] nfsd: be creative with 'mtime' so that NFS client can
> 	reliably discover changes
> Date: Fri, 20 Feb 2009 12:57:07 +1100
> Message-ID: <20090220015707.12400.80500.stgit@notabene.brown>
> In-Reply-To: <20090220015706.12400.98909.stgit@notabene.brown>
> References: <20090220015706.12400.98909.stgit@notabene.brown>
> User-Agent: StGIT/0.14.2
> MIME-Version: 1.0
> Content-Type: text/plain; charset="utf-8"
> Content-Transfer-Encoding: 7bit
> 
> For NFSv2 and v3, mtime is used as the main mechanism for cache
> flushing - when mtime changes, we flush the cache.
> 
> However with many filesystems the granularity of mtime is one second,
> and multiple changes can occur in one second.
> 
> This make it unwise to ever return "now" for the mtime of a file, as
> quite some time later there may have been a change since that mtime
> was returned, but the mtime will still be the same.
> 
> So if the mtime is "now" - to the granularity of the filesystem,
> adjust it backwards the smallest possible amount.   This when a
> subsequent GETATTR checks the mtime, it will no longer be 'now', so
> the correct mtime will be returned, and the client will notice that
> the file could have changed.
> 
> For NFSv4 we do a similar thing, but rather change the 'changeid'
> attribute which is based on 'ctime'.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
>  fs/nfsd/nfs3xdr.c         |    2 +-
>  fs/nfsd/nfs4xdr.c         |    1 +
>  fs/nfsd/nfsxdr.c          |    2 +-
>  include/linux/nfsd/xdr.h  |   25 +++++++++++++++++++++++++
>  include/linux/nfsd/xdr4.h |   26 ++++++++++++++++++++++++++
>  5 files changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 17d0dd9..139500b 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -221,7 +221,7 @@ encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
>  		err = vfs_getattr(fhp->fh_export->ex_path.mnt, dentry, &stat);
>  		if (!err) {
>  			*p++ = xdr_one;		/* attributes follow */
> -			lease_get_mtime(dentry->d_inode, &stat.mtime);
> +			nfsd_get_mtime(dentry->d_inode, &stat.mtime, 1);
>  			return encode_fattr3(rqstp, p, fhp, &stat);
>  		}
>  	}
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index f65953b..6bd7339 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1518,6 +1518,7 @@ nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
>  		 */
>  		if ((buflen -= 8) < 0)
>  			goto out_resource;
> +		nfs4svc_adjust_mtime(fhp, &stat.ctime);
>  		WRITE32(stat.ctime.tv_sec);
>  		WRITE32(stat.ctime.tv_nsec);
>  	}
> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> index afd08e2..e266d42 100644
> --- a/fs/nfsd/nfsxdr.c
> +++ b/fs/nfsd/nfsxdr.c
> @@ -194,7 +194,7 @@ encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp,
>  	*p++ = htonl((u32) stat->ino);
>  	*p++ = htonl((u32) stat->atime.tv_sec);
>  	*p++ = htonl(stat->atime.tv_nsec ? stat->atime.tv_nsec / 1000 : 0);
> -	lease_get_mtime(dentry->d_inode, &time); 
> +	nfsd_get_mtime(dentry->d_inode, &time, 1000); 
>  	*p++ = htonl((u32) time.tv_sec);
>  	*p++ = htonl(time.tv_nsec ? time.tv_nsec / 1000 : 0); 
>  	*p++ = htonl((u32) stat->ctime.tv_sec);
> diff --git a/include/linux/nfsd/xdr.h b/include/linux/nfsd/xdr.h
> index a0132ef..b1a0f5d 100644
> --- a/include/linux/nfsd/xdr.h
> +++ b/include/linux/nfsd/xdr.h
> @@ -174,4 +174,29 @@ int nfssvc_release_fhandle(struct svc_rqst *, __be32 *, struct nfsd_fhandle *);
>  __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp);
>  __be32 *nfs2svc_decode_fh(__be32 *p, struct svc_fh *fhp);
>  
> +/* NFSv3 and v2 use mtime to guide caching of file/directory contents.
> + * If a client asks for the mtime of a file and the answer is "now",
> + * then subsequent changes to the file could happen but never be reflected
> + * in a changed mtime.
> + * So in that case, reduce mtime by the smallest value representable in
> + * the protocol.  A subsequent GETATTR will show the mtime having changed
> + * slightly so the client will refresh its cache and be sure to get any
> + * changes.
> + */
> +static inline void nfsd_get_mtime(struct inode *inode, struct timespec *time, int nanoseconds)
> +{
> +	struct timespec now;
> +	lease_get_mtime(inode, time);
> +	now = current_fs_time(inode->i_sb);
> +	if (time->tv_sec == now.tv_sec &&
> +	    time->tv_nsec == now.tv_nsec) {
> +		if (time->tv_nsec >= nanoseconds)
> +			time->tv_nsec -= nanoseconds;
> +		else {
> +			time->tv_nsec += 1000000000 - nanoseconds;
> +			time->tv_sec -= 1;
> +		}
> +	}
> +}
> +	
>  #endif /* LINUX_NFSD_H */
> diff --git a/include/linux/nfsd/xdr4.h b/include/linux/nfsd/xdr4.h
> index 27bd3e3..7d2be49 100644
> --- a/include/linux/nfsd/xdr4.h
> +++ b/include/linux/nfsd/xdr4.h
> @@ -422,10 +422,36 @@ struct nfsd4_compoundres {
>  #define NFS4_SVC_XDRSIZE		sizeof(struct nfsd4_compoundargs)
>  
>  static inline void
> +nfs4svc_adjust_mtime(struct svc_fh *fhp, struct timespec *time)
> +{
> +	/* because we use ctime for caching, and because the granularity
> +	 * of ctime might be fairly large (e.g. 1 second) it is unsafe
> +	 * to ever report 'now' as the changeid - as then we might report
> +	 * two change ids that are the same despite there being a change
> +	 * to the file between them, and then continue to report that
> +	 * same change-id for a long time.
> +	 * So if the changeid would correspond to 'now', change it to
> +	 * one nanosecond ago so that after 'now' as moved on, the client
> +	 * will be able to see a new changeid and will update its cache.
> +	 */
> +	struct timespec now;
> +	now = current_fs_time(fhp->fh_dentry->d_inode->i_sb);
> +	if (time->tv_sec == now.tv_sec &&
> +	    time->tv_nsec == now.tv_nsec) {
> +		if (time->tv_nsec)
> +			time->tv_nsec--;
> +		else {
> +			time->tv_nsec = 999999999;
> +			time->tv_sec --;
> +		}
> +	}
> +}
> +static inline void
>  set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
>  	BUG_ON(!fhp->fh_pre_saved || !fhp->fh_post_saved);
>  	cinfo->atomic = 1;
> +	nfs4svc_adjust_mtime(fhp, &fhp->fh_pre_ctime);
>  	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
>  	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
>  	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found] ` <18846.4842.625445.980681-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-02-20 18:56   ` Trond Myklebust
  2009-02-20 19:52     ` Neil Brown
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2009-02-20 18:56 UTC (permalink / raw)
  To: Neil Brown; +Cc: J. Bruce Fields, linux-nfs

On Fri, 2009-02-20 at 13:18 +1100, Neil Brown wrote:
> Hi,
>  I've been thinking about cache consistency, particularly of
>  directories, in response to a customer who's NFS was getting confused
>  by their usage of "rsync -a" without the "--omit-dir-times" flag:
>  A client would see an old copy of a directory and never get a more
>  up-to-date copy because the mtime appeared not to change.
> 
>  This results in a situation where a directory has wrong data cache
>  and there is no way to force that cache to be flushed.
> 
>  This contrasts with files where you can always flush the file
>  contents by taking a read lock on the file.
> 
>  I also came up with a simple way to demonstrate a related caching
>  anomaly:
> 
>   - Create a localhost mount
>   - create a directory
>   - "ls -l" the directory via NFS
>   - create a file directly
>   - look again via NFS.
> 
>   The directory will appear empty via NFS but it is not.  And this
>   cache anomily does not time out (though memory pressure could
>   eventually remove it).
> 
>   There is a script below which reproduces both anomalies (providing
>   /export is exported and /mnt is available).
>  
> 
>  Can anything be done about this?
> 
> 
>  1/ The client could flush the cache for a directory when ctime
>     changes as well as when mtime or size change.
>     This would help solve the "rsync -a without --omit-dir-times"
>     problem (and also another weird problem I had reported that
>     involved strange behaviour from an NetApp filer).
>     It might increase the number of READDIR requests in some cases.
>     Would that be enough of an increase to be a real problem?
>     It would be no worse than NFSv4 which - as the Linux NFS server
>     uses ctime to produce the changeattr - refreshes both directories
>     and files when the ctime changes.

It should work fine. The ctime tracks the mtime in all cases except when
you setacl, setfattr, chown, chgrp, chmod, or touch the directory. Those
should be very rare operations for pretty much any workload...

>  2/ The server could lie about the mtime.
>     In particular, if the mtime for a file was the same as the current
>     time - to the granularity of the filesystem storing the file -
>     then reduce the mtime that is reported by the smallest difference that
>     can be reported by the protocol.
>     That would be one microsecond for v2, and one nanosecond for v3
>     and v4.
> 
>     This is something I've thought about (and probably muttered about)
>     in various forms at various times over the years, but this time I
>     think I am actually happy with the formulation of the solution and
>     want to push forward with it.
> 
> 
> 
>  Option 1, by itself, would mostly resolve the rsync issue and have
>  no effect on my little test case.
>  Option 2 by itself would have no effect on the rsync issue but would
>  nicely resolve my little test cache.
>  Together they should significantly reduce the number of caching
>  anomalies.

I'm assuming that option 2 applies to the ctime as well as the mtime,
otherwise applying option 1 will void the effects of option 2?

Note also that the client now has the 'lookupcache' mount option that
can be set to ensure stricter revalidation of lookups.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-02-20 18:23 ` J. Bruce Fields
@ 2009-02-20 19:47   ` Neil Brown
       [not found]     ` <18847.2286.101191.989726-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2009-02-20 19:47 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Trond Myklebust, linux-nfs

On Friday February 20, bfields@fieldses.org wrote:
> On Fri, Feb 20, 2009 at 01:18:18PM +1100, Neil Brown wrote:
> >  I also came up with a simple way to demonstrate a related caching
> >  anomaly:
> > 
> >   - Create a localhost mount
> >   - create a directory
> >   - "ls -l" the directory via NFS
> >   - create a file directly
> 
> (And do this within a second (or a jiffy, depending on filesystem) of
> the directory creation?)

Sorry, I meant to include that detail -  Yes.


> 
> >   - look again via NFS.
> 
> > 
> >   The directory will appear empty via NFS but it is not.  And this
> >   cache anomily does not time out (though memory pressure could
> >   eventually remove it).
> > 
> >   There is a script below which reproduces both anomalies (providing
> >   /export is exported and /mnt is available).
> >  
> > 
> >  Can anything be done about this?
> > 
> > 
> >  1/ The client could flush the cache for a directory when ctime
> >     changes as well as when mtime or size change.
> >     This would help solve the "rsync -a without --omit-dir-times"
> >     problem (and also another weird problem I had reported that
> >     involved strange behaviour from an NetApp filer).
> >     It might increase the number of READDIR requests in some cases.
> >     Would that be enough of an increase to be a real problem?
> >     It would be no worse than NFSv4 which - as the Linux NFS server
> >     uses ctime to produce the changeattr - refreshes both directories
> >     and files when the ctime changes.
> 
> This has come up before, but I can't remember what the argument was
> against it....

Trond seems happy with it now.  And the NFSv4 server effectively
imposes it.  So maybe there are no remaining arguments against it ??

> 
> >  2/ The server could lie about the mtime.
> >     In particular, if the mtime for a file was the same as the current
> >     time - to the granularity of the filesystem storing the file -
> >     then reduce the mtime that is reported by the smallest difference that
> >     can be reported by the protocol.
> >     That would be one microsecond for v2, and one nanosecond for v3
> >     and v4.
> 
> Assume for simplicity's sake the time granularity is a second, and
> measure time in seconds in the following examples:
> 
> Your proposal offers an improvement in this example (currently,
> subsequent getattrs will not reflect the final modification):
> 
> 	t=0.1 modify
> 	t=0.2 getattr
> 	t=0.3 modify
> 
> Your proposal causes a regression in the following example:
> 
> 	t=0.1 modify
> 	t=0.2 getattr
> 	t=1.1 modify 
> 	t=1.2 modify
> 

I cannot see how there is a regression here.  Subsequent getattrs will
show all modifications (if you wait at least one second).
The first gettattr returns '-0.000000001', which is different from any
previously returned mtime.
Any subsequent getattr will return 0.999999999 or 1, depending on when
it arrives.

The only possible regression is that sometimes we will flush the cache
when previously we didn't.  In each case where that changes, the
client can not possible know whether it needs to or not, so flushing
rather than not flushing is the safest option.

> 
> By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
> changeattribute, which needs to be hooked up to the nfsd code.

Does it?
I just had a quick look, found that it stores a 64 bit number on disk
which is stored in inode->i_version.
And this is incremented for directory operations.  But it doesn't seem
to be changed for file operations.

But maybe I missed something.


Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-02-20 18:56   ` Trond Myklebust
@ 2009-02-20 19:52     ` Neil Brown
       [not found]       ` <18847.2578.480148.216735-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2009-02-20 19:52 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, linux-nfs

On Friday February 20, trond.myklebust@fys.uio.no wrote:
> On Fri, 2009-02-20 at 13:18 +1100, Neil Brown wrote:
> > Hi,
> >  I've been thinking about cache consistency, particularly of
> >  directories, in response to a customer who's NFS was getting confused
> >  by their usage of "rsync -a" without the "--omit-dir-times" flag:
> >  A client would see an old copy of a directory and never get a more
> >  up-to-date copy because the mtime appeared not to change.
> > 
> >  This results in a situation where a directory has wrong data cache
> >  and there is no way to force that cache to be flushed.
> > 
> >  This contrasts with files where you can always flush the file
> >  contents by taking a read lock on the file.
> > 
> >  I also came up with a simple way to demonstrate a related caching
> >  anomaly:
> > 
> >   - Create a localhost mount
> >   - create a directory
> >   - "ls -l" the directory via NFS
> >   - create a file directly
> >   - look again via NFS.
> > 
> >   The directory will appear empty via NFS but it is not.  And this
> >   cache anomily does not time out (though memory pressure could
> >   eventually remove it).
> > 
> >   There is a script below which reproduces both anomalies (providing
> >   /export is exported and /mnt is available).
> >  
> > 
> >  Can anything be done about this?
> > 
> > 
> >  1/ The client could flush the cache for a directory when ctime
> >     changes as well as when mtime or size change.
> >     This would help solve the "rsync -a without --omit-dir-times"
> >     problem (and also another weird problem I had reported that
> >     involved strange behaviour from an NetApp filer).
> >     It might increase the number of READDIR requests in some cases.
> >     Would that be enough of an increase to be a real problem?
> >     It would be no worse than NFSv4 which - as the Linux NFS server
> >     uses ctime to produce the changeattr - refreshes both directories
> >     and files when the ctime changes.
> 
> It should work fine. The ctime tracks the mtime in all cases except when
> you setacl, setfattr, chown, chgrp, chmod, or touch the directory. Those
> should be very rare operations for pretty much any workload...
> 

Does that mean you'll take the patch ??

> >  2/ The server could lie about the mtime.
> >     In particular, if the mtime for a file was the same as the current
> >     time - to the granularity of the filesystem storing the file -
> >     then reduce the mtime that is reported by the smallest difference that
> >     can be reported by the protocol.
> >     That would be one microsecond for v2, and one nanosecond for v3
> >     and v4.
> > 
> >     This is something I've thought about (and probably muttered about)
> >     in various forms at various times over the years, but this time I
> >     think I am actually happy with the formulation of the solution and
> >     want to push forward with it.
> > 
> > 
> > 
> >  Option 1, by itself, would mostly resolve the rsync issue and have
> >  no effect on my little test case.
> >  Option 2 by itself would have no effect on the rsync issue but would
> >  nicely resolve my little test cache.
> >  Together they should significantly reduce the number of caching
> >  anomalies.
> 
> I'm assuming that option 2 applies to the ctime as well as the mtime,
> otherwise applying option 1 will void the effects of option 2?

Yes, of course.   My code didn't do that, but it will.
Thanks.

> 
> Note also that the client now has the 'lookupcache' mount option that
> can be set to ensure stricter revalidation of lookups.

I wasn't aware of that ... goes and looks ... that affects lookup but
not readdir.  So: useful, but not directly relevant to the current
situation. 

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found]     ` <18847.2286.101191.989726-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-02-20 20:14       ` J. Bruce Fields
  2009-02-20 21:04         ` Neil Brown
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2009-02-20 20:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs

On Sat, Feb 21, 2009 at 06:47:58AM +1100, Neil Brown wrote:
> 
> Trond seems happy with it now.  And the NFSv4 server effectively
> imposes it.  So maybe there are no remaining arguments against it ??

None from me.

> 
> > 
> > >  2/ The server could lie about the mtime.
> > >     In particular, if the mtime for a file was the same as the current
> > >     time - to the granularity of the filesystem storing the file -
> > >     then reduce the mtime that is reported by the smallest difference that
> > >     can be reported by the protocol.
> > >     That would be one microsecond for v2, and one nanosecond for v3
> > >     and v4.
> > 
> > Assume for simplicity's sake the time granularity is a second, and
> > measure time in seconds in the following examples:
> > 
> > Your proposal offers an improvement in this example (currently,
> > subsequent getattrs will not reflect the final modification):
> > 
> > 	t=0.1 modify
> > 	t=0.2 getattr
> > 	t=0.3 modify
> > 
> > Your proposal causes a regression in the following example:
> > 
> > 	t=0.1 modify
> > 	t=0.2 getattr
> > 	t=1.1 modify 
> > 	t=1.2 modify
> > 
> 
> I cannot see how there is a regression here.  Subsequent getattrs will
> show all modifications (if you wait at least one second).
> The first gettattr returns '-0.000000001', which is different from any
> previously returned mtime.
> Any subsequent getattr will return 0.999999999 or 1, depending on when
> it arrives.

Sorry, I guess I misread "smallest difference that can be reported by
the protocol" as "smallest difference supported by the filesystem"!

The former is currently *always* smaller than the latter, so you're
reporting an mtime that will never arise in any other way.  So you're
right, this results in strictly more cache revalidations in every case.

It may turn out that this mtime-1 case ends up being the typical case,
since a single logical file modification may appear as multiple writes
on the server, and those are likely to come in rapid succession.

A "make" that takes less than one second, on an ext3 export, may result
in targets with earlier mtimes than sources.

(Why not mtime+1?  And why not ctime?)

> The only possible regression is that sometimes we will flush the cache
> when previously we didn't.  In each case where that changes, the
> client can not possible know whether it needs to or not, so flushing
> rather than not flushing is the safest option.
> 
> > 
> > By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
> > changeattribute, which needs to be hooked up to the nfsd code.
> 
> Does it?
> I just had a quick look, found that it stores a 64 bit number on disk
> which is stored in inode->i_version.
> And this is incremented for directory operations.  But it doesn't seem
> to be changed for file operations.
> 
> But maybe I missed something.

After some mucking around with git and git grep... looks like the
inode_inc_iversion() calls do the job.  Note there's an i_version mount
option that's required.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found]       ` <18847.2578.480148.216735-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-02-20 20:32         ` Trond Myklebust
  2009-02-20 21:06           ` Neil Brown
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2009-02-20 20:32 UTC (permalink / raw)
  To: Neil Brown; +Cc: J. Bruce Fields, linux-nfs

On Sat, 2009-02-21 at 06:52 +1100, Neil Brown wrote:
> On Friday February 20, trond.myklebust@fys.uio.no wrote:
> > It should work fine. The ctime tracks the mtime in all cases except when
> > you setacl, setfattr, chown, chgrp, chmod, or touch the directory. Those
> > should be very rare operations for pretty much any workload...
> > 
> 
> Does that mean you'll take the patch ??

Not as it stands. You really want to be calling
nfs_force_lookup_revalidate() instead of invalidating the directory
contents.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-02-20 20:14       ` J. Bruce Fields
@ 2009-02-20 21:04         ` Neil Brown
       [not found]           ` <18847.6886.50844.260910-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2009-02-20 21:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Trond Myklebust, linux-nfs

On Friday February 20, bfields@fieldses.org wrote:
> On Sat, Feb 21, 2009 at 06:47:58AM +1100, Neil Brown wrote:
> > 
> > I cannot see how there is a regression here.  Subsequent getattrs will
> > show all modifications (if you wait at least one second).
> > The first gettattr returns '-0.000000001', which is different from any
> > previously returned mtime.
> > Any subsequent getattr will return 0.999999999 or 1, depending on when
> > it arrives.
> 
> Sorry, I guess I misread "smallest difference that can be reported by
> the protocol" as "smallest difference supported by the filesystem"!
> 
> The former is currently *always* smaller than the latter, so you're
> reporting an mtime that will never arise in any other way.  So you're
> right, this results in strictly more cache revalidations in every case.
> 
> It may turn out that this mtime-1 case ends up being the typical case,
> since a single logical file modification may appear as multiple writes
> on the server, and those are likely to come in rapid succession.

When you look at a file that is changing, yes.  When you look at a
file that hasn't changed for a while, you just get normal mtime.

> 
> A "make" that takes less than one second, on an ext3 export, may result
> in targets with earlier mtimes than sources.

No.  If they were both changed in the last second, they will both have
1 nanosecond subtracted from the mtime, so they will still look like
they have the same mtime.

> 
> (Why not mtime+1?  And why not ctime?)

Not "mtime + 1" because mtime should normally be monotonically
increasing.
If you touch a file into the future, then when that moment comes, its
mtime will jump backwards 1 nanosecond, then jump forwards again.  But
I don't think that is a problem.

Why not what ctime?
I agree with Trond that we need to apply this adjustment to ctime too.
But we cannot set mtime == ctime.  That would just be wrong.


> 
> > The only possible regression is that sometimes we will flush the cache
> > when previously we didn't.  In each case where that changes, the
> > client can not possible know whether it needs to or not, so flushing
> > rather than not flushing is the safest option.
> > 
> > > 
> > > By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
> > > changeattribute, which needs to be hooked up to the nfsd code.
> > 
> > Does it?
> > I just had a quick look, found that it stores a 64 bit number on disk
> > which is stored in inode->i_version.
> > And this is incremented for directory operations.  But it doesn't seem
> > to be changed for file operations.
> > 
> > But maybe I missed something.
> 
> After some mucking around with git and git grep... looks like the
> inode_inc_iversion() calls do the job.  Note there's an i_version mount
> option that's required.

Ahh.... Now if only they had called that function
"inode_inc_i_version", then my grep would have found it...

So in nfsd we can simply do:

 if (IS_I_VERSION(fhp->fh_dentry->d_inode)) {
	change_attribute = fhp->fh_dentry->d_inode->i_version;
 } else {
        change_attribute = kstat.ctime;
 }

??

NeilBrown

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-02-20 20:32         ` Trond Myklebust
@ 2009-02-20 21:06           ` Neil Brown
       [not found]             ` <18847.6988.418374.839185-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2009-02-20 21:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: J. Bruce Fields, linux-nfs

On Friday February 20, trond.myklebust@fys.uio.no wrote:
> On Sat, 2009-02-21 at 06:52 +1100, Neil Brown wrote:
> > On Friday February 20, trond.myklebust@fys.uio.no wrote:
> > > It should work fine. The ctime tracks the mtime in all cases except when
> > > you setacl, setfattr, chown, chgrp, chmod, or touch the directory. Those
> > > should be very rare operations for pretty much any workload...
> > > 
> > 
> > Does that mean you'll take the patch ??
> 
> Not as it stands. You really want to be calling
> nfs_force_lookup_revalidate() instead of invalidating the directory
> contents.

Surely it is 'aswell' rather than 'instead' ??
I need 
   invalid |= NFS_INO_INVALID_DATA;
to flush the readdir cache, and
   nfs_force_lookup_revalidate(inode)
to flush the lookup cache 
??

I'll respin the patches later today.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found]             ` <18847.6988.418374.839185-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-02-20 22:14               ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2009-02-20 22:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: J. Bruce Fields, linux-nfs

On Sat, 2009-02-21 at 08:06 +1100, Neil Brown wrote:
> On Friday February 20, trond.myklebust@fys.uio.no wrote:
> > On Sat, 2009-02-21 at 06:52 +1100, Neil Brown wrote:
> > > On Friday February 20, trond.myklebust@fys.uio.no wrote:
> > > > It should work fine. The ctime tracks the mtime in all cases except when
> > > > you setacl, setfattr, chown, chgrp, chmod, or touch the directory. Those
> > > > should be very rare operations for pretty much any workload...
> > > > 
> > > 
> > > Does that mean you'll take the patch ??
> > 
> > Not as it stands. You really want to be calling
> > nfs_force_lookup_revalidate() instead of invalidating the directory
> > contents.
> 
> Surely it is 'aswell' rather than 'instead' ??
> I need 
>    invalid |= NFS_INO_INVALID_DATA;
> to flush the readdir cache, and
>    nfs_force_lookup_revalidate(inode)
> to flush the lookup cache 
> ??

You're right. Sorry...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found]           ` <18847.6886.50844.260910-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2009-04-21 21:43             ` J. Bruce Fields
  2009-04-23 21:34               ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2009-04-21 21:43 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs, Theodore Tso

On Sat, Feb 21, 2009 at 08:04:38AM +1100, Neil Brown wrote:
> On Friday February 20, bfields@fieldses.org wrote:
> > On Sat, Feb 21, 2009 at 06:47:58AM +1100, Neil Brown wrote:
Once upon a time, I said:
> > > > By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
> > > > changeattribute, which needs to be hooked up to the nfsd code.
> > > 
> > > Does it?
> > > I just had a quick look, found that it stores a 64 bit number on disk
> > > which is stored in inode->i_version.
> > > And this is incremented for directory operations.  But it doesn't seem
> > > to be changed for file operations.
> > > 
> > > But maybe I missed something.
> > 
> > After some mucking around with git and git grep... looks like the
> > inode_inc_iversion() calls do the job.  Note there's an i_version mount
> > option that's required.
> 
> Ahh.... Now if only they had called that function
> "inode_inc_i_version", then my grep would have found it...
> 
> So in nfsd we can simply do:
> 
>  if (IS_I_VERSION(fhp->fh_dentry->d_inode)) {
> 	change_attribute = fhp->fh_dentry->d_inode->i_version;
>  } else {
>         change_attribute = kstat.ctime;
>  }
> 
> ??

There's also the pre-/post- op attributes to take care of.  I'm thinking
of applying something like the following.

Actually, I did this and then realized: I'm using IS_I_VERSION(inode) to
turn on use of the i_version as the change attribute on both files and
directories.  But it actually only makes a difference for files.

So I guess I should just be using i_version as the change attribute
unconditionally for directories?  (Will that work on any filesystem?)

--b.

commit 6bc1eaa4eb1bb8af1878be354962e6cedcaece60
Author: J. Bruce Fields <bfields@citi.umich.edu>
Date:   Thu Apr 16 17:33:25 2009 -0400

    nfsd: support ext4 i_version
    
    ext4 supports a real NFSv4 change attribute, which is bumped whenever
    the ctime would be updated, including times when two updates arrive
    within a jiffy of each other.  (Note that although ext4 has space for
    nanosecond-precision ctime, the real resolution is lower: it actually
    uses jiffies as the time-source.)  This ensures clients will invalidate
    their caches when they need to.
    
    There is some fear that keeping the i_version up-to-date could have
    performance drawbacks, so for now it's turned on only by a mount option.
    We hope to do something better eventually.
    
    Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
    Cc: Theodore Tso <tytso@mit.edu>

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 17d0dd9..01d4ec1 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -272,6 +272,7 @@ void fill_post_wcc(struct svc_fh *fhp)
 
 	err = vfs_getattr(fhp->fh_export->ex_path.mnt, fhp->fh_dentry,
 			&fhp->fh_post_attr);
+	fhp->fh_post_change = fhp->fh_dentry->d_inode->i_version;
 	if (err)
 		fhp->fh_post_saved = 0;
 	else
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 4a71fcd..12d36a7 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1490,13 +1490,41 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 	memcpy(p, ptr, nbytes);					\
 	p += XDR_QUADLEN(nbytes);				\
 }} while (0)
-#define WRITECINFO(c)		do {				\
-	*p++ = htonl(c.atomic);					\
-	*p++ = htonl(c.before_ctime_sec);				\
-	*p++ = htonl(c.before_ctime_nsec);				\
-	*p++ = htonl(c.after_ctime_sec);				\
-	*p++ = htonl(c.after_ctime_nsec);				\
-} while (0)
+
+static void write32(__be32 **p, u32 n)
+{
+	*(*p)++ = n;
+}
+
+static void write64(__be32 **p, u64 n)
+{
+	write32(p, (u32)(n >> 32));
+	write32(p, (u32)n);
+}
+
+static void write_change(__be32 **p, struct kstat *stat, struct inode *inode)
+{
+	if (IS_I_VERSION(inode)) {
+		write64(p, inode->i_version);
+	} else {
+		write32(p, stat->ctime.tv_sec);
+		write32(p, stat->ctime.tv_nsec);
+	}
+}
+
+static void write_cinfo(__be32 **p, struct nfsd4_change_info *c)
+{
+	write32(p, c->atomic);
+	if (c->change_supported) {
+		write64(p, c->before_change);
+		write64(p, c->after_change);
+	} else {
+		write32(p, c->before_ctime_sec);
+		write32(p, c->before_ctime_nsec);
+		write32(p, c->after_ctime_sec);
+		write32(p, c->after_ctime_nsec);
+	}
+}
 
 #define RESERVE_SPACE(nbytes)	do {				\
 	p = resp->p;						\
@@ -1849,16 +1877,9 @@ nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
 			WRITE32(NFS4_FH_PERSISTENT|NFS4_FH_VOL_RENAME);
 	}
 	if (bmval0 & FATTR4_WORD0_CHANGE) {
-		/*
-		 * Note: This _must_ be consistent with the scheme for writing
-		 * change_info, so any changes made here must be reflected there
-		 * as well.  (See xdr4.h:set_change_info() and the WRITECINFO()
-		 * macro above.)
-		 */
 		if ((buflen -= 8) < 0)
 			goto out_resource;
-		WRITE32(stat.ctime.tv_sec);
-		WRITE32(stat.ctime.tv_nsec);
+		write_change(&p, &stat, dentry->d_inode);
 	}
 	if (bmval0 & FATTR4_WORD0_SIZE) {
 		if ((buflen -= 8) < 0)
@@ -2364,7 +2385,7 @@ nfsd4_encode_create(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
 
 	if (!nfserr) {
 		RESERVE_SPACE(32);
-		WRITECINFO(create->cr_cinfo);
+		write_cinfo(&p, &create->cr_cinfo);
 		WRITE32(2);
 		WRITE32(create->cr_bmval[0]);
 		WRITE32(create->cr_bmval[1]);
@@ -2475,7 +2496,7 @@ nfsd4_encode_link(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_li
 
 	if (!nfserr) {
 		RESERVE_SPACE(20);
-		WRITECINFO(link->li_cinfo);
+		write_cinfo(&p, &link->li_cinfo);
 		ADJUST_ARGS();
 	}
 	return nfserr;
@@ -2493,7 +2514,7 @@ nfsd4_encode_open(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_op
 
 	nfsd4_encode_stateid(resp, &open->op_stateid);
 	RESERVE_SPACE(40);
-	WRITECINFO(open->op_cinfo);
+	write_cinfo(&p, &open->op_cinfo);
 	WRITE32(open->op_rflags);
 	WRITE32(2);
 	WRITE32(open->op_bmval[0]);
@@ -2771,7 +2792,7 @@ nfsd4_encode_remove(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
 
 	if (!nfserr) {
 		RESERVE_SPACE(20);
-		WRITECINFO(remove->rm_cinfo);
+		write_cinfo(&p, &remove->rm_cinfo);
 		ADJUST_ARGS();
 	}
 	return nfserr;
@@ -2784,8 +2805,8 @@ nfsd4_encode_rename(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
 
 	if (!nfserr) {
 		RESERVE_SPACE(40);
-		WRITECINFO(rename->rn_sinfo);
-		WRITECINFO(rename->rn_tinfo);
+		write_cinfo(&p, &rename->rn_sinfo);
+		write_cinfo(&p, &rename->rn_tinfo);
 		ADJUST_ARGS();
 	}
 	return nfserr;
diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
index afa1901..8f641c9 100644
--- a/include/linux/nfsd/nfsfh.h
+++ b/include/linux/nfsd/nfsfh.h
@@ -151,9 +151,15 @@ typedef struct svc_fh {
 	__u64			fh_pre_size;	/* size before operation */
 	struct timespec		fh_pre_mtime;	/* mtime before oper */
 	struct timespec		fh_pre_ctime;	/* ctime before oper */
+	/*
+	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
+	 *  to find out if it is valid.
+	 */
+	u64			fh_pre_change;
 
 	/* Post-op attributes saved in fh_unlock */
 	struct kstat		fh_post_attr;	/* full attrs after operation */
+	u64			fh_post_change; /* nfsv4 change; see above */
 #endif /* CONFIG_NFSD_V3 */
 
 } svc_fh;
@@ -298,6 +304,7 @@ fill_pre_wcc(struct svc_fh *fhp)
 		fhp->fh_pre_mtime = inode->i_mtime;
 		fhp->fh_pre_ctime = inode->i_ctime;
 		fhp->fh_pre_size  = inode->i_size;
+		fhp->fh_pre_change = inode->i_version;
 		fhp->fh_pre_saved = 1;
 	}
 }
diff --git a/include/linux/nfsd/xdr4.h b/include/linux/nfsd/xdr4.h
index f80d601..d0f050f 100644
--- a/include/linux/nfsd/xdr4.h
+++ b/include/linux/nfsd/xdr4.h
@@ -64,10 +64,13 @@ static inline bool nfsd4_has_session(struct nfsd4_compound_state *cs)
 
 struct nfsd4_change_info {
 	u32		atomic;
+	bool		change_supported;
 	u32		before_ctime_sec;
 	u32		before_ctime_nsec;
+	u64		before_change;
 	u32		after_ctime_sec;
 	u32		after_ctime_nsec;
+	u64		after_change;
 };
 
 struct nfsd4_access {
@@ -503,10 +506,16 @@ set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
 {
 	BUG_ON(!fhp->fh_pre_saved || !fhp->fh_post_saved);
 	cinfo->atomic = 1;
-	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
-	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
-	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
-	cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
+	cinfo->change_supported = IS_I_VERSION(fhp->fh_dentry->d_inode);
+	if (cinfo->change_supported) {
+		cinfo->before_change = fhp->fh_pre_change;
+		cinfo->after_change = fhp->fh_post_change;
+	} else {
+		cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
+		cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
+		cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
+		cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
+	}
 }
 
 int nfs4svc_encode_voidres(struct svc_rqst *, __be32 *, void *);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-04-21 21:43             ` J. Bruce Fields
@ 2009-04-23 21:34               ` J. Bruce Fields
  2009-04-23 21:52                 ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2009-04-23 21:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trond Myklebust, linux-nfs, Theodore Tso

On Tue, Apr 21, 2009 at 05:43:49PM -0400, bfields wrote:
> There's also the pre-/post- op attributes to take care of.  I'm thinking
> of applying something like the following.

Having heard no objections I'll go ahead and apply this, though I hope
we'll come up with a better solution for turning on the ext4 i_version
support in the future.  Also:

> Actually, I did this and then realized: I'm using IS_I_VERSION(inode) to
> turn on use of the i_version as the change attribute on both files and
> directories.  But it actually only makes a difference for files.
> 
> So I guess I should just be using i_version as the change attribute
> unconditionally for directories?  (Will that work on any filesystem?)

I'm still curious about this.

--b.

> commit 6bc1eaa4eb1bb8af1878be354962e6cedcaece60
> Author: J. Bruce Fields <bfields@citi.umich.edu>
> Date:   Thu Apr 16 17:33:25 2009 -0400
> 
>     nfsd: support ext4 i_version
>     
>     ext4 supports a real NFSv4 change attribute, which is bumped whenever
>     the ctime would be updated, including times when two updates arrive
>     within a jiffy of each other.  (Note that although ext4 has space for
>     nanosecond-precision ctime, the real resolution is lower: it actually
>     uses jiffies as the time-source.)  This ensures clients will invalidate
>     their caches when they need to.
>     
>     There is some fear that keeping the i_version up-to-date could have
>     performance drawbacks, so for now it's turned on only by a mount option.
>     We hope to do something better eventually.
>     
>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>     Cc: Theodore Tso <tytso@mit.edu>
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 17d0dd9..01d4ec1 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -272,6 +272,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>  
>  	err = vfs_getattr(fhp->fh_export->ex_path.mnt, fhp->fh_dentry,
>  			&fhp->fh_post_attr);
> +	fhp->fh_post_change = fhp->fh_dentry->d_inode->i_version;
>  	if (err)
>  		fhp->fh_post_saved = 0;
>  	else
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 4a71fcd..12d36a7 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1490,13 +1490,41 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  	memcpy(p, ptr, nbytes);					\
>  	p += XDR_QUADLEN(nbytes);				\
>  }} while (0)
> -#define WRITECINFO(c)		do {				\
> -	*p++ = htonl(c.atomic);					\
> -	*p++ = htonl(c.before_ctime_sec);				\
> -	*p++ = htonl(c.before_ctime_nsec);				\
> -	*p++ = htonl(c.after_ctime_sec);				\
> -	*p++ = htonl(c.after_ctime_nsec);				\
> -} while (0)
> +
> +static void write32(__be32 **p, u32 n)
> +{
> +	*(*p)++ = n;
> +}
> +
> +static void write64(__be32 **p, u64 n)
> +{
> +	write32(p, (u32)(n >> 32));
> +	write32(p, (u32)n);
> +}
> +
> +static void write_change(__be32 **p, struct kstat *stat, struct inode *inode)
> +{
> +	if (IS_I_VERSION(inode)) {
> +		write64(p, inode->i_version);
> +	} else {
> +		write32(p, stat->ctime.tv_sec);
> +		write32(p, stat->ctime.tv_nsec);
> +	}
> +}
> +
> +static void write_cinfo(__be32 **p, struct nfsd4_change_info *c)
> +{
> +	write32(p, c->atomic);
> +	if (c->change_supported) {
> +		write64(p, c->before_change);
> +		write64(p, c->after_change);
> +	} else {
> +		write32(p, c->before_ctime_sec);
> +		write32(p, c->before_ctime_nsec);
> +		write32(p, c->after_ctime_sec);
> +		write32(p, c->after_ctime_nsec);
> +	}
> +}
>  
>  #define RESERVE_SPACE(nbytes)	do {				\
>  	p = resp->p;						\
> @@ -1849,16 +1877,9 @@ nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
>  			WRITE32(NFS4_FH_PERSISTENT|NFS4_FH_VOL_RENAME);
>  	}
>  	if (bmval0 & FATTR4_WORD0_CHANGE) {
> -		/*
> -		 * Note: This _must_ be consistent with the scheme for writing
> -		 * change_info, so any changes made here must be reflected there
> -		 * as well.  (See xdr4.h:set_change_info() and the WRITECINFO()
> -		 * macro above.)
> -		 */
>  		if ((buflen -= 8) < 0)
>  			goto out_resource;
> -		WRITE32(stat.ctime.tv_sec);
> -		WRITE32(stat.ctime.tv_nsec);
> +		write_change(&p, &stat, dentry->d_inode);
>  	}
>  	if (bmval0 & FATTR4_WORD0_SIZE) {
>  		if ((buflen -= 8) < 0)
> @@ -2364,7 +2385,7 @@ nfsd4_encode_create(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
>  
>  	if (!nfserr) {
>  		RESERVE_SPACE(32);
> -		WRITECINFO(create->cr_cinfo);
> +		write_cinfo(&p, &create->cr_cinfo);
>  		WRITE32(2);
>  		WRITE32(create->cr_bmval[0]);
>  		WRITE32(create->cr_bmval[1]);
> @@ -2475,7 +2496,7 @@ nfsd4_encode_link(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_li
>  
>  	if (!nfserr) {
>  		RESERVE_SPACE(20);
> -		WRITECINFO(link->li_cinfo);
> +		write_cinfo(&p, &link->li_cinfo);
>  		ADJUST_ARGS();
>  	}
>  	return nfserr;
> @@ -2493,7 +2514,7 @@ nfsd4_encode_open(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_op
>  
>  	nfsd4_encode_stateid(resp, &open->op_stateid);
>  	RESERVE_SPACE(40);
> -	WRITECINFO(open->op_cinfo);
> +	write_cinfo(&p, &open->op_cinfo);
>  	WRITE32(open->op_rflags);
>  	WRITE32(2);
>  	WRITE32(open->op_bmval[0]);
> @@ -2771,7 +2792,7 @@ nfsd4_encode_remove(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
>  
>  	if (!nfserr) {
>  		RESERVE_SPACE(20);
> -		WRITECINFO(remove->rm_cinfo);
> +		write_cinfo(&p, &remove->rm_cinfo);
>  		ADJUST_ARGS();
>  	}
>  	return nfserr;
> @@ -2784,8 +2805,8 @@ nfsd4_encode_rename(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4_
>  
>  	if (!nfserr) {
>  		RESERVE_SPACE(40);
> -		WRITECINFO(rename->rn_sinfo);
> -		WRITECINFO(rename->rn_tinfo);
> +		write_cinfo(&p, &rename->rn_sinfo);
> +		write_cinfo(&p, &rename->rn_tinfo);
>  		ADJUST_ARGS();
>  	}
>  	return nfserr;
> diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
> index afa1901..8f641c9 100644
> --- a/include/linux/nfsd/nfsfh.h
> +++ b/include/linux/nfsd/nfsfh.h
> @@ -151,9 +151,15 @@ typedef struct svc_fh {
>  	__u64			fh_pre_size;	/* size before operation */
>  	struct timespec		fh_pre_mtime;	/* mtime before oper */
>  	struct timespec		fh_pre_ctime;	/* ctime before oper */
> +	/*
> +	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
> +	 *  to find out if it is valid.
> +	 */
> +	u64			fh_pre_change;
>  
>  	/* Post-op attributes saved in fh_unlock */
>  	struct kstat		fh_post_attr;	/* full attrs after operation */
> +	u64			fh_post_change; /* nfsv4 change; see above */
>  #endif /* CONFIG_NFSD_V3 */
>  
>  } svc_fh;
> @@ -298,6 +304,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>  		fhp->fh_pre_mtime = inode->i_mtime;
>  		fhp->fh_pre_ctime = inode->i_ctime;
>  		fhp->fh_pre_size  = inode->i_size;
> +		fhp->fh_pre_change = inode->i_version;
>  		fhp->fh_pre_saved = 1;
>  	}
>  }
> diff --git a/include/linux/nfsd/xdr4.h b/include/linux/nfsd/xdr4.h
> index f80d601..d0f050f 100644
> --- a/include/linux/nfsd/xdr4.h
> +++ b/include/linux/nfsd/xdr4.h
> @@ -64,10 +64,13 @@ static inline bool nfsd4_has_session(struct nfsd4_compound_state *cs)
>  
>  struct nfsd4_change_info {
>  	u32		atomic;
> +	bool		change_supported;
>  	u32		before_ctime_sec;
>  	u32		before_ctime_nsec;
> +	u64		before_change;
>  	u32		after_ctime_sec;
>  	u32		after_ctime_nsec;
> +	u64		after_change;
>  };
>  
>  struct nfsd4_access {
> @@ -503,10 +506,16 @@ set_change_info(struct nfsd4_change_info *cinfo, struct svc_fh *fhp)
>  {
>  	BUG_ON(!fhp->fh_pre_saved || !fhp->fh_post_saved);
>  	cinfo->atomic = 1;
> -	cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
> -	cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
> -	cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
> -	cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
> +	cinfo->change_supported = IS_I_VERSION(fhp->fh_dentry->d_inode);
> +	if (cinfo->change_supported) {
> +		cinfo->before_change = fhp->fh_pre_change;
> +		cinfo->after_change = fhp->fh_post_change;
> +	} else {
> +		cinfo->before_ctime_sec = fhp->fh_pre_ctime.tv_sec;
> +		cinfo->before_ctime_nsec = fhp->fh_pre_ctime.tv_nsec;
> +		cinfo->after_ctime_sec = fhp->fh_post_attr.ctime.tv_sec;
> +		cinfo->after_ctime_nsec = fhp->fh_post_attr.ctime.tv_nsec;
> +	}
>  }
>  
>  int nfs4svc_encode_voidres(struct svc_rqst *, __be32 *, void *);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-04-23 21:34               ` J. Bruce Fields
@ 2009-04-23 21:52                 ` Trond Myklebust
       [not found]                   ` <1240523577.8583.13.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2009-04-23 21:52 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Neil Brown, linux-nfs, Theodore Tso

On Thu, 2009-04-23 at 17:34 -0400, J. Bruce Fields wrote:
> On Tue, Apr 21, 2009 at 05:43:49PM -0400, bfields wrote:
> > Actually, I did this and then realized: I'm using IS_I_VERSION(inode) to
> > turn on use of the i_version as the change attribute on both files and
> > directories.  But it actually only makes a difference for files.
> > 
> > So I guess I should just be using i_version as the change attribute
> > unconditionally for directories?  (Will that work on any filesystem?)
> 
> I'm still curious about this.

Few of the common filesystems actually store dir->i_version on permanent
storage, and none of the ramfs type filesystems appear to set
dir->i_version at all.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
       [not found]                   ` <1240523577.8583.13.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-04-23 22:07                     ` J. Bruce Fields
  2009-04-23 22:24                       ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2009-04-23 22:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Neil Brown, linux-nfs, Theodore Tso

On Thu, Apr 23, 2009 at 05:52:57PM -0400, Trond Myklebust wrote:
> On Thu, 2009-04-23 at 17:34 -0400, J. Bruce Fields wrote:
> > On Tue, Apr 21, 2009 at 05:43:49PM -0400, bfields wrote:
> > > Actually, I did this and then realized: I'm using IS_I_VERSION(inode) to
> > > turn on use of the i_version as the change attribute on both files and
> > > directories.  But it actually only makes a difference for files.
> > > 
> > > So I guess I should just be using i_version as the change attribute
> > > unconditionally for directories?  (Will that work on any filesystem?)
> > 
> > I'm still curious about this.
> 
> Few of the common filesystems actually store dir->i_version on permanent
> storage,

Hopefully ext4 is, at least in the IS_I_VERSION(inode) case?

--b.

> and none of the ramfs type filesystems appear to set
> dir->i_version at all.
> 
> Cheers
>   Trond
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Thoughts about cache consistency and directories in particular.
  2009-04-23 22:07                     ` J. Bruce Fields
@ 2009-04-23 22:24                       ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2009-04-23 22:24 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Neil Brown, linux-nfs, Theodore Tso

On Thu, 2009-04-23 at 18:07 -0400, J. Bruce Fields wrote:
> On Thu, Apr 23, 2009 at 05:52:57PM -0400, Trond Myklebust wrote:
> > On Thu, 2009-04-23 at 17:34 -0400, J. Bruce Fields wrote:
> > > On Tue, Apr 21, 2009 at 05:43:49PM -0400, bfields wrote:
> > > > Actually, I did this and then realized: I'm using IS_I_VERSION(inode) to
> > > > turn on use of the i_version as the change attribute on both files and
> > > > directories.  But it actually only makes a difference for files.
> > > > 
> > > > So I guess I should just be using i_version as the change attribute
> > > > unconditionally for directories?  (Will that work on any filesystem?)
> > > 
> > > I'm still curious about this.
> > 
> > Few of the common filesystems actually store dir->i_version on permanent
> > storage,
> 
> Hopefully ext4 is, at least in the IS_I_VERSION(inode) case?

It looks to me as though ext4 always saves inode->i_version provided you
are using the default 256 byte inode size. Ted?

Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2009-04-23 22:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-20  2:18 Thoughts about cache consistency and directories in particular Neil Brown
2009-02-20 18:23 ` J. Bruce Fields
2009-02-20 19:47   ` Neil Brown
     [not found]     ` <18847.2286.101191.989726-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-02-20 20:14       ` J. Bruce Fields
2009-02-20 21:04         ` Neil Brown
     [not found]           ` <18847.6886.50844.260910-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-04-21 21:43             ` J. Bruce Fields
2009-04-23 21:34               ` J. Bruce Fields
2009-04-23 21:52                 ` Trond Myklebust
     [not found]                   ` <1240523577.8583.13.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-04-23 22:07                     ` J. Bruce Fields
2009-04-23 22:24                       ` Trond Myklebust
     [not found] ` <18846.4842.625445.980681-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-02-20 18:56   ` Trond Myklebust
2009-02-20 19:52     ` Neil Brown
     [not found]       ` <18847.2578.480148.216735-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-02-20 20:32         ` Trond Myklebust
2009-02-20 21:06           ` Neil Brown
     [not found]             ` <18847.6988.418374.839185-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-02-20 22:14               ` Trond Myklebust

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.