* How to manage shared persistent local caching (FS-Cache) with NFS?
@ 2007-12-05 17:11 David Howells
  2007-12-05 17:49 ` Jon Masters
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: David Howells @ 2007-12-05 17:11 UTC (permalink / raw)
  To: Peter Staubach, Trond Myklebust
  Cc: dhowells, Steve Dickson, nfsv4, linux-kernel


Okay...  I'm getting to the point where I want to release my local caching
patches again and have NFS work with them.  This means making NFS mounts share
or not share appropriately - something that's engendered a fair bit of
argument.

So I'd like to solicit advice on how best to deal with this problem.

Let me explain the problem in more detail.


================
CURRENT PRACTICE
================

As the kernel currently stands, coherency is ignored for mounts that have
slightly different combinations of parameters, even if these parameters just
affect the properties of network "connection" used or just mark a superblock
as being read-only.

Consider the case of a file remotely available by NFS.  Imagine the client sees
three different views of this file (they could be by three overlapping mounts,
or by three hardlinks or some combination thereof).

This is how NFS currently operates without any superblock sharing:

				+---------+
    Object on server --->	|	  |
				|  inode  |
				|	  |
				+---------+
				    /|\
				   / | \
				  /  |	\
				 /   |	 \
				/    |	  \
			       /     |	   \
			      /	     |	    \
			     /	     |	     \
			    /	     |	      \
			   /	     |	       \
			  /	     |		\
			 |	     |		 |
			 |	     |		 |
 :::::::::::::NFS::::::::|:::::::::::|:::::::::::|:::::::::::::::::::::::::::::
			 |	     |		 |
			 |	     |		 |
			 |	     |		 |
   +---------+	    +---------+	     |		 |
   |	     |	    |	      |	     |		 |
   | mount 1 |----->| super 1 |	     |		 |
   |	     |	    |	      |	     |		 |
   +---------+	    +---------+	     |		 |
				     |		 |
				     |		 |
   +---------+			+---------+	 |
   |	     |			|	  |	 |
   | mount 2 |----------------->| super 2 |	 |
   |	     |			|	  |	 |
   +---------+			+---------+	 |
						 |
						 |
   +---------+				    +---------+
   |	     |				    |	      |
   | mount 3 |----------------------------->| super 3 |
   |	     |				    |	      |
   +---------+				    +---------+

Each view of the file on the client winds up with a separate inode in a
separate superblock and with a separate pagecache.  As far as the client kernel
is concerned, they *are* three different files.  Any incoherency effects are
ignored by the kernel and if they cause a userspace application a problem,
that's just too bad.

Generally, however, this is not a problem because:

  (a) an application is unlikely to be attempting to manipulate multiple views
      of a file simultaneously and

  (b) cross-view hard links haven't been and aren't used that much.


=============================
POSSIBLE FS-CACHE SCENARIO #1
=============================

However, now we're introducing persistent local caching into the mix.  That
means we can no longer ignore such remote possibilities - they are possible,
therefore we have to deal with them, whether we like it or not.

The seemingly simplest way to support this is to give each copy of the remote
file its own cache:

				+---------+
    Object on server --->	|	  |
				|  inode  |
				|	  |
				+---------+
				    /|\
				   / | \
				  /  |	\
				 /   |	 \
				/    |	  \
			       /     |	   \
			      /	     |	    \
			     /	     |	     \
			    /	     |	      \
			   /	     |	       \
			  /	     |		\
			 |	     |		 |
			 |	     |		 |
 :::::::::::::NFS::::::::|:::::::::::|:::::::::::|:::::::::::::::::::::::::::::
			 |	     |		 |	       :
			 |	     |		 |	       : FS-Cache
			 |	     |		 |	       :
   +---------+	    +---------+	     |		 |	       :    +---------+
   |	     |	    |	      |	     |		 |	       :    |	      |
   | mount 1 |----->| super 1 |------|-----------|----------------->| cache 1 |
   |	     |	    |	      |	     |		 |	       :    |	      |
   +---------+	    +---------+	     |		 |	       :    +---------+
				     |		 |	       :
				     |		 |	       :
   +---------+			+---------+	 |	       :    +---------+
   |	     |			|	  |	 |	       :    |	      |
   | mount 2 |----------------->| super 2 |------|----------------->| cache 2 |
   |	     |			|	  |	 |	       :    |	      |
   +---------+			+---------+	 |	       :    +---------+
						 |	       :
						 |	       :
   +---------+				    +---------+	       :    +---------+
   |	     |				    |	      |	       :    |	      |
   | mount 3 |----------------------------->| super 3 |------------>| cache 3 |
   |	     |				    |	      |	       :    |	      |
   +---------+				    +---------+	       :    +---------+

This has one immediately obvious problem: it stores redundant data in the
cache.  We end up with three copies of the same data stored in the cache,
reducing the cache efficiency.

There's a further problem that is less obvious: the cache is persistent - and
so the links from the client inodes into the cache must be reformed for
subsequent mounts.  This is not possible purely from the NFS attributes of the
server file, since each client file corresponds to the same server file.

To get around that, we'd have to add some of the purely client knowledge into
the key, such as root filehandle of a mount or local mount point.  However,
neither of these is sufficient:

 (*) The root filehandle may be mounted multiple times with different NFS
     connection parameters, so all of these must be included too.

 (*) The local mount point depends on the namespace in which it is made, and
     that is anonymous and can't contribute to the key.

Alternatively, we could require user intervention to map the files to their
respective caches (probably at the mount level), but that is in itself a
problem.

Furthermore, should disconnected operation be implemented, we then have the
problems of (a) how to synchronise changes made to the same file through
separate views, and (b) how to propagate changes between views without being
able to use the server as an intermediary.


=============================
POSSIBLE FS-CACHE SCENARIO #2
=============================

So, ideally, what we want to do is to share the local cache.  We could do this
by mapping each of the multiple client views to a single local cache object:

				+---------+
    Object on server --->	|	  |
				|  inode  |
				|	  |
				+---------+
				    /|\
				   / | \
				  /  |	\
				 /   |	 \
				/    |	  \
			       /     |	   \
			      /	     |	    \
			     /	     |	     \
			    /	     |	      \
			   /	     |	       \
			  /	     |		\
			 |	     |		 |
			 |	     |		 |
 :::::::::::::NFS::::::::|:::::::::::|:::::::::::|:::::::::::::::::::::::::::::
			 |	     |		 |	       :
			 |	     |		 |	       : FS-Cache
			 |	     |		 |	       :
   +---------+	    +---------+	     |		 |	       :
   |	     |	    |	      |	     |		 |	       :
   | mount 1 |----->| super 1 |------|-----------|------       :
   |	     |	    |	      |	     |		 |	\      :
   +---------+	    +---------+	     |		 |	 \     :
				     |		 |	  \    :
				     |		 |	   \   :
   +---------+			+---------+	 |	    \  :    +---------+
   |	     |			|	  |	 |	     \ :    |	      |
   | mount 2 |----------------->| super 2 |------|----------------->|  cache  |
   |	     |			|	  |	 |	     / :    |	      |
   +---------+			+---------+	 |	    /  :    +---------+
						 |	   /   :
						 |	  /    :
   +---------+				    +---------+	 /     :
   |	     |				    |	      | /      :
   | mount 3 |----------------------------->| super 3 |-       :
   |	     |				    |	      |	       :
   +---------+				    +---------+	       :

However, this means the kernel now has to deal with coherency maintenance
because it no longer treats the three views of the server file as being
completely separate, but on the other hand, the persistent-store matching
problem is no longer present.

The coherency problems arise from a number of facets:

 (1) Even if all three mounts are read-only, the client views may be updated at
     different times when the server file changes.  When one view sees a
     change, the on-disk cache must be flushed, and all the other views must be
     notified that the mappings between extant pages and the cache are now
     broken.  This could, perhaps, be rendered down to a change perceived by
     one view causing the pagecache on all the other views to be zapped.

 (2) How do we update the cache when writes are made to two or more client
     views?  We could require the changes to a view to be written back to the
     server before any other views are changed, but what about disconnected
     operation?

Basically, we end up treating the inodes that back multiple views of a single
server file as being the same inode - and maintaining coherency manually.

Furthermore, we also require the infrastructure to support all of this, and
that requires more memory and processing time to maintain, not to mention the
introduction of cross-inode deadlock potential.


=============================
POSSIBLE FS-CACHE SCENARIO #3
=============================

In fact, the ideal solution is to share client superblocks, inodes and
pagecache content too:

				+---------+
    Object on server --->	|	  |
				|  inode  |
				|	  |
				+---------+
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
				     |
 :::::::::::::NFS::::::::::::::::::::|:::::::::::::::::::::::::::::::::::::::::
				     |			       :
				     |			       : FS-Cache
				     |			       :
   +---------+			     |			       :
   |	     |			     |			       :
   | mount 1 |----------	     |			       :
   |	     |		\	     |			       :
   +---------+		 \	     |			       :
			  \	     |			       :
			   \	     |			       :
   +---------+		    \	+---------+		       :    +---------+
   |	     |		     \	|	  |		       :    |	      |
   | mount 2 |----------------->|  super  |------------------------>|  cache  |
   |	     |		     /	|	  |		       :    |	      |
   +---------+		    /	+---------+		       :    +---------+
			   /				       :
			  /				       :
   +---------+		 /				       :
   |	     |		/				       :
   | mount 3 |----------				       :
   |	     |						       :
   +---------+						       :

This renders both the intraclient coherency problem and the cache object
reconnection problem nonexistent within the client simply by virtue of only
having one client inode represent *all* the views requested of the server file.

There are other coherency problems, but largely we can't deal with those within
NFS because they involve multiple clients and the NFS protocol doesn't provide
us with the tools.

The downside of this is that each shared superblock only has one NFS connection
to the server, and so only one set of connection parameters can be used.
However, since persistent local caching is novel to Linux, I think it is
entirely reasonable to overrule attempts to use different connection
parameters on mounts that are to be shared and cached.



====

Okay...  So that's the problem.  Anyone got any suggestions?

My preferred solution is to take any NFS superblock which has fscaching enabled
and forcibly share it with any potentially overlapping superblock that also has
fscaching enabled.  That means the parameters of subsequent mounts are
discarded in favour of retaining the parameters of the first mount in an
fscached set.

The R/O mount flag can be dealt with by moving the read-only state into the
vfsmount rather than having it be a property of the superblock.  The
superblock would then be read-only only if all its vfsmounts are also
read-only.
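
As a rough illustration of that rule (a sketch only: the per-superblock mount
list and the read-only mount flag below are assumptions for illustration, not
the current VFS API):

/*
 * Sketch only: a superblock is effectively read-only while every
 * vfsmount over it is read-only.  "s_mounts" and "mnt_instance" are
 * hypothetical names for a per-superblock list of vfsmounts.
 */
static bool sb_effectively_readonly(struct super_block *sb)
{
	struct vfsmount *mnt;

	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
		if (!(mnt->mnt_flags & MNT_READONLY))
			return false;	/* one writable mount => treat sb as R/W */
	}
	return true;			/* all mounts over this sb are R/O */
}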


There's one other thing to consider:

I've been asked to make the granularity of caching controllable at a directory
or file level.  However, this goes against passing the parameter in the mount
command.  There is an advantage, though: if your NFS mounts are dictated by
the automounter, then enabling fscache in the mount options is not necessarily
what you want to do.

Would it be reasonable to have an outside way of setting directory options?
For instance, if there was a table like this:

	FS	SERVER	VOLUME	DIR		OPTIONS
	=======	=======	=======	===============	=========================
	nfs	home0	-	/home/*		fscache
	afs	redhat	data	/data/*		fscache

This could then be loaded into the kernel as a set of rules which directory
lookup by the filesystem involved could attempt to match and apply.
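
To make that concrete, here's a minimal sketch of how such a rule table might
look and be matched in the kernel.  Every name below is made up for
illustration (including the wildcard helper); nothing like this exists yet:

/* Sketch of the proposed per-directory cache-control rules. */
struct fscache_dir_rule {
	const char	*fs;		/* "nfs", "afs", ... */
	const char	*server;	/* e.g. "home0" */
	const char	*volume;	/* e.g. "data", or NULL for "-" */
	const char	*dir;		/* e.g. "/home/*" */
	unsigned long	options;	/* e.g. FSCACHE_RULE_CACHE */
};

/*
 * Called from the filesystem's directory lookup path to decide whether
 * the object being looked up should be cached.  wildcard_match() stands
 * in for whatever pattern matcher a real implementation would use.
 */
static bool fscache_rule_applies(const struct fscache_dir_rule *rule,
				 const char *fs, const char *server,
				 const char *volume, const char *path)
{
	if (strcmp(rule->fs, fs) != 0 || strcmp(rule->server, server) != 0)
		return false;
	if (rule->volume && (!volume || strcmp(rule->volume, volume) != 0))
		return false;
	return wildcard_match(rule->dir, path);
}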

David


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-05 17:11 How to manage shared persistent local caching (FS-Cache) with NFS? David Howells
@ 2007-12-05 17:49 ` Jon Masters
  2007-12-05 18:03 ` David Howells
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Jon Masters @ 2007-12-05 17:49 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Staubach, Trond Myklebust, Steve Dickson, nfsv4, linux-kernel

On Wed, 2007-12-05 at 17:11 +0000, David Howells wrote:

> The downside of this is that each shared superblock only has one NFS connection
> to the server, and so only one set of connection parameters can be used.
> However, since persistent local caching is novel to Linux, I think that it is
> entirely reasonable to overrule the attempts to make mounts with different
> parameters if they are to be shared and cached.

I think the shared superblock approach is the right one, but I'm a
little concerned that there would now be different behavior for fscache
and non-cached setups. Not sure of any better idea though.

> The R/O mount flag can be dealt with by moving readonlyness into the vfsmount
> rather than having it a property of the superblock.  The superblock would then
> be read-only only if all its vfsmounts are also read-only.

Given that, how many connection parameters are there that are likely to
actually differ on the same client, talking to the same server? Really?

> Would it be reasonable to have an outside way of setting directory options?
> For instance, if there was a table like this:
> 
> 	FS	SERVER	VOLUME	DIR		OPTIONS
> 	=======	=======	=======	===============	=========================
> 	nfs	home0	-	/home/*		fscache
> 	afs	redhat	data	/data/*		fscache
> 
> This could then be loaded into the kernel as a set of rules which directory
> lookup by the filesystem involved could attempt to match and apply.

You could store the table in a NIS map, for example, and a udev rule or
similar could trigger to load it later.

Jon.




* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-05 17:11 How to manage shared persistent local caching (FS-Cache) with NFS? David Howells
  2007-12-05 17:49 ` Jon Masters
@ 2007-12-05 18:03 ` David Howells
  2007-12-05 19:54 ` Chuck Lever
  2007-12-06  1:22 ` David Howells
  3 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2007-12-05 18:03 UTC (permalink / raw)
  To: Jon Masters
  Cc: dhowells, Peter Staubach, Trond Myklebust, Steve Dickson, nfsv4,
	linux-kernel

Jon Masters <jonathan@jonmasters.org> wrote:

> I think the shared superblock approach is the right one, but I'm a
> little concerned that there would now be different behavior for fscache
> and non-cached setups. Not sure of any better idea though.

The behaviour varies a bit anyway because there's a cache...

> > The R/O mount flag can be dealt with by moving readonlyness into the
> > vfsmount rather than having it a property of the superblock.  The
> > superblock would then be read-only only if all its vfsmounts are also
> > read-only.
> 
> Given that, how many connection parameters are there that are likely to
> actually differ on the same client, talking to the same server? Really?

I don't have figures on that, but I do know people have complained about it
for non-cached conditions.

> You could store the table in a NIS map, for example, and a udev rule or
> similar could trigger to load it later.

My point was meant to be that the presence and coverage of a cache is more
likely to reflect the client machine than the NIS map for the NFS automounts
would.  You wouldn't necessarily want to store this table in NIS.

David


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-05 17:11 How to manage shared persistent local caching (FS-Cache) with NFS? David Howells
  2007-12-05 17:49 ` Jon Masters
  2007-12-05 18:03 ` David Howells
@ 2007-12-05 19:54 ` Chuck Lever
  2007-12-06  1:22 ` David Howells
  3 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2007-12-05 19:54 UTC (permalink / raw)
  To: David Howells; +Cc: Peter Staubach, Trond Myklebust, nfsv4, linux-kernel

On Dec 5, 2007, at 12:11 PM, David Howells wrote:
> Okay...  I'm getting to the point where I want to release my local caching
> patches again and have NFS work with them.  This means making NFS mounts
> share or not share appropriately - something that's engendered a fair bit
> of argument.
>
> So I'd like to solicit advice on how best to deal with this problem.
>
> Let me explain the problem in more detail.
>
>
> ================
> CURRENT PRACTICE
> ================
>
> As the kernel currently stands, coherency is ignored for mounts that have
> slightly different combinations of parameters, even if these parameters
> just affect the properties of network "connection" used or just mark a
> superblock as being read-only.
>
> Consider the case of a file remotely available by NFS.  Imagine the client
> sees three different views of this file (they could be by three
> overlapping mounts, or by three hardlinks or some combination thereof).
>
> This is how NFS currently operates without any superblock sharing:
>
> [ASCII diagram snipped: three mounts, each with its own superblock and
>  inode, all corresponding to the one object on the server]
>
> Each view of the file on the client winds up with a separate inode in a
> separate superblock and with a separate pagecache.  As far as the client
> kernel is concerned, they *are* three different files.  Any incoherency
> effects are ignored by the kernel and if they cause a userspace
> application a problem, that's just too bad.
>
> Generally, however, this is not a problem because:
>
>   (a) an application is unlikely to be attempting to manipulate multiple
>       views of a file simultaneously and
>
>   (b) cross-view hard links haven't been and aren't used that much.
>
>
> =============================
> POSSIBLE FS-CACHE SCENARIO #1
> =============================
>
> However, now we're introducing persistent local caching into the mix.
> That means we can no longer ignore such remote possibilities - they are
> possible, therefore we have to deal with them, whether we like it or not.


I don't see how persistent local caching means we can no longer ignore (a)
and (b) above.  Can you amplify this a bit?  Nothing you say in the rest of
your proposal convinces me that having multiple caches for the same export
is really more than a theoretical issue.

Frankly, the reason why admins mount exports multiple times is precisely
because they want different applications to access the files in different
ways.  Admins *want* one mount point to be available ro, and another rw.
They *want* one mount point to use 'noac' and another not to.  They *want*
multiple sockets, more RPC slots, and unique caches for different
applications.  No one would go to the trouble of mounting an export again,
using different options, unless that's precisely the behavior that they
wanted.

This is actually a feature of NFS.  It's used as a standard part of
production environments, for example, when running Oracle databases on NFS.
One mount point is rw and is used by the database engine.  Another mount
point is ro and is used for back-up utilities, like RMAN.

Another example is local software distribution.  One mount point is ro,
and is accessed by normal users.  Another mount point accesses the same
export rw, and is used by administrators who provide updates for the
software.

As useful as the feature is, one can also argue that mounting the same
export multiple times is infrequent in most normal use cases.  Practically
speaking, why do we really need to worry about it?

The real problem here is that the NFS protocol itself does not support
strong cache coherence.  I don't see why the Linux kernel must fix that
problem.

The only real problem with the first scenario is that you may have more
than one copy of a file in the persistent cache.  How often will that be
the case?  Since the local persistent cache is probably disk-based and thus
large relative to memory, what's the problem with using a little extra
space?

The problems you ascribe to your second and third caching scenarios
(deadlocking and reconnection) are, however, real and substantial.  You
don't have these issues when caching each mount point separately, right?

It seems to me that implementing the first scenario is (a) straightforward,
(b) has fewer runtime risks (ie deadlocks), (c) doesn't take away features
that some people still use, and (d) solves more than 80% of the issues here
(80/20 rule of thumb).

Lastly, there's already a mount option that allows admins to control
whether the page and attribute caches are shared -- "sharecache".  Is this
mount option not adequate for persistent caching?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-05 17:11 How to manage shared persistent local caching (FS-Cache) with NFS? David Howells
                   ` (2 preceding siblings ...)
  2007-12-05 19:54 ` Chuck Lever
@ 2007-12-06  1:22 ` David Howells
  2007-12-06 18:28   ` Chuck Lever
  2007-12-06 20:00   ` David Howells
  3 siblings, 2 replies; 9+ messages in thread
From: David Howells @ 2007-12-06  1:22 UTC (permalink / raw)
  To: Chuck Lever
  Cc: dhowells, Peter Staubach, Trond Myklebust, nfsv4, linux-kernel


Chuck Lever <chuck.lever@oracle.com> wrote:

> I don't see how persistent local caching means we can no longer ignore (a)
> and (b) above.  Can you amplify this a bit?

How about I put it like this.  There are two principal problems to be dealt
with:

 (1) Reconnection.

     Imagine that the administrator requests a mount that uses part of a cache.
     The client machine is at some time later rebooted and the administrator
     requests the same mount again.

     Since the cache is meant to be persistent, the administrator is at liberty
     to expect that the second mount immediately begins to use the data that
     the first mount left in the cache.

     For this to occur, the second mount has to be able to determine which part
     of the cache the first mount was using and request to use the same piece
     of cache.

     To aid with this, FS-Cache has the concept of a 'key'.  Each object in the
     cache is addressed by a unique key.  NFS currently builds the key for a
     file's cache object from: "NFS", the server IP address and port, the NFS
     version, and the file handle for that file (see the sketch after this
     list).

 (2) Cache coherency.

     Imagine that the administrator requests a mount that uses part of a
     cache.  The administrator then makes a second mount that overlaps the
     first, maybe because it's a different part of the same server export or
     maybe it uses the same part, but with different parameters.

     Imagine further that a particular server file is accessible through both
     mountpoints.  This means that the kernel, and therefore the user, has two
     views of the one file.

     If the kernel maintains these two views of the files as totally separate
     copies, then coherency is mostly not a kernel problem, it's an application
     problem - as it is now.

     However, if these two views are shared at any level - such as if they
     share an FS-Cache cache object - then coherency can be a problem.

     The two simplest solutions to the coherency problem are (a) to enforce
     sharing at all levels (superblocks, inodes, cache objects), (b) to enforce
     non-sharing.  In-between states are possible, but are much trickier and
     more complex.

     Note that cache coherency management can't be entirely avoided: upon
     reconnection a cache object has to be checked against the server to see
     whether it's still valid.
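
As a purely illustrative sketch of the key components listed in (1) - this is
not the actual NFS/FS-Cache index definition, just the fields named above
packed into one structure:

struct nfs_fscache_key_sketch {
	char		prefix[4];	/* "NFS" */
	struct in_addr	server_addr;	/* server IP address (IPv4 for brevity) */
	__be16		server_port;
	u32		nfs_version;	/* 2, 3 or 4 */
	unsigned int	fh_size;
	unsigned char	fh[NFS_MAXFHSIZE];	/* opaque file handle data */
};

Reconnection then amounts to looking the same key up again in the cache after
a reboot.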

Note that both these problems only really exist because the cache is
persistent between mounts.  If it were volatile between mounts, then (1) would
not exist, and (2) can be ignored as it is now.

There are three obvious ways of dealing with the problems (ignoring the fact
that all cases have on-reconnection coherency to deal with whatever):

 (a) No sharing at all.
     
     Cache coherency is what it is now with NFS, but reconnection must be
     managed.  A key must be generated for each mount to distinguish that mount
     from an overlapping mount that might contain the same files.

     These keys must be unique (and uniqueness must be enforced) unless two
     superblocks are guaranteed disjoint (eg: on different servers), or are
     guaranteed to share anyway (eg: exact same parameter sets and nosharecache
     not specified).

 (b) Fully shared.

     Cache coherency is a non-issue.  Reconnection is a non-issue.  Any
     particular server inode is guaranteed to be represented by a single inode
     on the client, both in the superblock and the pagecache, and by a single
     FS-Cache cache object.

     The downside of this is that sharing must take priority over different
     connection parameters.  R/O vs R/W can be dealt with relatively easily as
     I believe it's a local phenomenon, and is dealt with before the filesystem
     is consulted.  There are patches to do this.

 (c) Implicit full sharing between cached mountpoints; uncached mountpoints
     need not be shared.

     Cached mountpoints have the properties of (b), uncached mountpoints are
     left to themselves.

Note that redundant disk usage is undesirable, but unlikely to cause a real
problem, such as an oops.  Non-unique keys, on the other hand, are a problem.

Having non-shared local inodes sharing cache objects causes even more problems,
and I don't want to go there.

> Nothing you say in the rest of your proposal convinces me that having
> multiple caches for the same export is really more than a theoretical issue.

Okay.  So how do you do reconnection?

The simplest way from what I see is to require that the administrator specify
everything, but this is probably not what you want if you're distributing NFS
mounts by NIS, say.

The next simplest way is to bind all the differentiation parameters (see
nfs_compare_mount_options()) into a key and use that, plus a uniquifier from
the administrator if NFS_MOUNT_UNSHARED is set.
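
For illustration, such a per-mount key might fold in fields like these; the
names are made up, and the real differentiating set would be whatever
nfs_compare_mount_options() examines:

struct nfs_mount_key_sketch {
	u32	flags;			/* soft/hard, intr, noac, ... */
	u32	rsize, wsize;
	u32	acregmin, acregmax;
	u32	acdirmin, acdirmax;
	int	proto;			/* tcp vs udp */
	char	uniquifier[16];		/* admin-supplied if NFS_MOUNT_UNSHARED */
};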

> Frankly, the reason why admins mount exports multiple times is precisely
> because they want different applications to access the  files in different
> ways.

So I've gathered.

> This is actually a feature of NFS.  It's used as a standard part of production
> environments, for example, when running Oracle databases  on NFS.  One mount
> point is rw and is used by the database engine.   Another mount point is ro
> and is used for back-up utilities, like RMAN.

R/O vs R/W is a special case.  There are patches out there to deal with it by
moving the R/O flag off into vfsmount.

> As useful as the feature is, one can also argue that mounting the same export
> multiple times is infrequent in most normal use cases.   Practically speaking,
> why do we really need to worry about it?

Because it's possible.  Because it has to be considered.  Because, as you said,
people do it.  Because if I don't deal with it, the kernel will oops when NFS
asks FS-Cache to do something it doesn't support.

I can't just say: "Well, it'll oops if you configure your NFS shares like that,
so don't.  It's not worth me implementing round it.".

> The real problem here is that the NFS protocol itself does not support strong
> cache coherence.  I don't see why the Linux kernel  must fix that problem.

So you're arguing there shouldn't be local caching for NFS?  Or that there
shouldn't be persistent local caching for NFS?

> The problems you ascribe to your second and third caching scenarios
> (deadlocking and reconnection)

The second only.  Neither occur with the third scenario.

> are, however, real and substantial.  You don't have these issues when caching
> each mount point separately, right?

Not right.  Deadlocking, no; reconnection, YES.  As I said:

	There's a further problem that is less obvious: the cache is persistent
	- and so the links from the client inodes into the cache must be
	reformed for subsequent mounts.

Reconnection is straightforward only in the third scenario because it
eliminates the alternative possibilities entirely.

> It seems to me that implementing the first scenario is (a) straightforward,
> (b) has fewer runtime risks (ie deadlocks), (c)  doesn't take away features
> that some people still use, and (d) solves  more than 80% of the issues here
> (80/20 rule of thumb).

It seems straightforward at first glance, but you still have to deal with
reconnection.

As for (b), the third scenario has fewest risks and deadlock possibilities by
virtue of making sure they don't arise in the first place.

(c) is a valid point.

(d) isn't true.  Reconnection.

> Lastly, there's already a mount option that allows admins to control whether
> the page and attribute caches are shared -- "sharecache".  Is  this mount
> option not adequate for persistent caching?

Adequate in what way?  It doesn't currently automatically guarantee sharing of
overlapping superblocks.  It merely negates nosharecache, which explicitly
disables cache sharing.

David


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-06  1:22 ` David Howells
@ 2007-12-06 18:28   ` Chuck Lever
  2007-12-06 20:00   ` David Howells
  1 sibling, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2007-12-06 18:28 UTC (permalink / raw)
  To: David Howells; +Cc: Peter Staubach, Trond Myklebust, nfsv4, linux-kernel

Hi David-

On Dec 5, 2007, at 8:22 PM, David Howells wrote:
> Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> I don't see how persistent local caching means we can no longer  
>> ignore (a)
>> and (b) above.  Can you amplify this a bit?
>
> How about I put it like this.  There are two principal problems to  
> be dealt
> with:
>
>  (1) Reconnection.
>
>      Imagine that the administrator requests a mount that uses part  
> of a cache.
>      The client machine is at some time later rebooted and the  
> administrator
>      requests the same mount again.
>
>      Since the cache is meant to be persistent, the administrator  
> is at liberty
>      to expect that the second mount immediately begins to use the  
> data that
>      the first mount left in the cache.
>
>      For this to occur, the second mount has to be able to  
> determine which part
>      of the cache the first mount was using and request to use the  
> same piece
>      of cache.
>
>      To aid with this, FS-Cache has the concept of a 'key'.  Each  
> object in the
>      cache is addressed by a unique key.  NFS currently builds a  
> key to the
>      cache object for a file from: "NFS", the server IP address,  
> port and NFS
>      version and the file handle for that file.

Why not use the fsid as well?  The NFS client already uses the fsid to
detect when it is crossing a server-side mount point.  Fsids are supposed
to be stable over server reboots (although sometimes they aren't; it could
be made a condition of supporting FS-cache on clients).

I also note the inclusion of the server IP address in the key.  For
multi-homed servers, you have the same unavoidable cache aliasing issues if
the client mounts the same server and export via different server network
interfaces.

>  (2) Cache coherency.
>
>      Imagine that the administrator requests a mount that uses part  
> of a
>      cache.  The administrator then makes a second mount that  
> overlaps the
>      first, maybe because it's a different part of the same server  
> export or
>      maybe it uses the same part, but with different parameters.
>
>      Imagine further that a particular server file is accessible  
> through both
>      mountpoints.  This means that the kernel, and therefore the  
> user, has two
>      views of the one file.
>
>      If the kernel maintains these two views of the files as  
> totally separate
>      copies, then coherency is mostly not a kernel problem, it's an  
> application
>      problem - as it is now.
>
>      However, if these two views are shared at any level - such as  
> if they
>      share an FS-Cache cache object - then coherency can be a problem.

Is it a problem because, if there are multiple copies of the same  
remote file in its cache, then FS-cache doesn't know, upon  
reconnection, which item to match against a particular remote file?

I think that's actually going to be a fairly typical situation --  
you'll have conditions where some cache items will become orphaned,  
for example, so you're going to have to deal with that ambiguity as a  
part of normal operation.

For example, if the FS-caching client is disconnected or powered off  
when a remote rename occurs that replaces a file it has cached, the  
client will have an orphaned item left over.  Maybe this use case is  
only a garbage collection problem.

>      The two simplest solutions to the coherency problem are (a) to  
> enforce
>      sharing at all levels (superblocks, inodes, cache objects),  
> (b) to enforce
>      non-sharing.  In-between states are possible, but are much  
> trickier and
>      more complex.
>
>      Note that cache coherency management can't be entirely  
> avoided: upon
>      reconnection a cache object has to be checked against the  
> server to see
>      whether it's still valid.

How do you propose to do that?

First, clearly, FS-cache has to know that it's the same object, so  
fsid and filehandle have to be the same (you refer to that as the  
"reconnection problem", but it may generally be a "cache aliasing  
problem").

I assume FS-cache has a record of the state of the remote file when  
it was last connected -- mtime, ctime, size, change attribute (I'll  
refer to this as the "reconciliation problem")?  Does it, for  
instance, checksum both the cache item and the remote file to detect  
data differences?

You have the same problem here as we have with file system search  
tools such as Beagle.  Reconciling file contents after a reconnection  
event may be too expensive to consider for NFS, especially if a file  
is terabytes in size.

> Note that both these problems only really exist because the cache is
> persistent between mounts.  If it were volatile between mounts,  
> then (1) would not exist, and (2) can be ignored as it is now.

Do you allow administrators to select whether the FS-cache is  
persistent?  Or is it always unconditionally persistent?

An adequate first pass at FS-cache can be done without guaranteeing  
persistence.  There are a host of other issues that need exposure --  
steady-state performance; cache garbage collection and reclamation;  
cache item aliasing; whether all files on a mount point should be  
cached on disk, or some in memory and some on disk; and so on -- that  
can be examined without even beginning to worry about reboot recovery.

And what would it harm if FS-cache decides that certain items in its  
cache have become ambiguous or otherwise unusable after a  
reconnection event, thus it reclaims them instead of re-using them?

> There are three obvious ways of dealing with the problems (ignoring  
> the fact
> that all cases have on-reconnection coherency to deal with whatever):
>
>  (a) No sharing at all.
>
>      Cache coherency is what it is now with NFS, but reconnection  
> must be
>      managed.  A key must be generated to each mount to distinguish  
> that mount
>      from an overlapping mount that might contain the same files.
>
>      These keys must be unique (and uniqueness must be enforced)  
> unless two
>      superblocks are guaranteed disjoint (eg: on different  
> servers), or are
>      guaranteed to share anyway (eg: exact same parameter sets and  
> nosharecache
>      not specified).
>
>  (b) Fully shared.
>
>      Cache coherency is a non-issue.  Reconnection is a non-issue.   
> Any
>      particular server inode is guaranteed to be represented by a  
> single inode
>      on the client, both in the superblock and the pagecache, and  
> by a single
>      FS-Cache cache object.
>
>      The downside of this is that sharing must take priority over  
> different
>      connection parameters.  R/O vs R/W can be dealt relatively  
> easily as I
>      believe it's a local phenomenon, and is dealt with before the  
> filesystem
>      is consulted.  There are patches to do this.
>
>  (c) Implicit full sharing between cached mountpoints; uncached  
> mountpoints
>      need not be shared.
>
>      Cached mountpoints have the properties of (b), uncached  
> mountpoints are
>      left to themselves.
>
> Note that redundant disk usage is undesirable, but unlikely to  
> cause a real
> problem, such as an oops.  Non-unique keys, on the other hand, are  
> a problem.
>
> Having non-shared local inodes sharing cache objects causes even  
> more problems,
> and I don't want to go there.
>
>> Nothing you say in the rest of your proposal convinces me that having
>> multiple caches for the same export is really more than a  
>> theoretical issue.
>
> Okay.  So how do you do reconnection?
>
> The simplest way from what I see is to require that the  
> administrator specify
> everything, but this is probably not what you want if you're  
> distributing NFS
> mounts by NIS, say.

Automatic configuration is preferred.  For example, NFS with Kerberos  
has an administrative scaling problem because some local  
administration (creating a keytab and registering the client with  
KDC) is required for every client that joins a realm.

> The next simplest way is to bind all the differentiation parameters  
> (see
> nfs_compare_mount_options()) into a key and use that, plus a  
> uniquifier from
> the administrator if NFS_MOUNT_UNSHARED is set.

It gives us the proper legacy behavior, but as soon as the  
administrator changes a mount option, all previously cached items for  
that mount point become orphans.

>> As useful as the feature is, one can also argue that mounting the  
>> same export
>> multiple times is infrequent in most normal use cases.    
>> Practically speaking,
>> why do we really need to worry about it?
>
> Because it's possible.  Because it has to be considered.  Because,  
> as you said,
> people do it.  Because if I don't deal with it, the kernel will  
> oops when NFS
> asks FS-Cache to do something it doesn't support.
>
> I can't just say: "Well, it'll oops if you configure your NFS  
> shares like that,
> so don't.  It's not worth me implementing round it.".

What causes that instability?  Why can't you insulate against the  
instability but allow cache incoherence and aliased cache items?

Local file systems are fraught with cases where they protect their  
internal metadata aggressively at the cost of not keeping the disk up  
to date with the memory version of the file system.

Similar compromises might benefit FS-cache.  In other words, FS-cache  
for NFS file systems may be less functional than for, say, AFS, to  
allow the cache to operate reliably.

>> The real problem here is that the NFS protocol itself does not  
>> support strong
>> cache coherence.  I don't see why the Linux kernel  must fix that  
>> problem.
>
> So you're arguing there shouldn't be local caching for NFS?  Or  
> that there
> shouldn't be persistent local caching for NFS?

I'm arguing that cache coherence isn't supported by the NFS protocol,  
so how can FS-cache *require* a facility to support persistent local  
caching that the protocol doesn't have in the first place?

NFS client implementations do the best they can; there are always  
scenarios where coherence issues cause behavior no-one expects.   
Usually NFS clients handle ambiguous cases by invalidating their  
caches.  Invalidating is cheap for in-memory caches.  Frequent  
invalidation is going to be expensive for FS-cache, since it requires  
some disk I/O (and perhaps even file truncation).  One reason why  
chunk caching is better than whole-file caching is that it bounds the  
time and effort to recycle a cache item.

AFS assigns universally unique identities to servers, volumes, and  
files.  NFS doesn't guarantee unique identities to servers or  
exports, and file handles are supposed to be unique only on a given  
server [*].  And unfortunately file handles can be re-used by the  
server without any indication to the client that the file handle it  
has cached is no longer the same file (see the "out_fileid" label in  
fs/nfs/inode.c:nfs_update_inode).  AFS provides client-visible  
generation IDs in its inode numbers for this case.

Thus NFS itself does not provide any good way to help you sort FS-cache
cache items outside of a single export.  A proper FS-cache implementation
thus cannot depend on server/export identity to guarantee the singularity
of cache items.

So FS-cache will have a hard time guaranteeing that there is only one  
item in its cache that maps to a given NFS server file.  It may also  
be difficult to guarantee that multiple NFS server files do not map  
onto the same local cache item (file handle re-use).

This suggests to me that the cache aliasing problem is unsolvable for  
NFS, so you should find a way to make FS-cache work in a world where  
cache aliasing is a fact of life.

>> Lastly, there's already a mount option that allows admins to  
>> control whether
>> the page and attribute caches are shared -- "sharecache".  Is   
>> this mount
>> option not adequate for persistent caching?
>
> Adequate in what way?  It doesn't currently automatically guarantee  
> sharing of
> overlapping superblocks.  It merely disables nonsharecache which  
> explicitly
> disables cache sharing.

The current problem with "sharecache" is that the mount options on  
subsequent mounts of the same export are silently ignored.  You are  
proposing the same behavior for FS-cache-managed mount points, which  
means we're spreading bad UI behavior further.

At least there should be a warning that explains why a file system  
that was mounted with "rw,noac,tcp" is behaving like it's "ro,ac,udp".

Ideally, if we must have cache sharing, the behavior should be: if  
the mount options, the server, and the fsid are the same, then the  
cache should be shared.  If any of that tuple are different, then a  
unique cache is used for that mount point (within the limits of being  
able to determine the unique identity of a server and export).
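
A minimal sketch of that rule - every type and helper here is hypothetical,
it simply names the tuple being compared:

static bool may_share_cache(const struct cached_mount *existing,
			    const struct sockaddr *server,
			    const struct nfs_fsid *fsid,
			    const struct nfs_mount_opts *opts)
{
	return same_server(&existing->server, server) &&
	       existing->fsid.major == fsid->major &&
	       existing->fsid.minor == fsid->minor &&
	       same_mount_options(&existing->opts, opts);
}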

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

[*] Section 4 of RFC 3530 states:

The filehandle in the NFS protocol is a per server unique identifier  
for a filesystem object.  The contents of the filehandle are opaque  
to the client.  Therefore, the server is responsible for translating  
the filehandle to an internal representation of the filesystem object.


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-06  1:22 ` David Howells
  2007-12-06 18:28   ` Chuck Lever
@ 2007-12-06 20:00   ` David Howells
  2007-12-07 17:59     ` Chuck Lever
  2007-12-08  0:52     ` David Howells
  1 sibling, 2 replies; 9+ messages in thread
From: David Howells @ 2007-12-06 20:00 UTC (permalink / raw)
  To: Chuck Lever
  Cc: dhowells, Peter Staubach, Trond Myklebust, nfsv4, linux-kernel

Chuck Lever <chuck.lever@oracle.com> wrote:

> Why not use the fsid as well?  The NFS client already uses the fsid to detect
> when it is crossing a server-side mount point.

Why use the FSID at all?  The file handles are supposed to be unique per
server.

> I also note the inclusion of server IP address in the key.  For multi-homed
> servers, you have the same unavoidable cache aliasing issues if the client
> mounts the same server and export via different server network interfaces.

I'm aware of this, but unless there's:

 (a) a way to specify a logical server group to the kernel, and

 (b) a guarantee that the file handles of each member of the logical group are
     common across the group

there's nothing I can do about it.

AFS deals with these by making servers second class citizens, and defining
"file handles" to be a set within the cell space.

Besides, I can use the IP address of the server as a key.  I just have to hope
that the IP address doesn't get transferred to a different server because, as
far as I know, there isn't any way to detect this in the NFS protocol.

> Is it a problem because, if there are multiple copies of the same remote file
> in its cache, then FS-cache doesn't know, upon  reconnection, which item to
> match against a particular remote file?

There are multiple copies of the same remote file that are described by the
same remote parameters.  Same IP address, same port, same NFS version, same
FSID, same FH.  The difference may be a local connection parameter.

> I think that's actually going to be a fairly typical situation -- 
> you'll have conditions where some cache items will become orphaned, for
> example, so you're going to have to deal with that ambiguity as a  part of
> normal operation.

Orphaned stuff in the cache is eventually culled by cachefilesd when there's
space pressure in the cache.

> For example, if the FS-caching client is disconnected or powered off when a
> remote rename occurs that replaces a file it has cached, the  client will have
> an orphaned item left over.  Maybe this use case is  only a garbage collection
> problem.

Rename isn't a problem provided the FH doesn't change.  NFS effectively caches
inodes, not files.  If the remote file is deleted, then either NFS will try
opening it, will fail and will tell the cache to evict it; or the remote file
will never be opened again and the garbage in the cache will be culled
eventually.  It may even hang around for ever, but if the FH is re-used, the
cache object will be evicted based on mtime + ctime + filesize being
different.

If someone tries hard enough, they can probably muck up the cache, but there's
not a lot I can do about that.

> >      Note that cache coherency management can't be entirely avoided: upon
> >      reconnection a cache object has to be checked against the server to see
> >      whether it's still valid.
> 
> How do you propose to do that?

For NFS, check mtime + ctime + filesize upon opening.  It's in the patch
already.

For AFS there's a data version number.
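
For illustration, the NFS check amounts to something like the following; the
auxiliary-data structure here is a sketch, not necessarily the layout the
patch actually uses:

struct nfs_cache_aux_sketch {
	struct timespec	mtime;
	struct timespec	ctime;
	u64		size;
};

static bool cache_object_still_valid(const struct nfs_cache_aux_sketch *aux,
				     const struct nfs_fattr *fattr)
{
	return timespec_equal(&aux->mtime, &fattr->mtime) &&
	       timespec_equal(&aux->ctime, &fattr->ctime) &&
	       aux->size == fattr->size;
}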

> First, clearly, FS-cache has to know that it's the same object, so fsid and
> filehandle have to be the same (you refer to that as the  "reconnection
> problem", but it may generally be a "cache aliasing  problem").

FSID is not required.  FH has to be unique per server according to Trond.

> I assume FS-cache has a record of the state of the remote file when it was
> last connected -- mtime, ctime, size, change attribute (I'll  refer to this as
> the "reconciliation problem")?

mtime + ctime + size, yes.  I should add the change attribute if it's present,
I suppose, but that ought to be simple enough.

> Does it, for instance, checksum both the cache item and the remote file to
> detect data differences?

No.  That would be horrendously inefficient.  Besides, if we're going to
checksum the remote file each time, what's the point in having a persistent
cache?

> You have the same problem here as we have with file system search tools such
> as Beagle.  Reconciling file contents after a reconnection  event may be too
> expensive to consider for NFS, especially if a file  is terabytes in size.

Because NFS v2 and v3 don't support proper coherency, there's a limited amount
we can do without being silly about it.  You just have to hope someone doesn't
wind back the clock on the server in order to fudge the ctime to give your
cache conniptions.  But if someone's willing to go to such lengths, you're
stuffed anyway.

I have to make some assumptions about what I can do.  They're probably
reasonable, but there's no guarantee it won't malfunction due to the speed of
today's networks vs the granularity of the time stamps.

If you don't want to take the risk, don't use persistent caching or don't use
NFS2 or 3.

> Do you allow administrators to select whether the FS-cache is persistent?  Or
> is it always unconditionally persistent?

The cache is persistent.  If you don't want it to be persistent, you can have
init delete it during boot.  It would be very easy to have NFS tell the cache
to discard each object as it ditches the inode that's using it.  It already
has to do this anyway, but it could be configured, for example, through the
NFS mount options.

> An adequate first pass at FS-cache can be done without guaranteeing
> persistence.

True.  But it's not particularly interesting to me in such a case.

> There are a host of other issues that need exposure -- steady-state
> performance;

Meaning what?

I have been measuring the performance improvement and degradation numbers, and
I can say that if you've one client and one server, the server has all the
files in memory, and there's gigabit ethernet between them, an on-disk cache
really doesn't help.

Basically, the consideration of whether to use a cache is a compromise between
a host of factors.

> cache garbage collection

Done.

> and reclamation;

Done.

> cache item aliasing;

Partly done.

> whether all files on a mount point should be cached on disk, or some in
> memory and some on disk;

I've thought about that, but no-one seems particularly interested in
discussing it.

> and so on -- that can be examined without even beginning to worry about
> reboot recovery.

Yet persistence, if we're going to have it, needs to be considered up front,
lest you have to go and completely rewrite everything later.

Persistence requires one thing: a unique key for each object.

> And what would it harm if FS-cache decides that certain items in its cache
> have become ambiguous or otherwise unusable after a  reconnection event, thus
> it reclaims them instead of re-using them?

It depends.

At some point I'd like to make disconnected operation possible, and that means
storing data to be written back in the cache.  You can't necessarily just
chuck that away.

But apart from that, that's precisely what the current caching code does.
It's at liberty to discard any clean part of the cache.

> Automatic configuration is preferred.

Indeed.  Furthermore, I'd rather not have the fscache parameters in the NFS
mount options at all, but specified separately.  This would permit
per-directory controls, say.

> > The next simplest way is to bind all the differentiation parameters (see
> > nfs_compare_mount_options()) into a key and use that, plus a uniquifier from
> > the administrator if NFS_MOUNT_UNSHARED is set.
> 
> It gives us the proper legacy behavior, but as soon as the administrator
> changes a mount option, all previously cached items for  that mount point
> become orphans.

That would be unavoidable.

> > I can't just say: "Well, it'll oops if you configure your NFS shares like
> > that,
> > so don't.  It's not worth me implementing round it.".
> 
> What causes that instability?  Why can't you insulate against the instability
> but allow cache incoherence and aliased cache items?

Insulate how?  The only way to do that is to add something to the cache key
that says that these two otherwise identical items are actually different
things.

> Local file systems are fraught with cases where they protect their internal
> metadata aggressively at the cost of not keeping the disk up  to date with the
> memory version of the file system.

I don't see that this is applicable.

> Similar compromises might benefit FS-cache.  In other words, FS-cache for NFS
> file systems may be less functional than for, say, AFS, to  allow the cache to
> operate reliably.

Firstly, I'd rather not start adding special exceptions unless I absolutely
have to.  Secondly, you haven't actually shown any compromises that might be
useful.

> I'm arguing that cache coherence isn't supported by the NFS protocol, so how
> can FS-cache *require* a facility to support persistent local  caching that
> the protocol doesn't have in the first place?

NFS has just enough to just about support a persistent local cache for
unmodified files.  It has unique file keys per server, and it has a (limited)
amount of coherency data per file.  That's not really the problem.

The problem is that the client can create loads of different views of a remote
export and the kernel treats them as if they're views of different remote
exports.  These views do not necessarily have *anything* to distinguish them
at all (nosharecache option).  This is a local phenomenon, and not really
anything to do with the server.

Now, for the case of cached clients, we can enforce a reduction of incoherency
by requiring that one remote inode map to a single client inode if that inode
is going to be placed in the persistent cache.

> NFS client implementations do the best they can;

This is no different.  I'm trying to do the best I can, even though it's not
fully supported.

> Usually NFS clients handle ambiguous cases by invalidating their caches.

This is no different.

> Invalidating is cheap for in-memory caches.  Frequent invalidation is going
> to be expensive for FS-cache, since it requires some disk I/O (and perhaps
> even file truncation).

So what?  That's one of the compromises you have to make if you want an
on-disk cache.  The invalidation is asynchronous anyway.  The cachefiles
kernel module renames the dead item into the graveyard directory, and
cachefilesd wakes up sometime later and deletes it.

> One reason why chunk caching is better than whole-file caching is that it
> bounds the time and effort to recycle a cache item.

That's a whole different kettle of miscellaneous swimming things, and also
besides the point.  Yes, I realise chunk caching is more efficient under some
circumstances, but not all.  However, I think that's something that NFS has to
handle, not FS-Cache as NFS is the one that can spam the server.

> AFS assigns universally unique identities to servers, volumes, and files.  NFS
> doesn't guarantee unique identities to servers or  exports, and file handles
> are supposed to be unique only on a given  server [*].  And unfortunately file
> handles can be re-used by the  server without any indication to the client
> that the file handle it  has cached is no longer the same file (see the
> "out_fileid" label in  fs/nfs/inode.c:nfs_update_inode).  AFS provides
> client-visible  generation IDs in its inode numbers for this case.

Yes.  That's one of the compromises you have to make just by using NFS at all.
It's not limited to NFS + FS-Cache.

> Thus NFS itself does not provide any good way to help you sort FS-cache
> cache items outside of a single export.  A proper FS-cache implementation
> thus cannot depend on server/export identity to guarantee the singularity of
> cache items.

What do you mean by server/export identity?  I'm under the impression that
FHs are unique per-server, and that exports don't have anything to do with
it.

> So FS-cache will have a hard time guaranteeing that there is only one item
> in its cache that maps to a given NFS server file.  It may also be difficult
> to guarantee that multiple NFS server files do not map onto the same local
> cache item (file handle re-use).

Anything that applies to NFS + FS-Cache here *also* applies to NFS itself.

Yes, I realise that the assumptions may get violated, but that's one of those
compromises you have to make if you want to use NFS.

> This suggests to me that the cache aliasing problem is unsolvable for NFS,
> so you should find a way to make FS-cache work in a world where cache
> aliasing is a fact of life.

So you shouldn't use NFS at all is what you're saying?

Cache aliasing is a pain and has to be dealt with one way or another.  My
preferred solution is to *reduce* the amount of aliasing that occurs on the
client.  It is possible.  The code is already in the vanilla kernel to do
this.

The other extreme is to manually tag cache objects to distinguish otherwise
indistinguishable, unshareable views of a remote file.

Note that there are just three aspects that need to be considered with regard
to managing coherence with the server:

 (1) Server identification.

     How do you detect that the server you were talking to before is still the
     same server?  As far as I know, with NFS, you can't.  You just have to
     hope.

 (2) Remote file identification.

     How do you detect that the file on that server you were accessing before
     is still the same file?

 (3) Remote file data version identification.

     How do you detect that the data in that file on that server you were
     using before is still the same data?

I currently combine points (2) and (3) by checking that the combination of FH,
mtime, ctime and file size is the same.  If it is not, the cache object is not
found or is discarded and a new one is made.  This could be improved for NFS
by using the change attribute if available.

These determine whether persistent caching is possible at all.
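
As a rough illustration of the check described above (the structure and
function names here are made up for the example; this is not the actual
FS-Cache interface):

	#include <stdint.h>
	#include <string.h>
	#include <time.h>

	/* Illustrative only: the coherency data recorded with a cache object.
	 * The FH forms part of the object key; mtime, ctime and size are
	 * compared on reconnection and the object is discarded if they
	 * don't match. */
	struct nfs_cache_coherency {
		unsigned char	fh[64];		/* NFSv3 FHs are at most 64 bytes */
		size_t		fh_len;
		struct timespec	mtime;
		struct timespec	ctime;
		uint64_t	size;
	};

	static int cache_object_still_valid(const struct nfs_cache_coherency *old,
					    const struct nfs_cache_coherency *new)
	{
		return old->fh_len == new->fh_len &&
		       memcmp(old->fh, new->fh, old->fh_len) == 0 &&
		       old->mtime.tv_sec  == new->mtime.tv_sec &&
		       old->mtime.tv_nsec == new->mtime.tv_nsec &&
		       old->ctime.tv_sec  == new->ctime.tv_sec &&
		       old->ctime.tv_nsec == new->ctime.tv_nsec &&
		       old->size == new->size;
	}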

The views set up by the sysadmin are artificially created incoherency
problems.  They're still valid, obviously, judging by the cries of anguish
when they're preemptively ignored in all cases.

> The current problem with "sharecache" is that the mount options on
> subsequent mounts of the same export are silently ignored.  You are
> proposing the same behavior for FS-cache-managed mount points, which means
> we're spreading bad UI behavior further.

The whole point of "sharecache" is to make FS-Cache support easier by reducing
the coherency issues to something that's manageable by simple methods.  That's
why I wrote the patches for it.

However, given that no sysadmin can use local caching on Linux until these
patches come along, why is it a problem to impose the added restriction that
fscached superblocks must also be shared?

> At least there should be a warning that explains why a file system that was
> mounted with "rw,noac,tcp" is behaving like it's "ro,ac,udp".

Perhaps.

Note again that R/O and R/W should be handled elsewhere, unless they're
somehow reflected on the network?

> Ideally, if we must have cache sharing, the behavior should be: if the mount
> options, the server, and the fsid are the same, then the cache should be
> shared.  If any of that tuple are different, then a unique cache is used for
> that mount point (within the limits of being able to determine the unique
> identity of a server and export).

Yes.  This is more or less what I proposed and you've been objecting to.  At
least, I think you have.

I'm not sure that FSID is really relevant.  It doesn't add any useful
information, and can be changed by the server relatively easily - at least it
can on Linux under some circumstances.

Would you say that specifying "nosharecache" should be verboten if "fsc" is
also requested?  Otherwise, admin intervention is absolutely required in the
following case:

	mount warthog:/a /a -o nosharecache,fsc
	mount warthog:/a /b -o nosharecache,fsc

as there's no way to distinguish between the two mounts.

David


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-06 20:00   ` David Howells
@ 2007-12-07 17:59     ` Chuck Lever
  2007-12-08  0:52     ` David Howells
  1 sibling, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2007-12-07 17:59 UTC (permalink / raw)
  To: David Howells; +Cc: Peter Staubach, Trond Myklebust, nfsv4, linux-kernel

Hi David-

[ Some history snipped... ]

On Dec 6, 2007, at 3:00 PM, David Howells wrote:
> Chuck Lever <chuck.lever@oracle.com> wrote:
>> Is it a problem because, if there are multiple copies of the same remote
>> file in its cache, then FS-cache doesn't know, upon reconnection, which
>> item to match against a particular remote file?
>
> There are multiple copies of the same remote file that are described by the
> same remote parameters.  Same IP address, same port, same NFS version, same
> FSID, same FH.  The difference may be a local connection parameter.

Why not encode the local mounted-on directory in the key?  A cryptographic
hash of the directory's absolute pathname would be bounded in size.  And the
mounted-on directory is usually persistent across client reboots.

That way you can use the directory name hash to distinguish the different
views of the same remote object.
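
As a sketch of what I mean (the hash here is FNV-1a purely to keep the
example self-contained; a real implementation would presumably use a proper
cryptographic digest, and the key layout is only illustrative):

	#include <stdint.h>

	/* Fold the mounted-on directory's absolute pathname into the cache
	 * key so that two views of the same export get distinct cache
	 * objects. */
	static uint64_t hash_mountpoint(const char *abs_path)
	{
		uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */

		while (*abs_path) {
			h ^= (unsigned char)*abs_path++;
			h *= 0x100000001b3ULL;		/* FNV prime */
		}
		return h;
	}

	/* The key would then look something like:
	 *   { server IP, port, NFS version, FSID, FH, hash_mountpoint("/a") }
	 */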

>> An adequate first pass at FS-cache can be done without guaranteeing
>> persistence.
>
> True.  But it's not particularly interesting to me in such a case.
>
>> There are a host of other issues that need exposure -- steady-state
>> performance;
>
> Meaning what?

Meaning your cache is at quota all the time, and to continue operation it
must eject items constantly.

This is a scenario where it pays to cache the read-mostly items on disk, and
leave the frequently changing items in memory.

The economics of disk caches are different from those of memory caches.  Disk
caches are much larger and cheaper, but their performance tanks when they
have to track frequently changing files.  Memory caches are smaller, but
tracking frequently changing data is only a little more expensive than
tracking data that doesn't change often.

> I have been measuring the performance improvement and degradation numbers,
> and I can say that if you've one client and one server, the server has all
> the files in memory, and there's gigabit ethernet between them, an on-disk
> cache really doesn't help.
>
> Basically, the consideration of whether to use a cache is a compromise
> between a host of factors.
>
>> cache garbage collection
>
> Done.
>
>> and reclamation;
>
> Done.
>
>> cache item aliasing;
>
> Partly done.
>
>> whether all files on a mount point should be cached on disk, or some in
>> memory and some on disk;
>
> I've thought about that, but no-one seems particularly interested in
> discussing it.

I think it's key to preventing FS-cache from making performance worse in
many common scenarios.

>> And what would it harm if FS-cache decides that certain items in its cache
>> have become ambiguous or otherwise unusable after a reconnection event,
>> thus it reclaims them instead of re-using them?
>
> It depends.
>
> At some point I'd like to make disconnected operation possible, and that
> means storing data to be written back in the cache.  You can't necessarily
> just chuck that away.

Disconnected operation for NFS is fraught with challenges.  Access to data
on servers is traditionally gated by the client's IP address, for example.
The client may disconnect from the network, then reconnect using a different
address where suddenly all of its accesses are rebuffed.

NFS servers, not clients, traditionally determine the file's mtime and
ctime, and its file handle.  So file updates and file creation become
problematic.  The client has to reconcile the server's file handle, for
files created offline, with its own when reconnecting.

And, for disconnected operation, the cache is required to contain every item
from the remote.  You can't just drop items from the cache because they are
inconvenient.

>>> I can't just say: "Well, it'll oops if you configure your NFS shares like
>>> that, so don't.  It's not worth me implementing round it."
>>
>> What causes that instability?  Why can't you insulate against the
>> instability but allow cache incoherence and aliased cache items?
>
> Insulate how?  The only way to do that is to add something to the cache key
> that says that these two otherwise identical items are actually different
> things.

That something might be the pathname of the mounted-on directory or of the
file itself.

>> I'm arguing that cache coherence isn't supported by the NFS protocol, so
>> how can FS-cache *require* a facility to support persistent local caching
>> that the protocol doesn't have in the first place?
>
> NFS has just about enough to support a persistent local cache for
> unmodified files.  It has unique file keys per server, and it has a
> (limited) amount of coherency data per file.  That's not really the problem.
>
> The problem is that the client can create loads of different views of a
> remote export and the kernel treats them as if they're views of different
> remote exports.  These views do not necessarily have *anything* to
> distinguish them at all (nosharecache option).

Yes, they do.  The combination of mount options and mounted-on directory (or
local pathname to the file) gives you a unique identity for that view.

> Now, for the case of cached clients, we can enforce a reduction of
> incoherency by requiring that one remote inode map to a single client inode
> if that inode is going to be placed in the persistent cache.

That seems reasonable.  Just don't cache the second and greater instances of
the same remote file if FS-cache can't handle local aliases.

>> Invalidating is cheap for in-memory caches.  Frequent invalidation is
>> going to be expensive for FS-cache, since it requires some disk I/O (and
>> perhaps even file truncation).
>
> So what?  That's one of the compromises you have to make if you want an
> on-disk cache.  The invalidation is asynchronous anyway.

So an item is cached in memory until space becomes available in the disk
cache?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


* Re: How to manage shared persistent local caching (FS-Cache) with NFS?
  2007-12-06 20:00   ` David Howells
  2007-12-07 17:59     ` Chuck Lever
@ 2007-12-08  0:52     ` David Howells
  1 sibling, 0 replies; 9+ messages in thread
From: David Howells @ 2007-12-08  0:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: dhowells, Peter Staubach, Trond Myklebust, nfsv4, linux-kernel

Chuck Lever <chuck.lever@oracle.com> wrote:

> Why not encode the local mounted-on directory in the key?

Can't.  Namespaces.  chroot.

> Meaning your cache is at quota all the time, and to continue operation it must
> eject items constantly.

I've thought about that, thank you.  Go and read the documentation.  There's
configurable hysteresis in the culling algorithm.

> This is a scenario where it pays to cache the read-mostly items on disk, and
> leave the frequently changing items in memory.

Currently any file which is opened for writing is automatically ejected from
the cache.

> The economics of disk caches are different from those of memory caches.
> Disk caches are much larger and cheaper, but their performance tanks when
> they have to track frequently changing files.  Memory caches are smaller,
> but tracking frequently changing data is only a little more expensive than
> tracking data that doesn't change often.

I'm aware of all that.  My OLS slides and paper can be found here:

	http://people.redhat.com/~dhowells/fscache/fscache-ols2006.odp
	http://people.redhat.com/~dhowells/fscache/FS-Cache.pdf

Lots of small files also hurt more than fewer big files in some ways.  Lots
more metadata in the cache.  On the other hand, fragmentation is less of a
problem.

Anyway, this is straying off the main topic.

> I think it's key to preventing FS-cache from making performance worse in many
> common scenarios.

Perhaps.  The problem is that NFS doesn't know what the access pattern on a
file is expected to be.  I've been asked to provide fine-grained cache
controls (perhaps at directory level), but Al Viro was, erm, lukewarm in his
reception of that idea.

Gathering statistical data dynamically has performance penalties of its
own. :-/

> Disconnected operation for NFS is fraught with challenges.  Access to data
> on servers is traditionally gated by the client's IP address, for example.
> The client may disconnect from the network, then reconnect using a
> different address where suddenly all of its accesses are rebuffed.

Agreed, but isn't that one of the design goals for NFS4?

It's also something of interest to other netfs's that might want to use
FS-Cache.  This isn't an NFS-only facility.

> NFS servers, not clients, traditionally determine the file's mtime and
> ctime, and its file handle.  So file updates and file creation become
> problematic.  The client has to reconcile the server's file handle, for
> files created offline, with its own when reconnecting.

Yes.  Basically it's a major can of worms.  Doesn't stop people wanting it,
though.

> And, for disconnected operation, the cache is required to contain every
> item from the remote.  You can't just drop items from the cache because
> they are inconvenient.

Yes.  That's what pinning and reservations are for.

Currently, support for disconnected operations is an idea I'd like to have,
but is otherwise mostly non-existent.

> That something might be the pathname of the mounted-on directory or of the
> file itself.

See above.

> Yes, they do.  The combination of mount options and mounted-on directory
> (or local pathname to the file) gives you a unique identity for that view.

See above.

> So an item is cached in memory until space becomes available in the disk
> cache?

The item isn't considered for caching until space becomes available in the
disk cache.  It's put on a queue for potential caching, but won't actually
be stored if it gets discarded from the icache or pagecache first.

It's unfortunate, but with a fast network you can download data faster than
you can make space in the cache.  unlink() and rmdir() are (a) slow and (b)
synchronous.  Each unlink() or rmdir() operation requires a task to perform
it, and that task is committed until the op finishes.

I could actually improve cachefilesd (the userspace cache culler) by giving it
multiple threads.

However, I've noticed that having cachefilesd do lots of parallel,
synchronous, journalled disk ops hurts performance in other ways. :-/

Again, hysteresis is available.  We stop writing stuff into the cache once a
limit is reached, and don't resume until usage has dropped sufficiently below
that limit that we've got a good go at writing a load of new stuff, rather
than just a block here and a block there.
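
To illustrate, a simplified sketch of that two-limit hysteresis (the
structure, names and thresholds are invented for the example; the real
limits are the configurable ones mentioned above):

	#include <stdbool.h>

	/* Illustrative only: refuse new cache writes once free space falls
	 * below the stop threshold, and don't resume until culling has
	 * recovered enough space to write a decent batch rather than a
	 * block here and a block there. */
	struct cache_space {
		unsigned long	free_blocks;
		unsigned long	stop_below;	/* suspend caching below this */
		unsigned long	resume_above;	/* resume caching above this */
		bool		suspended;
	};

	static bool may_cache_new_object(struct cache_space *c)
	{
		if (c->suspended && c->free_blocks >= c->resume_above)
			c->suspended = false;
		else if (!c->suspended && c->free_blocks < c->stop_below)
			c->suspended = true;

		return !c->suspended;
	}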

It's all very icky, and depends as much on the filesystem underlying the
cache (ext3 for example) and *its* configuration as on the characteristics
of the netfs and the network link.  It's all about compromise.

David

