linux-nfs.vger.kernel.org archive mirror
* Redesigning and modernising fscache
@ 2021-01-14 10:45 David Howells
  2021-01-14 16:19 ` Matthew Wilcox
  2021-01-18 23:36 ` Cut down implementation of fscache new API David Howells
  0 siblings, 2 replies; 3+ messages in thread
From: David Howells @ 2021-01-14 10:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-cachefs
  Cc: dhowells, jlayton, dwysocha, Matthew Wilcox, Dominique Martinet,
	Steve French, Trond Myklebust, Anna Schumaker, Christoph Hellwig,
	dchinner, Linus Torvalds, v9fs-developer, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, linux-kernel

Hi,

I've been working on modernising fscache, primarily with help from Jeff Layton
and Dave Wysochanski with porting Ceph and NFS to it, and with Willy helpfully
reinventing the VM I/O interface beneath us ;-).

However, there've been some objections to the approach I've taken to
implementing this.  The way I've done it is to disable the use of fscache by
the five network filesystems that use it, remove much of the old code, put in
the reimplementation, then cut the filesystems over.  I.e. rip-and-replace.
It leaves unported filesystems unable to use it - but three of the five are
done (afs, ceph, nfs), and I've supplied partially-done patches for the other
two (9p, cifs).

It's been suggested that it's too hard to review this way and that either I
should go for a gradual phasing in or build the new one in parallel.  The
first is difficult because I want to change how almost everything in there
works - but the parts are tied together; the second is difficult because there
are areas that would *have* to overlap (the UAPI device file, the cache
storage, the cache size limits and at least some state for managing these), so
there would have to be interaction between the two variants.  One refinement
of the latter would be to make the two implementations mutually exclusive: you
can build one or the other, but not both.

However.  Given that I want to replace the on-disk format in cachefiles at
some point, and change what the userspace component does, maybe I should
create a new, separate UAPI interface and do the on-disk format change at the
same time.  In which case, it makes sense to build a parallel variant.


Anyway, a bit of background into the why.  There are a number of things that
need to be fixed in fscache/cachefiles:

 (1) The use of bmap to test whether the backing fs contains a cache block.
     This is not reliable in a modern extent-based filesystem as it can insert
     and remove bridging blocks of zeros at will.

     Having discussed this with Christoph Hellwig and Dave Chinner, I think
     that the cache really needs to keep track of this for itself.

 (2) The use of pagecache waitlist snooping to find out if a backing
     filesystem page has been updated yet.  I have the feeling that this is
     not 100% reliable, judging by some hard-to-track-down bugs that seem to
     relate to it.

     I really would rather be told directly by the backing fs that the op was
     complete.  Switching over to kiocbs means that can be done.

 (3) Having to go through the pagecache attached to the backing file, copying
     data from it or using vfs_write() to write into it.  This doubles the
     amount of pagecache required and adds a bunch of copies for good measure.

     When I wrote the cachefiles caching backend, using direct I/O from within
     the kernel wasn't possible - but, now that kiocbs are available, I can
     actually do async DIO from the backing files to/from the netfs pages,
     cutting out copies in both directions, and using the kiocb completion
     function to tell me when it's done.

 (4) fscache's structs have a number of pointers back into the netfs, which
     makes it tricky if the netfs instance goes away whilst the cache is
     active.

     I really want no pointers back - apart from very transient I/O completion
     callbacks.  I can store the metadata I need in the cookie.
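To make point (1) above concrete, here's a userspace Python sketch (not kernel code) of the alternative: instead of asking the backing fs via bmap() whether a block is present, the cache keeps its own record of which granules it has written.  The granule size and the in-memory set standing in for an on-disk content map are illustrative assumptions, not the real cachefiles format.

```python
GRANULE = 256 * 1024  # assumed cache granule size, for illustration only

class ContentMap:
    def __init__(self):
        self.present = set()  # granule indices known to hold valid data

    def mark_written(self, pos, length):
        # Record every granule covered by a completed cache write.
        first = pos // GRANULE
        last = (pos + length - 1) // GRANULE
        for g in range(first, last + 1):
            self.present.add(g)

    def covers(self, pos):
        # A reliable answer, independent of whether the backing fs has
        # inserted or removed bridging blocks of zeros behind our back.
        return pos // GRANULE in self.present

cmap = ContentMap()
cmap.mark_written(0, 512 * 1024)    # fills cache granules 0 and 1
print(cmap.covers(100 * 1024))      # True  - inside granule 0
print(cmap.covers(600 * 1024))      # False - granule 2 was never written
```

The key property is that only completed cache writes update the map, so the cache never mistakes a filesystem-inserted block of zeros for real data.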

Modernising this affords the opportunity to make huge simplifications in the
code (shaving off over 1000 lines, maybe as many as 3000).

One thing I've done is to make a helper library that handles a number of
features on behalf of a netfs if it wants to use the library:

 (*) Local caching.

 (*) Segmentation and shaping of read operations.

     This takes a ->readahead() request from the VM and translates it into one
     or more reads against the cache and the netfs, allowing both to
     adjust/expand the size of the individual subops according to internal
     alignments.

     Allowing the cache to expand a read request to put it on a larger
     granularity allows the cache to use less metadata to represent what it
     contains.

     It also provides a place to retry operations (something that's required
     if a read against the cache fails and we need to send it to the server
     instead).

 (*) Transparent huge pages (Willy).

 (*) A place to put fscrypt support (Jeff).
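The segmentation/shaping item can be modelled in a few lines of userspace Python: the VM asks for a byte range, and the cache is allowed to expand it outward to its own granule boundaries before subops are issued.  The function name and granule size below are invented for the sketch, not the real netfs helper API.

```python
CACHE_GRANULE = 256 * 1024  # assumed granularity of the cache's metadata

def expand_to_granules(start, end, granule=CACHE_GRANULE):
    """Round the half-open range [start, end) outward to whole granules."""
    new_start = (start // granule) * granule
    new_end = ((end + granule - 1) // granule) * granule
    return new_start, new_end

# A 4KiB readahead request landing in the middle of granule 1...
start, end = expand_to_granules(300 * 1024, 304 * 1024)
# ...becomes one whole-granule read, so the cache can represent what it
# holds with coarse per-granule metadata.
print(start, end)   # 262144 524288
```

In the real helpers both the cache and the netfs get a chance to adjust the request; this shows only the cache-side rounding.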

We have the first three working - with some limitations - for afs, nfs and
ceph, and I've produced partial patches for 9p and cifs.  afs, nfs and ceph
are able to handle xfstests with a cache now - something that makes the old
fscache code just explode.


So, as stated, much of that code is written and working.  However, if I do a
complete replacement all the way out to userspace, there are further changes
I'm thinking of making:

 (*) Get rid of the ability to remove a cache that's in use.  This accounts
     for a *lot* of the complexity in fscache: all the synchronisation
     required to effect the removal of a live cache at any time whilst it's
     actually being used.

 (*) Change cachefiles so that it uses an index file and a single data file,
     performing culling by marking the index rather than deleting data files.
     Culling would then be moved into the kernel.  cachefilesd is then
     unnecessary, except to load the config and keep the cache open.

     Moving the culling into an index would also make manual invalidation
     easier.

 (*) Rather than using cachefilesd to hold the cache open, do something akin
     to swapon/swapoff to add and remove the cache.
     
     Attempting to remove an in-use cache would either fail EBUSY or mark the
     cache to be removed when it becomes unused and not allow further new
     users.

 (*) Declare the size of the cache up front rather than requiring it to
     maintain a certain amount of free space, reducing the cache to make more
     space if the level drops.

 (*) Merge cachefiles into fscache.  Give up the ability to have alternate
     cache backends.  That would allow a bit more reduction in complexity and
     reduce the number of function-pointer indirections.

David


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Redesigning and modernising fscache
  2021-01-14 10:45 Redesigning and modernising fscache David Howells
@ 2021-01-14 16:19 ` Matthew Wilcox
  2021-01-18 23:36 ` Cut down implementation of fscache new API David Howells
  1 sibling, 0 replies; 3+ messages in thread
From: Matthew Wilcox @ 2021-01-14 16:19 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, linux-cachefs, jlayton, dwysocha,
	Dominique Martinet, Steve French, Trond Myklebust,
	Anna Schumaker, Christoph Hellwig, dchinner, Linus Torvalds,
	v9fs-developer, linux-afs, ceph-devel, linux-cifs, linux-nfs,
	linux-kernel

On Thu, Jan 14, 2021 at 10:45:06AM +0000, David Howells wrote:
> However, there've been some objections to the approach I've taken to
> implementing this.  The way I've done it is to disable the use of fscache by
> the five network filesystems that use it, remove much of the old code, put in
> the reimplementation, then cut the filesystems over.  I.e. rip-and-replace.
> It leaves unported filesystems unable to use it - but three of the five are
> done (afs, ceph, nfs), and I've supplied partially-done patches for the other
> two (9p, cifs).
> 
> It's been suggested that it's too hard to review this way and that either I
> should go for a gradual phasing in or build the new one in parallel.  The
> first is difficult because I want to change how almost everything in there
> works - but the parts are tied together; the second is difficult because there
> are areas that would *have* to overlap (the UAPI device file, the cache
> storage, the cache size limits and at least some state for managing these), so
> there would have to be interaction between the two variants.  One refinement
> of the latter would be to make the two implementations mutually exclusive: you
> can build one or the other, but not both.

My reservation with "build fscache2" is that it's going to take some
time to do, and I really want rid of ->readpages as soon as possible.

What I'd like to see is netfs_readahead() existing as soon as possible,
built on top of the current core.  Then filesystems can implement
netfs_read_request_ops one by one, and they become insulated from the
transition.
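A rough Python model of that insulation: each filesystem supplies only a small ops table, and the common helper calls through it, so the helper's internals can be swapped out (old core today, a rewrite later) without touching the filesystems.  The op names here are invented for the sketch, not the actual netfs_read_request_ops fields.

```python
def netfs_readahead(request, ops):
    # Common helper: let the netfs/cache shape the request, then let the
    # netfs issue the I/O.  Its internals can change freely as long as
    # the ops contract stays the same.
    request = ops["expand"](request)
    return ops["issue_read"](request)

# A minimal "filesystem" plugs in only its own behaviour...
toy_fs_ops = {
    "expand": lambda r: (r[0] & ~4095, (r[1] + 4095) & ~4095),  # page-align
    "issue_read": lambda r: f"read bytes [{r[0]}, {r[1]})",
}
print(netfs_readahead((100, 5000), toy_fs_ops))  # read bytes [0, 8192)
```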


* Cut down implementation of fscache new API
  2021-01-14 10:45 Redesigning and modernising fscache David Howells
  2021-01-14 16:19 ` Matthew Wilcox
@ 2021-01-18 23:36 ` David Howells
  1 sibling, 0 replies; 3+ messages in thread
From: David Howells @ 2021-01-18 23:36 UTC (permalink / raw)
  To: linux-fsdevel, linux-cachefs
  Cc: dhowells, jlayton, dwysocha, Matthew Wilcox, Dominique Martinet,
	Steve French, Trond Myklebust, Anna Schumaker, Christoph Hellwig,
	dchinner, Linus Torvalds, v9fs-developer, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, linux-kernel

Take a look at:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/

I've extracted the netfs helper library from my patch set and built an
alternative cut-down I/O API for the existing fscache code as a bridge to
moving to a new fscache implementation.  With this, a netfs now has two
choices: use the existing API as is or use the netfs lib and the alternative
API.  You can't mix the two APIs - a netfs has to use one or the other.

It works with AFS, at least for reading data through a cache, and without a
cache, xfstests is quite happy.  I was able to take a bunch of the AFS patches
from my fscache-iter branch (the full rewrite) and apply them with minimal
changes.  Since it goes through the new I/O API in both cases, those changes
should be the same.  The main differences are in the cookie wrangling API.

The alternative API is different from the current in the following ways:

 (1) It uses kiocbs to do async DIO rather than using readpage() with page
     wake queue snooping and vfs_write().

 (2) It uses SEEK_HOLE/SEEK_DATA rather than bmap() to determine the location
     of data in the file.  This is still broken because we can't rely on this
     information in the backing filesystem.

 (3) It completely changes how PG_fscache is used.  As with the new API, it's
     used to indicate an in-progress write to the cache from a page rather
     than a page the cache knows about.

 (4) It doesn't keep track of the netfs's pages beyond the termination of an
     I/O operation.  The old API added pages that have outstanding writes to
     the cache to a radix tree for a background writer; now an async kiocb is
     dispatched.

 (5) The netfs needs to call fscache_begin_read_operation() from its
     ->begin_cache_operation() handler as passed to the netfs helper lib.
     This tells the netfs helpers how to access the cache.

 (6) It relies on the netfs helper lib to reissue a failed cache read to the
     server.

 (7) Handles THPs.

 (8) Completely implements ->readahead() and ->readpage(), and implements a
     chunk of ->write_begin().
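Point (2) can be demonstrated from userspace, since SEEK_DATA/SEEK_HOLE are ordinary lseek() whence values: they report where a sparse file actually has data, which is what the alternative API uses in place of bmap().  The caveat in the text still applies - an extent-based backing fs may turn holes into zeroed blocks or merge extents, so this is not a durable record of what the cache wrote.

```python
import os, tempfile

fd, path = tempfile.mkstemp()
try:
    # Write 4KiB of data at the 1MiB mark, leaving a hole before it.
    os.pwrite(fd, b"x" * 4096, 1 * 1024 * 1024)
    data_at = os.lseek(fd, 0, os.SEEK_DATA)        # first data extent
    hole_at = os.lseek(fd, data_at, os.SEEK_HOLE)  # hole (or EOF) after it
    print(data_at, hole_at)
finally:
    os.close(fd)
    os.unlink(path)
```

On a filesystem without sparse-file support the generic VFS fallback reports the whole file as data, which is exactly the kind of ambiguity that makes this unreliable as cache metadata.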

Things it doesn't address:

 (1) Mapping the content independently of the backing filesystem's metadata.

 (2) Getting rid of the backpointers into the netfs.

 (3) Simplifying the management of cookies and objects and their processing.

 (4) Holding an open file to the cache for any great length of time.  It gets
     a new file struct for each read op it does on the cache and drops it
     again afterwards.

 (5) Pinning the cache context/state required to handle a deferred write to
     the cache from ->write_begin() as performed by, say, ->writepages().

David



end of thread, other threads:[~2021-01-18 23:38 UTC | newest]
