linux-cifs.vger.kernel.org archive mirror
* FS-Cache/CacheFiles rewrite
@ 2019-11-13 17:55 David Howells
  2019-11-13 18:46 ` Jeff Layton
  2019-11-14 13:40 ` How to avoid using bmap in cachefiles -- " David Howells
  0 siblings, 2 replies; 3+ messages in thread
From: David Howells @ 2019-11-13 17:55 UTC (permalink / raw)
  To: Steve French, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steve Dickson, Alexander Viro
  Cc: dhowells, v9fs-developer, linux-afs, linux-cifs, linux-cachefs,
	ceph-devel, linux-nfs, linux-fsdevel, linux-kernel

Hi,

I've been rewriting the local cache for network filesystems with the aim of
simplifying it, speeding it up, reducing its memory overhead and making it
more understandable and easier to debug.

For the moment, fscache support is disabled in all network filesystems that
were using it, apart from afs.

To this end, I have so far made the following changes to fscache:

 (1) The fscache_cookie_def struct has gone, along with all the callback
     functions it used to contain.  The filesystem stores the auxiliary data
     and file size into the cookie and these are written back lazily (possibly
     too lazily at the moment).  Any necessary information is passed in to
     fscache_acquire_cookie().

 (2) The object state machine has been removed and replaced by a much simpler
     dispatcher that runs the entire cookie instantiation procedure, cookie
     teardown procedure or cache object withdrawal procedure in one go without
     breaking it down into cancellable/abortable states.

     To avoid latency issues, a thread pool is created to which these
     operations will be handed off if any threads are idle; if no threads are
     idle, the operation is run in the process that triggered it.

 (3) The entire I/O API has been deleted and replaced with one that *only*
     provides a "read cache to iter" function and a "write iter to cache"
     function.  The cache therefore neither knows nor cares where netfs pages
     are - and indeed, reads and writes don't need to go to such places.

 (4) The netfs must allow the cache the opportunity to 'shape' a read from the
     server, both from ->readpages() and from ->write_begin(), so that the
     size and start of the read are suitably aligned for the cache
     granularity.  Cachefiles is currently using a 256KiB granule size.

     A helper is provided to do most of the work: fscache_read_helper().  (A
     rough sketch of the alignment arithmetic follows this list.)

 (5) An additional layer, an fscache_io_handle, has been interposed in the I/O
     API.  This allows the cache to store transient state, such as the file
     struct pointer for the open backing file, for as long as the netfs file
     is open.

     I'm tempted on one hand to merge this into the fscache_object struct and
     on the other hand to use this to get rid of 'cookie enablement' and allow
     already open files to be connected to the cache.

 (6) The PG_fscache bit is now set on a page to indicate that the page is
     being written to the cache and cleared upon completion.  write_begin,
     page_mkwrite, releasepage, invalidatepage, etc. can wait on this.

 (7) Cookie removal now read-locks the semaphore that is used to manage
     addition and removal of a cache.  This greatly simplifies the logic in
     detaching an object from a cookie and cleaning it up as relinquishment
     and withdrawal can't then happen simultaneously.

     It does mean, though, that cookie relinquishment is held up by cache
     removal.
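
By way of illustration, the shaping in (4) boils down to rounding the start of
the read down and the end up to a granule boundary.  A minimal sketch, with
the helper name and prototype made up for illustration (and assuming the
requested range starts below EOF):

        #define CACHE_GRANULE_SIZE	(256 * 1024)

        /* Round a proposed read out to cache granule boundaries, but don't
         * ask the server for data beyond EOF (the last granule may be
         * partial).
         */
        static void cache_shape_read(loff_t *_start, size_t *_len, loff_t i_size)
        {
                loff_t start = round_down(*_start, CACHE_GRANULE_SIZE);
                loff_t end   = round_up(*_start + *_len, CACHE_GRANULE_SIZE);

                if (end > i_size)
                        end = i_size;

                *_start = start;
                *_len   = end - start;
        }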

And the following changes to cachefiles:

 (1) The I/O code has been replaced.  The page waitqueue snooping and deferred
     backing-page to netfs-page copy are now entirely gone, and asynchronous
     direct I/O through kiocbs is now used instead to effect the transfer of
     data to/from the cache.  (The rough shape of the submission is sketched
     after this list.)

     This affords a speed increase of something like 40-50% and reduces the
     amount of memory that is pinned during I/O.

 (2) bmap() is no longer used to detect the presence of blocks in the
     filesystem.  With a modern extent-based filesystem, this may give both
     false positives and false negatives if the filesystem optimises an extent
     by eliminating a block of zeros or adds a block to bridge between two
     close neighbours.

     Instead, a content map is stored in an xattr on the backing file, with 1
     bit per 256KiB block.  The cache shapes the netfs's read requests so that
     multiple-of-256KiB ranges are read from the server and then written back
     to the cache.  (A sketch of how such a map might be consulted follows
     this list.)

 (3) The content map and attributes are then stored lazily when the object is
     destroyed.  This may be too lazy.
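
To give a feel for (1), the submission side is roughly of the following shape.
This is a simplification with made-up names; error handling is omitted and the
kiocb would really be freed from the completion handler:

        static ssize_t cache_submit_read(struct file *backer, loff_t pos,
                                         struct iov_iter *iter,
                                         void (*done)(struct kiocb *, long, long))
        {
                struct kiocb *iocb;

                iocb = kzalloc(sizeof(*iocb), GFP_KERNEL);
                if (!iocb)
                        return -ENOMEM;

                init_sync_kiocb(iocb, backer);
                iocb->ki_pos      = pos;
                iocb->ki_flags   |= IOCB_DIRECT;
                iocb->ki_complete = done;	/* non-NULL makes the kiocb async */

                /* Submit the read; completion comes via done(). */
                return call_read_iter(backer, iocb, iter);
        }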
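
And for (2), checking whether a granule is present in the cache is then just a
bit test against that map, along these lines (sketch only; the real code also
has to handle map expansion and write-back of the xattr):

        #define CACHEFILES_GRAN_SHIFT	18	/* 256KiB granules */

        /* 1 bit per granule; a set bit means the data is in the cache file. */
        static bool cache_granule_present(const u8 *content_map, loff_t pos)
        {
                pgoff_t granule = pos >> CACHEFILES_GRAN_SHIFT;

                return content_map[granule / 8] & (1 << (granule % 8));
        }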

To aid this I've added the following:

 (1) Wait/wake functions for the PG_fscache bit.

 (2) ITER_MAPPING iterator that refers to a contiguous sequence of pinned
     pages with no holes in a mapping.  This means you don't have to allocate
     a sequence of bio_vecs to represent the same thing.

     As stated, the pages *must* be pinned - such as by PG_locked,
     PG_writeback or PG_fscache - before iov_iter_mapping() is called to set
     the mapping up.
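
As a rough illustration of how (2) would get used (the iov_iter_mapping() call
shown is illustrative of the shape of the interface rather than the exact
prototype):

        /* The pages covering [start, start + len) must already be pinned,
         * e.g. by PG_locked, PG_writeback or PG_fscache.
         */
        static void netfs_iter_over_pages(struct address_space *mapping,
                                          loff_t start, size_t len,
                                          struct iov_iter *iter)
        {
                iov_iter_mapping(iter, READ, mapping, start, len);

                /* The iterator can now be handed straight to the cache's
                 * read-to-iter or write-from-iter operation; no bio_vec
                 * array is needed.
                 */
        }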

Things that still need doing:

 (1) afs (and any other netfs) needs to write changes to the cache at the same
     time it writes them to the server so that the cache doesn't get out of
     sync.  This is also necessary to implement write-back caching and
     disconnected operation.

 (2) The content map is limited by the maximum xattr size.  Is it possible to
     configure the backing filesystem so that it doesn't merge extents across
     certain boundaries or eliminate blocks of zeros so that I don't need a
     content map?

 (3) Use O_TMPFILE in the cache to effect immediate invalidation.  I/O can
     then continue to progress whilst the cache driver replaces the linkage.

 (4) The file in the cache needs to be truncated if the netfs file is
     shortened by truncation.

 (5) Data insertion into the cache is not currently checked for space
     availability.

 (6) The stats need going over.  Some of them are obsolete and there are no
     I/O stats working at the moment.

 (7) Replacement I/O tracepoints are required.

Future changes:

 (1) Get rid of cookie enablement.

 (2) Frame the limit on the cache capacity in terms of an amount of data that
     can be stored in it rather than an amount of free space that must be
     kept.

 (3) Move culling out of cachefilesd into the kernel.

 (4) Use the I/O handle to add caching to files that are already open, perhaps
     listing I/O handles from the cache tag.

Questions:

 (*) Does it make sense to actually permit multiple caches?

 (*) Do we want to allow multiple filesystem instances (think NFS) to use the
     same cache objects?  fscache no longer knows about the netfs state, and
     the netfs now just reads and writes to the cache, so it's kind of
     possible - but coherency management is tricky and would definitely be up
     to the netfs.

The patches can be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter

I'm not going to post them for the moment unless someone really wants that.

David



* Re: FS-Cache/CacheFiles rewrite
  2019-11-13 17:55 FS-Cache/CacheFiles rewrite David Howells
@ 2019-11-13 18:46 ` Jeff Layton
  2019-11-14 13:40 ` How to avoid using bmap in cachefiles -- " David Howells
  1 sibling, 0 replies; 3+ messages in thread
From: Jeff Layton @ 2019-11-13 18:46 UTC (permalink / raw)
  To: David Howells, Steve French, Trond Myklebust, Anna Schumaker,
	Steve Dickson, Alexander Viro
  Cc: v9fs-developer, linux-afs, linux-cifs, linux-cachefs, ceph-devel,
	linux-nfs, linux-fsdevel, linux-kernel

On Wed, 2019-11-13 at 17:55 +0000, David Howells wrote:
> Hi,
> 
> I've been rewriting the local cache for network filesystems with the aim of
> simplifying it, speeding it up, reducing its memory overhead and making it
> more understandable and easier to debug.
> 
> For the moment, fscache support is disabled in all network filesystems that
> were using it, apart from afs.
> 
> To this end, I have so far made the following changes to fscache:
> 
>  (1) The fscache_cookie_def struct has gone, along with all the callback
>      functions it used to contain.  The filesystem stores the auxiliary data
>      and file size into the cookie and these are written back lazily (possibly
>      too lazily at the moment).  Any necessary information is passed in to
>      fscache_acquire_cookie().
> 
>  (2) The object state machine has been removed and replaced by a much simpler
>      dispatcher that runs the entire cookie instantiation procedure, cookie
>      teardown procedure or cache object withdrawal procedure in one go without
>      breaking it down into cancellable/abortable states.
> 
>      To avoid latency issues, a thread pool is created to which these
>      operations will be handed off if any threads are idle; if no threads are
>      idle, the operation is run in the process that triggered it.
> 
>  (3) The entire I/O API has been deleted and replaced with one that *only*
>      provides a "read cache to iter" function and a "write iter to cache"
>      function.  The cache therefore neither knows nor cares where netfs pages
>      are - and indeed, reads and writes don't need to go to such places.
> 
>  (4) The netfs must allow the cache the opportunity to 'shape' a read from the
>      server, both from ->readpages() and from ->write_begin(), so that the
>      size and start of the read are suitably aligned for the cache
>      granularity.  Cachefiles is currently using a 256KiB granule size.
> 
>      A helper is provided to do most of the work: fscache_read_helper().
> 
>  (5) An additional layer, an fscache_io_handle, has been interposed in the I/O
>      API that allows the cache to store transient stuff, such as the open file
>      struct pointer to the backing file for the duration of the netfs file
>      being open.
> 
>      I'm tempted on one hand to merge this into the fscache_object struct and
>      on the other hand to use this to get rid of 'cookie enablement' and allow
>      already open files to be connected to the cache.
> 
>  (6) The PG_fscache bit is now set on a page to indicate that the page is
>      being written to the cache and cleared upon completion.  write_begin,
>      page_mkwrite, releasepage, invalidatepage, etc. can wait on this.
> 
>  (7) Cookie removal now read-locks the semaphore that is used to manage
>      addition and removal of a cache.  This greatly simplifies the logic in
>      detaching an object from a cookie and cleaning it up as relinquishment
>      and withdrawal can't then happen simultaneously.
> 
>      It does mean, though, that cookie relinquishment is held up by cache
>      removal.
> 
> And the following changes to cachefiles:
> 
>  (1) The I/O code has been replaced.  The page waitqueue snooping and deferred
>      backing-page to netfs-page copy are now entirely gone, and asynchronous
>      direct I/O through kiocbs is now used instead to effect the transfer of
>      data to/from the cache.
> 
>      This affords a speed increase of something like 40-50% and reduces the
>      amount of memory that is pinned during I/O.
> 
>  (2) bmap() is no longer used to detect the presence of blocks in the
>      filesystem.  With a modern extent based filesystem, this may give both
>      false positives and false negatives if the filesystem optimises an extent
>      by eliminating a block of zeros or adds a block to bridge between two
>      close neighbours.
> 
>      Instead, a content map is stored in an xattr on the backing file, with 1
>      bit per 256KiB block.  The cache shapes the netfs's read requests to
>      request multiple-of-256KiB reads from the server, which are then written
>      back.
> 
>  (3) The content map and attributes are then stored lazily when the object is
>      destroyed.  This may be too lazy.
> 

Having something that just provides a caching facility without so many
tendrils back into the netfs seems like a big improvement.

> To aid this I've added the following:
> 
>  (1) Wait/wake functions for the PG_fscache bit.
> 
>  (2) ITER_MAPPING iterator that refers to a contiguous sequence of pinned
>      pages with no holes in a mapping.  This means you don't have to allocate
>      a sequence of bio_vecs to represent the same thing.
> 
>      As stated, the pages *must* be pinned - such as by PG_locked,
>      PG_writeback or PG_fscache - before iov_iter_mapping() is called to set
>      the mapping up.
> 
> Things that still need doing:
> 
>  (1) afs (and any other netfs) needs to write changes to the cache at the same
>      time it writes them to the server so that the cache doesn't get out of
>      sync.  This is also necessary to implement write-back caching and
>      disconnected operation.
> 
>  (2) The content map is limited by the maximum xattr size.  Is it possible to
>      configure the backing filesystem so that it doesn't merge extents across
>      certain boundaries or eliminate blocks of zeros so that I don't need a
>      content map?
> 
>  (3) Use O_TMPFILE in the cache to effect immediate invalidation.  I/O can
>      then continue to progress whilst the cache driver replaces the linkage.
> 
>  (4) The file in the cache needs to be truncated if the netfs file is
>      shortened by truncation.
> 
>  (5) Data insertion into the cache is not currently checked for space
>      availability.
> 
>  (6) The stats need going over.  Some of them are obsolete and there are no
>      I/O stats working at the moment.
> 
>  (7) Replacement I/O tracepoints are required.
> 
> Future changes:
> 
>  (1) Get rid of cookie enablement.
> 
>  (2) Frame the limit on the cache capacity in terms of an amount of data that
>      can be stored in it rather than an amount of free space that must be
>      kept.
> 
>  (3) Move culling out of cachefilesd into the kernel.
> 
>  (4) Use the I/O handle to add caching to files that are already open, perhaps
>      listing I/O handles from the cache tag.
> 
> Questions:
> 
>  (*) Does it make sense to actually permit multiple caches?
> 

I don't see a lot of value in having more than one per fstype. Given
that the index_key can be as big as you'd reasonably need, it seems
better to just require that the netfs drivers generate keys that are
unique system-wide.

fscache should, however, be prepared to deal with collisions between
different drivers (e.g. NFS and ceph). That said, we could just require
that the filesystems add some per-fstype value into each index_key.
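
Just to sketch what I mean (the struct here is made up, and magic numbers are
only one way to do the discrimination):

        /* Hypothetical layout for a system-wide-unique index key: prefix
         * whatever key the netfs already uses with an fstype discriminator.
         */
        struct fscache_unique_key {
                __le32	fstype;		/* e.g. NFS_SUPER_MAGIC or CEPH_SUPER_MAGIC */
                __u8	fs_key[];	/* the existing per-fs key bytes */
        };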

>  (*) Do we want to allow multiple filesystem instances (think NFS) to use the
>      same cache objects?  fscache no longer knows about the netfs state, and
>      the netfs now just reads and writes to the cache, so it's kind of
>      possible - but coherency management is tricky and would definitely be up
>      to the netfs.

NFS is sort of a difficult example. It aggressively shares superblocks
where it can. You only get multiple inodes for the same server-side
object when you mount with "-o nosharecache" (or some other option
prevents sharing), and at that point you probably don't want to share
objects.

For something like ceph, you could have two mounts with different local
superblocks that point to different subtrees of the root cephfs. In that
configuration we can have two different inodes (in different sb's) for
the same cephfs MDS inode (maybe hardlinked across the two mounted
directories).

I think it comes down to how atomic the read and write operations are.
Are the reads and writes serialized such that a read will block until a
write completes?

Assuming that you won't end up with the data in some half-baked state,
then leaving that up to the netfs seems like the right thing to do. That
seems more in accordance with just having fscache be a simple(r) caching
layer anyway.

> The patches can be found here:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
> 
> I'm not going to post them for the moment unless someone really wants that.

The diffstat looks great so far!

 62 files changed, 4036 insertions(+), 7095 deletions(-)

-- 
Jeff Layton <jlayton@kernel.org>



* How to avoid using bmap in cachefiles -- FS-Cache/CacheFiles rewrite
  2019-11-13 17:55 FS-Cache/CacheFiles rewrite David Howells
  2019-11-13 18:46 ` Jeff Layton
@ 2019-11-14 13:40 ` David Howells
  1 sibling, 0 replies; 3+ messages in thread
From: David Howells @ 2019-11-14 13:40 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner, Theodore Ts'o
  Cc: dhowells, Alexander Viro, v9fs-developer, linux-afs, linux-cifs,
	linux-cachefs, ceph-devel, linux-nfs, linux-fsdevel,
	linux-kernel

Hi Christoph,

I've been rewriting cachefiles in the kernel and it now uses kiocbs to do
async direct I/O to/from the cache files - which seems to make a 40-48% speed
improvement.

However, I've replaced the use of bmap internally to detect whether data is
present or not - which is dodgy for a number of reasons, not least that
extent-based filesystems might insert or remove blocks of zeros to shape the
extents better, thereby rendering the metadata information useless for
cachefiles.

But using a separate map has a couple of problems:

 (1) The map is metadata kept outside of the filesystem journal, so coherency
     management is necessary.

 (2) The map gets hard to manage for very large files (I'm using 256KiB
     granules, so 1 bit per granule means a 512-byte map block can span 1GiB)
     and xattrs can be of limited capacity.
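
To put numbers on (2): a 512-byte map block is 4096 bits, and 4096 * 256KiB =
1GiB of coverage, so a 1TiB cache file would need roughly a 512KiB map, which
is well beyond what a single xattr can usually hold.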

I seem to remember you said something along the lines of it being possible to
tell the filesystem not to do discarding and insertion of blocks of zeros.  Is
there a generic way to do that?

Also, is it possible to make it so that I can tell an O_DIRECT read to fail
partially or, better, completely if there's no data to be had in part of the
range?  I can see DIO_SKIP_HOLES, but that only seems to affect writes.

Thanks,
David


