* Upcoming: fscache rewrite
@ 2020-07-30 11:51 David Howells
  2020-07-30 12:16 ` Matthew Wilcox
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: David Howells @ 2020-07-30 11:51 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Alexander Viro, Matthew Wilcox, Christoph Hellwig,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Hi Linus, Trond/Anna, Steve, Eric,

I have an fscache rewrite that I'm tempted to put in for the next merge
window:

	https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/

It improves the code by:

 (*) Ripping out the stuff that uses page cache snooping and kernel_write()
     and using kiocb instead.  This gives multiple wins: it uses async DIO
     rather than snooping for updated pages and then copying them, and
     there's less VM overhead.

 (*) Object management is also simplified, getting rid of the state machine
     that was managing things and using a much simplified thread pool instead.

 (*) Object invalidation creates a tmpfile and diverts new activity to that so
     that it doesn't have to synchronise in-flight ADIO.

 (*) Using a bitmap stored in an xattr rather than using bmap to find out if
     a block is present in the cache.  Probing the backing filesystem's
     metadata to find out is not reliable in modern extent-based filesystems
     as they may insert or remove blocks of zeros.  Even SEEK_HOLE/SEEK_DATA
     are problematic since they don't distinguish transparently-inserted
     bridging.  (A sketch of the bitmap check follows this list.)
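
To illustrate the shape of that check, here's a minimal sketch (the helper
and names are invented, and the 256K granule size is just the figure
discussed later in this thread, not a settled constant):

	/* Minimal sketch, not the actual cachefiles code: test whether a
	 * cache granule is present by checking its bit in a content bitmap
	 * loaded from an xattr on the backing file. */
	#define CACHE_GRANULE_SHIFT	18	/* assume 256K granules */

	static bool cache_granule_present(const u8 *content_map,
					  size_t map_size, loff_t pos)
	{
		unsigned long granule = pos >> CACHE_GRANULE_SHIFT;

		if (granule / 8 >= map_size)
			return false;		/* beyond the mapped bitmap */
		return content_map[granule / 8] & (1 << (granule % 8));
	}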

I've provided a read helper that handles ->readpage, ->readpages, and
preparatory writes in ->write_begin.  Willy is looking at using this as a way
to roll his new ->readahead op out into filesystems.  A good chunk of this
will move into MM code.
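
To give an idea of the shape from the netfs side, a ->readpage might reduce
to something like the following (a sketch only; the request structure and
helper names are illustrative, not the final API):

	/* Illustrative sketch: a netfs ->readpage that defers to the fscache
	 * read helper, which satisfies what it can from the cache and calls
	 * back into the netfs for the rest. */
	static int myfs_readpage(struct file *file, struct page *page)
	{
		struct myfs_read_request *req;

		req = myfs_alloc_read_request(file, page); /* hypothetical */
		if (!req)
			return -ENOMEM;
		req->issue_op = myfs_fetch_from_server;	/* netfs I/O path */
		return fscache_read_helper(req);	/* cache-first read */
	}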

The code is simpler, and this is nice too:

 67 files changed, 5947 insertions(+), 8294 deletions(-)

not including the documentation changes, which I still need to convert to
ReST format.  Those remove a whole bunch more lines.

But there are reasons you might not want to take it yet:

 (1) It starts off by disabling fscache support in all the filesystems that
     use it: afs, nfs, cifs, ceph and 9p.  I've taken care of afs, Dave
     Wysochanski has patches for nfs:

	https://lore.kernel.org/linux-nfs/1596031949-26793-1-git-send-email-dwysocha@redhat.com/

     but they haven't been reviewed by Trond or Anna yet, and Jeff Layton has
     patches for ceph:

	https://marc.info/?l=ceph-devel&m=159541538914631&w=2

     and I've briefly discussed cifs with Steve, but nothing has started there
     yet.  9p I've not looked at yet.

     Now, if we're okay with going for a kernel release with 4 of the 5
     filesystems having caching disabled and then pushing the changes for
     individual filesystems through their respective trees, it might be
     easier.

     Unfortunately, I wasn't able to get together with Trond and Anna at LSF
     to discuss this.

 (2) The patched afs fs passed xfstests -g quick (unlike the upstream code
     that oopses pretty quickly with caching enabled).  Dave and Jeff's nfs
     and ceph code is getting close, but not quite there yet.

 (3) Al has objections to the ITER_MAPPING iov_iter type that I added

	https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@ZenIV.linux.org.uk/

     but note that iov_iter_for_each_range() is not actually used by anything.

     However, Willy likes it and would prefer to make it ITER_XARRAY instead
     as he might be able to use it in other places, though there's an issue
     where I'm calling find_get_pages_contig() which takes a mapping (though
     all it does is then get the xarray out of it).

     Instead I would have to use ITER_BVEC, which has quite a high overhead,
     though it would mean that the RCU read lock wouldn't be necessary.  This
     would require 1K of memory for every 256K block the cache wants to read
     (64 pages at 16 bytes per bio_vec entry); for any read >1M, the bio_vec
     array would exceed a page and I'd have to use vmalloc() instead.

     I'd also prefer not to use ITER_BVEC because the offset and length are
     superfluous here.  If ITER_MAPPING is not good, would it be possible to
     have an ITER_PAGEARRAY that just takes a page array instead?  Or, even,
     create a transient xarray?  (A sketch of that idea follows this list.)

 (4) The way object culling is managed needs overhauling too, but that's a
     whole 'nother patchset.  We could wait till that's done too, but its lack
     doesn't prevent what we have now from being used.
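
For illustration, the transient-xarray idea in (3) might look roughly like
this (a sketch only; error handling elided):

	/* Sketch: pin the target pages into a private, transient xarray so
	 * that an xarray-based iterator can walk them, then tear the xarray
	 * down again afterwards. */
	struct xarray buffer;
	unsigned int i;

	xa_init(&buffer);
	for (i = 0; i < nr_pages; i++)
		xa_store(&buffer, first_index + i, pages[i], GFP_KERNEL);

	/* ... point the iterator at &buffer and do the I/O ... */

	xa_destroy(&buffer);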

Thoughts?

David



* Re: Upcoming: fscache rewrite
  2020-07-30 11:51 Upcoming: fscache rewrite David Howells
@ 2020-07-30 12:16 ` Matthew Wilcox
  2020-07-30 12:36 ` David Howells
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2020-07-30 12:16 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, Alexander Viro, Christoph Hellwig, Jeff Layton,
	Dave Wysochanski, Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

On Thu, Jul 30, 2020 at 12:51:16PM +0100, David Howells wrote:
>  (3) Al has objections to the ITER_MAPPING iov_iter type that I added
> 
> 	https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@ZenIV.linux.org.uk/
> 
>      but note that iov_iter_for_each_range() is not actually used by anything.
> 
>      However, Willy likes it and would prefer to make it ITER_XARRAY instead
>      as he might be able to use it in other places, though there's an issue
>      where I'm calling find_get_pages_contig() which takes a mapping (though
>      all it does is then get the xarray out of it).

I suspect you don't need to call find_get_pages_contig().  If you look
at __readahead_batch() in pagemap.h, it does basically what you want
(other than being wrapped up inside the readahead iterator).  You require
the pages already be pinned in the xarray, so there's no need for the
page_cache_get_speculative() dance that find_get_pages_contig() does,
nor the check for xa_is_value().
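
Something like this open-coded walk ought to suffice (a sketch; locking
details and error handling elided):

	/* Sketch: walk pages that are already pinned in the xarray without
	 * the speculative-reference dance. */
	XA_STATE(xas, &mapping->i_pages, start_index);
	struct page *page;

	rcu_read_lock();
	xas_for_each(&xas, page, end_index) {
		if (xas_retry(&xas, page))
			continue;
		/* use the already-pinned page directly */
	}
	rcu_read_unlock();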

My main concern with your patchset is that it introduces a new page flag
to sleep on which basically means "I am writing this page to the fscache".
I don't understand why you need it; you've elevated the refcount on
the pages so they're not going to get reused for another purpose.
All it does (as far as I can tell) is make a task calling truncate()
wait for the page to finish being written to cache, which isn't actually
necessary.

Overall, I do like the patch series!  It's a big improvement over what we
currently have and will make it easier to finish the readpages->readahead
conversion.


* Re: Upcoming: fscache rewrite
  2020-07-30 11:51 Upcoming: fscache rewrite David Howells
  2020-07-30 12:16 ` Matthew Wilcox
@ 2020-07-30 12:36 ` David Howells
  2020-07-30 13:08 ` Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: David Howells @ 2020-07-30 12:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, torvalds, Alexander Viro, Christoph Hellwig,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> I suspect you don't need to call find_get_pages_contig().  If you look
> at __readahead_batch() in pagemap.h, it does basically what you want
> (other than being wrapped up inside the readahead iterator).  You require
> the pages already be pinned in the xarray, so there's no need for the
> page_cache_get_speculative() dance that find_get_pages_contig() does,
> nor the check for xa_is_value().

I'll have a look at that.

> My main concern with your patchset is that it introduces a new page flag

Technically, the flag already exists - I'm just using it for something
different from what the old fscache code used it for.

> to sleep on which basically means "I am writing this page to the fscache".
> I don't understand why you need it; you've elevated the refcount on
> the pages so they're not going to get reused for another purpose.
> All it does (as far as I can tell) is make a task calling truncate()
> wait for the page to finish being written to cache, which isn't actually
> necessary.

It's also used to prevent starting overlapping async DIO writes to the cache.

See fscache_read_done(), where it's set to cover writing what we've just read
from the server to the cache, and afs_write_back_from_locked_page(), where
it's set to cover writing the data to be written back to the cache.
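
The pattern is roughly the following (illustrative only; the helper names
approximate what the patchset adds around PG_fscache rather than quoting
it exactly):

	/* Flag the page while an async DIO write to the cache is in flight
	 * so that overlapping cache writes (and truncate) wait for it. */
	set_page_fscache(page);			/* mark write-to-cache busy */
	fscache_write_to_cache(cookie, req);	/* start the async write */

	/* ... then, in the write completion handler: */
	unlock_page_fscache(page);	/* clear the flag, wake any waiters */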

David



* Re: Upcoming: fscache rewrite
  2020-07-30 11:51 Upcoming: fscache rewrite David Howells
  2020-07-30 12:16 ` Matthew Wilcox
  2020-07-30 12:36 ` David Howells
@ 2020-07-30 13:08 ` Jeff Layton
  2020-08-03 16:30 ` [GIT PULL] " David Howells
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2020-07-30 13:08 UTC (permalink / raw)
  To: David Howells, torvalds
  Cc: Alexander Viro, Matthew Wilcox, Christoph Hellwig,
	Dave Wysochanski, Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

On Thu, 2020-07-30 at 12:51 +0100, David Howells wrote:
> Hi Linus, Trond/Anna, Steve, Eric,
> 
> I have an fscache rewrite that I'm tempted to put in for the next merge
> window:
> 
> 	https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/
> 
> It improves the code by:
> 
>  (*) Ripping out the stuff that uses page cache snooping and kernel_write()
>      and using kiocb instead.  This gives multiple wins: it uses async DIO
>      rather than snooping for updated pages and then copying them, and
>      there's less VM overhead.
> 
>  (*) Object management is also simplified, getting rid of the state machine
>      that was managing things and using a much simplified thread pool instead.
> 
>  (*) Object invalidation creates a tmpfile and diverts new activity to that so
>      that it doesn't have to synchronise in-flight ADIO.
> 
>  (*) Using a bitmap stored in an xattr rather than using bmap to find out if
>      a block is present in the cache.  Probing the backing filesystem's
>      metadata to find out is not reliable in modern extent-based filesystems
>      as they may insert or remove blocks of zeros.  Even SEEK_HOLE/SEEK_DATA
>      are problematic since they don't distinguish transparently-inserted
>      bridging.
> 
> I've provided a read helper that handles ->readpage, ->readpages, and
> preparatory writes in ->write_begin.  Willy is looking at using this as a way
> to roll his new ->readahead op out into filesystems.  A good chunk of this
> will move into MM code.
> 
> The code is simpler, and this is nice too:
> 
>  67 files changed, 5947 insertions(+), 8294 deletions(-)
> 
> not including the documentation changes, which I still need to convert to
> ReST format.  Those remove a whole bunch more lines.
> 
> But there are reasons you might not want to take it yet:
> 
>  (1) It starts off by disabling fscache support in all the filesystems that
>      use it: afs, nfs, cifs, ceph and 9p.  I've taken care of afs, Dave
>      Wysochanski has patches for nfs:
> 
> 	https://lore.kernel.org/linux-nfs/1596031949-26793-1-git-send-email-dwysocha@redhat.com/
> 
>      but they haven't been reviewed by Trond or Anna yet, and Jeff Layton has
>      patches for ceph:
> 
> 	https://marc.info/?l=ceph-devel&m=159541538914631&w=2
> 
>      and I've briefly discussed cifs with Steve, but nothing has started there
>      yet.  9p I've not looked at yet.
> 
>      Now, if we're okay with going for a kernel release with 4 of the 5
>      filesystems having caching disabled and then pushing the changes for
>      individual filesystems through their respective trees, it might be
>      easier.
> 
>      Unfortunately, I wasn't able to get together with Trond and Anna at LSF
>      to discuss this.
> 
>  (2) The patched afs fs passed xfstests -g quick (unlike the upstream code
>      that oopses pretty quickly with caching enabled).  Dave and Jeff's nfs
>      and ceph code is getting close, but not quite there yet.


That was my experience on cephfs+fscache too -- it often crashed down in
the fscache code. I'd support the approach in (1) above -- put this in
soon and disable the caches in the filesystems. Then push the changes to
re-enable it via fs-specific trees.

The ceph patch series is more or less ready. It passes the whole
xfstests "quick" group run (aside from a few expected failures on
cephfs).

The only real exception is generic/531, which seems to trigger OOM kills
in my testing. The test tries to create a ton of files and leak the file
descriptors. I tend to think that workload is pretty unusual, and given
that fscache was terribly unstable and crashed before, this is still a
marked improvement.

>  (3) Al has objections to the ITER_MAPPING iov_iter type that I added
> 
> 	https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@ZenIV.linux.org.uk/
> 
>      but note that iov_iter_for_each_range() is not actually used by anything.
> 
>      However, Willy likes it and would prefer to make it ITER_XARRAY instead
>      as he might be able to use it in other places, though there's an issue
>      where I'm calling find_get_pages_contig() which takes a mapping (though
>      all it does is then get the xarray out of it).
> 
>      Instead I would have to use ITER_BVEC, which has quite a high overhead,
>      though it would mean that the RCU read lock wouldn't be necessary.  This
>      would require 1K of memory for every 256K block the cache wants to read;
>      for any read >1M, I'd have to use vmalloc() instead.
> 
>      I'd also prefer not to use ITER_BVEC because the offset and length are
>      superfluous here.  If ITER_MAPPING is not good, would it be possible to
>      have an ITER_PAGEARRAY that just takes a page array instead?  Or, even,
>      create a transient xarray?
> 
>  (4) The way object culling is managed needs overhauling too, but that's a
>      whole 'nother patchset.  We could wait till that's done too, but its lack
>      doesn't prevent what we have now from being used.
> 
> Thoughts?
> 
> David
> 

-- 
Jeff Layton <jlayton@redhat.com>



* [GIT PULL] fscache rewrite
  2020-07-30 11:51 Upcoming: fscache rewrite David Howells
                   ` (2 preceding siblings ...)
  2020-07-30 13:08 ` Jeff Layton
@ 2020-08-03 16:30 ` David Howells
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
  4 siblings, 0 replies; 16+ messages in thread
From: David Howells @ 2020-08-03 16:30 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Alexander Viro, Matthew Wilcox, Christoph Hellwig,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Hi Linus,

Here's a set of patches that massively overhauls fscache and cachefiles.
It improves the code by:

 (*) Ripping out the stuff that uses page cache snooping and kernel_write()
     and using kiocb instead.  This gives multiple wins: it uses async DIO
     rather than snooping for updated pages and then copying them, and
     there's less VM overhead.

 (*) Object management is also simplified, getting rid of the state machine
     that was managing things and using a much simplified thread pool
     instead.

 (*) Object invalidation creates a tmpfile and diverts new activity to that
     so that it doesn't have to synchronise in-flight ADIO.

 (*) Using a bitmap stored in an xattr rather than using bmap to find out if
     a block is present in the cache.

     Probing the backing filesystem's metadata to find out is not reliable
     in modern extent-based filesystems as they may insert or remove blocks
     of zeros.  Even SEEK_HOLE/SEEK_DATA are problematic since they don't
     distinguish transparently-inserted bridging.

The patchset includes a read helper that handles ->readpage, ->readpages,
and preparatory writes in ->write_begin.  Matthew Wilcox is looking at
using this as a way to roll his new ->readahead op out into filesystems.  A
good chunk of this will move into MM code.

Note that this patchset does not include documentation changes yet.  I have
them (mostly) done, but they were based on the plain-text format that got
ReST-ified, and I haven't managed to get around to the conversion yet.  I
can create a follow-up patchset for that if this is taken.

Further note: there's a last-minute change due to a bit of debugging code
that got left in mm/filemap.c and needed removing.

However, there are reasons you might not want to take it yet:

 (1) It starts off by disabling fscache support in all the filesystems that
     use it: afs, nfs, cifs, ceph and 9p.  I've taken care of afs, Dave
     Wysochanski has patches for nfs:

	https://lore.kernel.org/linux-nfs/1596031949-26793-1-git-send-email-dwysocha@redhat.com/

     but Trond and Anna haven't said anything yet, and Jeff Layton has
     patches for ceph:

	https://marc.info/?l=ceph-devel&m=159541538914631&w=2

     and I've briefly discussed cifs with Steve, but nothing has started
     there yet.  9p I haven't looked at yet.

     Are we okay with going for a kernel release with 4 of the 5 filesystems
     having caching disabled and then pushing the changes for individual
     filesystems through their respective trees?  I floated this question
     last week, but have had no replies either way.

 (2) The patched afs fs passed xfstests -g quick (unlike the upstream code
     that oopses pretty quickly with caching enabled).  Dave and Jeff's nfs
     and ceph code is getting close, but not quite there yet.

 (3) Al has objections to the ITER_MAPPING iov_iter type that I added

	https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@ZenIV.linux.org.uk/

     but note that iov_iter_for_each_range() is not actually used by anything.

     However, Willy likes it and would prefer to make it ITER_XARRAY instead
     as he might be able to use it in other places, though there's an issue
     where I'm calling find_get_pages_contig() which takes a mapping (though
     all it does is then get the xarray out of it).  Willy has made
     suggestions as to how this may be achieved, but I haven't got round to
     looking at them yet.

     Instead I would have to use ITER_BVEC, which has quite a high overhead,
     though it would mean that the RCU read lock wouldn't be necessary.  This
     would require 1K of memory for every 256K block the cache wants to read;
     for any read >1M, I'd have to use vmalloc() instead.

     I'd also prefer not to use ITER_BVEC because the offset and length are
     superfluous here.  If ITER_MAPPING is not good, would it be possible to
     have an ITER_PAGEARRAY that just takes a page array instead?  Or, even,
     create a transient xarray?

 (4) The way object culling is managed needs overhauling too, but that's a
     separate patchset in its own right.  We could wait till that's done
     too, but its lack doesn't prevent what we have now from being used.

David
---
The following changes since commit 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68:

  Linux 5.8-rc3 (2020-06-28 15:00:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/fscache-iter-20200803

for you to fetch changes up to 2f716a79439b100a3ded54b828f9c87582d13f86:

  fscache: disable cookie when doing an invalidation for DIO write (2020-08-03 17:22:36 +0100)

----------------------------------------------------------------
Filesystem caching rewrite

----------------------------------------------------------------
David Howells (59):
      nfs, cifs, ceph, 9p: Disable use of fscache prior to its rewrite
      afs: Disable use of the fscache I/O routines
      fscache: Add a cookie debug ID and use that in traces
      fscache: Procfile to display cookies
      fscache: Remove the old I/O API
      fscache: Remove the netfs data from the cookie
      fscache: Remove struct fscache_cookie_def
      fscache: Remove store_limit* from struct fscache_object
      fscache: Remove fscache_check_consistency()
      fscache: Remove fscache_attr_changed()
      fscache: Remove obsolete stats
      fscache: Remove old I/O tracepoints
      fscache: Temporarily disable fscache_invalidate()
      fscache: Remove the I/O operation manager
      iov_iter: Add ITER_MAPPING
      vm: Add wait/unlock functions for PG_fscache
      vfs: Export rw_verify_area() for use by cachefiles
      vfs: Provide S_CACHE_FILE inode flag
      mm: Provide lru_to_last_page() to get last of a page list
      cachefiles: Remove tree of active files and use S_CACHE_FILE inode flag
      fscache: Provide a simple thread pool for running ops asynchronously
      fscache: Replace the object management state machine
      fscache: Rewrite the I/O API based on iov_iter
      fscache: Keep track of size of a file last set independently on the server
      fscache, cachefiles: Fix disabled histogram warnings
      fscache: Recast assertion in terms of cookie not being an index
      cachefiles: Remove some redundant checks on unsigned values
      cachefiles: trace: Log coherency checks
      cachefiles: Split cachefiles_drop_object() up a bit
      cachefiles: Implement new fscache I/O backend API
      cachefiles: Merge object->backer into object->dentry
      cachefiles: Implement a content-present indicator and bitmap
      cachefiles: Shape read requests
      cachefiles: Round the cachefile size up to DIO block size
      cachefiles: Implement read and write parts of new I/O API
      cachefiles: Add I/O tracepoints
      fscache: Add read helper
      fscache: Display cache-specific data in /proc/fs/fscache/objects
      fscache: Remove more obsolete stats
      fscache: New stats
      fscache, cachefiles: Rewrite invalidation
      fscache: Implement "will_modify" parameter on fscache_use_cookie()
      fscache: Provide resize operation
      fscache: Remove the update operation
      cachefiles: Shape write requests
      afs: Fix interruption of operations
      afs: Move key to afs_read struct
      afs: Don't truncate iter during data fetch
      afs: Log remote unmarshalling errors
      afs: Set up the iov_iter before calling afs_extract_data()
      afs: Use ITER_MAPPING for writing
      afs: Interpose struct fscache_io_request into struct afs_read
      afs: Note the amount transferred in fetch-data delivery
      afs: Wait on PG_fscache before modifying/releasing a page
      afs: Use new fscache I/O API
      afs: Copy local writes to the cache when writing to the server
      afs: Invoke fscache_resize_cookie() when handling ATTR_SIZE for setattr
      afs: Add O_DIRECT read support
      afs: Skip truncation on the server of data we haven't written yet

Jeff Layton (1):
      fscache: disable cookie when doing an invalidation for DIO write

 fs/9p/Kconfig                          |    2 +-
 fs/Makefile                            |    2 +-
 fs/afs/Kconfig                         |    4 +-
 fs/afs/cache.c                         |   54 --
 fs/afs/cell.c                          |    9 +-
 fs/afs/dir.c                           |  242 +++++--
 fs/afs/file.c                          |  577 +++++++--------
 fs/afs/fs_operation.c                  |    4 +-
 fs/afs/fsclient.c                      |  154 ++--
 fs/afs/inode.c                         |  104 ++-
 fs/afs/internal.h                      |   58 +-
 fs/afs/rxrpc.c                         |  150 ++--
 fs/afs/volume.c                        |    9 +-
 fs/afs/write.c                         |  435 +++++++----
 fs/afs/yfsclient.c                     |  113 ++-
 fs/cachefiles/Makefile                 |    3 +-
 fs/cachefiles/bind.c                   |   11 +-
 fs/cachefiles/content-map.c            |  499 +++++++++++++
 fs/cachefiles/daemon.c                 |   10 +-
 fs/cachefiles/interface.c              |  580 ++++++++-------
 fs/cachefiles/internal.h               |  142 ++--
 fs/cachefiles/io.c                     |  325 +++++++++
 fs/cachefiles/main.c                   |   12 +-
 fs/cachefiles/namei.c                  |  508 +++++--------
 fs/cachefiles/rdwr.c                   |  974 -------------------------
 fs/cachefiles/xattr.c                  |  263 +++----
 fs/ceph/Kconfig                        |    2 +-
 fs/cifs/Kconfig                        |    2 +-
 fs/fscache/Kconfig                     |   24 +-
 fs/fscache/Makefile                    |   15 +-
 fs/fscache/cache.c                     |  145 ++--
 fs/fscache/cookie.c                    |  898 ++++++++++-------------
 fs/fscache/dispatcher.c                |  150 ++++
 fs/fscache/fsdef.c                     |   56 +-
 fs/fscache/histogram.c                 |    2 +-
 fs/fscache/internal.h                  |  264 +++----
 fs/fscache/io.c                        |  206 ++++++
 fs/fscache/main.c                      |   35 +-
 fs/fscache/netfs.c                     |   10 +-
 fs/fscache/obj.c                       |  366 ++++++++++
 fs/fscache/object-list.c               |  129 +---
 fs/fscache/object.c                    | 1133 -----------------------------
 fs/fscache/object_bits.c               |  120 +++
 fs/fscache/operation.c                 |  633 ----------------
 fs/fscache/page.c                      | 1248 --------------------------------
 fs/fscache/proc.c                      |   13 +-
 fs/fscache/read_helper.c               |  701 ++++++++++++++++++
 fs/fscache/stats.c                     |  269 +++----
 fs/internal.h                          |    5 -
 fs/nfs/Kconfig                         |    2 +-
 fs/nfs/fscache-index.c                 |    4 +-
 fs/read_write.c                        |    1 +
 include/linux/fs.h                     |    2 +
 include/linux/fscache-cache.h          |  508 +++----------
 include/linux/fscache-obsolete.h       |   13 +
 include/linux/fscache.h                |  834 +++++++++------------
 include/linux/mm.h                     |    1 +
 include/linux/pagemap.h                |   14 +
 include/linux/uio.h                    |   11 +
 include/net/af_rxrpc.h                 |    2 +-
 include/trace/events/afs.h             |   51 +-
 include/trace/events/cachefiles.h      |  285 ++++++--
 include/trace/events/fscache.h         |  428 ++---------
 include/trace/events/fscache_support.h |   97 +++
 lib/iov_iter.c                         |  286 +++++++-
 mm/filemap.c                           |   18 +
 net/rxrpc/recvmsg.c                    |    9 +-
 67 files changed, 5941 insertions(+), 8295 deletions(-)
 create mode 100644 fs/cachefiles/content-map.c
 create mode 100644 fs/cachefiles/io.c
 delete mode 100644 fs/cachefiles/rdwr.c
 create mode 100644 fs/fscache/dispatcher.c
 create mode 100644 fs/fscache/io.c
 create mode 100644 fs/fscache/obj.c
 delete mode 100644 fs/fscache/object.c
 create mode 100644 fs/fscache/object_bits.c
 delete mode 100644 fs/fscache/operation.c
 delete mode 100644 fs/fscache/page.c
 create mode 100644 fs/fscache/read_helper.c
 create mode 100644 include/linux/fscache-obsolete.h
 create mode 100644 include/trace/events/fscache_support.h



* [GIT PULL] fscache rewrite -- please drop for now
  2020-07-30 11:51 Upcoming: fscache rewrite David Howells
                   ` (3 preceding siblings ...)
  2020-08-03 16:30 ` [GIT PULL] " David Howells
@ 2020-08-10 15:16 ` David Howells
  2020-08-10 15:34   ` Steve French
                     ` (3 more replies)
  4 siblings, 4 replies; 16+ messages in thread
From: David Howells @ 2020-08-10 15:16 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Alexander Viro, Matthew Wilcox, Christoph Hellwig,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Hi Linus,

Can you drop the fscache rewrite pull for now?  We've seen an issue in NFS
integration and need to rework the read helper a bit.  I made an assumption
that fscache will always be able to request that the netfs perform a read of a
certain minimum size - but with NFS you can break that by setting rsize too
small.

We need to make the read helper able to make multiple netfs reads.  This can
help ceph too.
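
Roughly, the helper needs to loop rather than assume one netfs read fills
the block - something like this (a sketch with invented names):

	/* Sketch: when the netfs's rsize is smaller than the block the cache
	 * wants, issue as many netfs reads as it takes to fill the block. */
	static int netfs_fill_block(struct my_read_request *req)
	{
		size_t done = 0;

		while (done < req->len) {
			ssize_t n = req->issue_read(req, done, req->len - done);
			if (n < 0)
				return n;	/* propagate the error */
			if (n == 0)
				break;		/* unexpected EOF */
			done += n;
		}
		return 0;
	}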

Thanks,
David



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
@ 2020-08-10 15:34   ` Steve French
  2020-08-10 15:48   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Steve French @ 2020-08-10 15:34 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Jeff Layton, Dave Wysochanski,
	Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs, CIFS,
	ceph-devel, v9fs-developer, linux-fsdevel, LKML

cifs.ko also can set rsize quite small (even 1K for example, although
that will be more than 10x slower than the default 4MB so hopefully no
one is crazy enough to do that).   I can't imagine an SMB3 server
negotiating an rsize or wsize smaller than 64K in today's world (and
typical is 1MB to 8MB) but the user can specify a much smaller rsize
on mount.  If 64K is an adequate minimum, we could change the cifs
mount option parsing to require a certain minimum rsize if fscache is
selected.

On Mon, Aug 10, 2020 at 10:17 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi Linus,
>
> Can you drop the fscache rewrite pull for now?  We've seen an issue in NFS
> integration and need to rework the read helper a bit.  I made an assumption
> that fscache will always be able to request that the netfs perform a read of a
> certain minimum size - but with NFS you can break that by setting rsize too
> small.
>
> We need to make the read helper able to make multiple netfs reads.  This can
> help ceph too.
>
> Thanks,
> David
>


-- 
Thanks,

Steve


* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
  2020-08-10 15:34   ` Steve French
@ 2020-08-10 15:48   ` David Howells
  2020-08-10 16:09     ` Steve French
  2020-08-10 16:35     ` David Wysochanski
  2020-08-10 16:40   ` Christoph Hellwig
  2020-08-27 15:28   ` David Howells
  3 siblings, 2 replies; 16+ messages in thread
From: David Howells @ 2020-08-10 15:48 UTC (permalink / raw)
  To: Steve French
  Cc: dhowells, Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Jeff Layton, Dave Wysochanski,
	Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs, CIFS,
	ceph-devel, v9fs-developer, linux-fsdevel, LKML

Steve French <smfrench@gmail.com> wrote:

> cifs.ko also can set rsize quite small (even 1K for example, although
> that will be more than 10x slower than the default 4MB so hopefully no
> one is crazy enough to do that).

You can set rsize < PAGE_SIZE?

> I can't imagine an SMB3 server negotiating an rsize or wsize smaller than
> 64K in today's world (and typical is 1MB to 8MB) but the user can specify a
> much smaller rsize on mount.  If 64K is an adequate minimum, we could change
> the cifs mount option parsing to require a certain minimum rsize if fscache
> is selected.

I've borrowed the 256K granule size used by various AFS implementations for
the moment.  A 512-byte xattr can thus hold a bitmap covering 1G of file
space (512 bytes is 4096 bits, and 4096 x 256K = 1G).

David



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:48   ` David Howells
@ 2020-08-10 16:09     ` Steve French
  2020-08-10 16:35     ` David Wysochanski
  1 sibling, 0 replies; 16+ messages in thread
From: Steve French @ 2020-08-10 16:09 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Jeff Layton, Dave Wysochanski,
	Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs, CIFS,
	ceph-devel, v9fs-developer, linux-fsdevel, LKML

On Mon, Aug 10, 2020 at 10:48 AM David Howells <dhowells@redhat.com> wrote:
>
> Steve French <smfrench@gmail.com> wrote:
>
> > cifs.ko also can set rsize quite small (even 1K for example, although
> > that will be more than 10x slower than the default 4MB so hopefully no
> > one is crazy enough to do that).
>
> You can set rsize < PAGE_SIZE?

I have never seen anyone do it, and it would be crazy to set it so small
(it would hurt performance a lot and cause extra work on client and
server), but yes, it can be set very small.  Apparently NFS can set rsize
to 1K as well (see https://linux.die.net/man/5/nfs).

I don't mind adding a minimum rsize check for cifs.ko (preventing a user
from setting rsize below the page size, for example) if there is a
precedent for this in other filesystems or a bug that it would cause.  In
general, my informal perf measurements showed slight advantages with all
servers for larger rsizes up to 4MB (thus the cifs client will negotiate
4MB by default even if the server supports larger), but overriding rsize
on mount by having the user set it to 8MB could help perf to some
servers.  I am hoping we can figure out a way to automatically determine
when to negotiate rsize larger than 4MB, but in the meantime rsize will
almost always be 4MB for cifs (or 1MB on mounts to some older servers),
though some users will benefit slightly from manually setting it to 8MB.

> > I can't imagine an SMB3 server negotiating an rsize or wsize smaller than
> > 64K in today's world (and typical is 1MB to 8MB) but the user can specify a
> > much smaller rsize on mount.  If 64K is an adequate minimum, we could change
> > the cifs mount option parsing to require a certain minimum rsize if fscache
> > is selected.
>
> I've borrowed the 256K granule size used by various AFS implementations for
> the moment.  A 512-byte xattr can thus hold a bitmap covering 1G of file
> space.
>
> David
>


-- 
Thanks,

Steve


* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:48   ` David Howells
  2020-08-10 16:09     ` Steve French
@ 2020-08-10 16:35     ` David Wysochanski
  2020-08-10 17:06       ` Jeff Layton
  1 sibling, 1 reply; 16+ messages in thread
From: David Wysochanski @ 2020-08-10 16:35 UTC (permalink / raw)
  To: David Howells
  Cc: Steve French, Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, CIFS, ceph-devel, v9fs-developer, linux-fsdevel, LKML

On Mon, Aug 10, 2020 at 11:48 AM David Howells <dhowells@redhat.com> wrote:
>
> Steve French <smfrench@gmail.com> wrote:
>
> > cifs.ko also can set rsize quite small (even 1K for example, although
> > that will be more than 10x slower than the default 4MB so hopefully no
> > one is crazy enough to do that).
>
> You can set rsize < PAGE_SIZE?
>
> > I can't imagine an SMB3 server negotiating an rsize or wsize smaller than
> > 64K in today's world (and typical is 1MB to 8MB) but the user can specify a
> > much smaller rsize on mount.  If 64K is an adequate minimum, we could change
> > the cifs mount option parsing to require a certain minimum rsize if fscache
> > is selected.
>
> I've borrowed the 256K granule size used by various AFS implementations for
> the moment.  A 512-byte xattr can thus hold a bitmap covering 1G of file
> space.
>
>

Is it possible to make the granule size configurable, then reject a
registration if the size is too small or not a power of 2?  Then a
netfs using the API could try to set it equal to rsize, and then error
out with a message if the registration was rejected.
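
Something along these lines, say (a sketch; the names are invented):

	/* Sketch of a registration-time check: refuse granule sizes that
	 * are too small or not a power of two. */
	static int fscache_set_granule_size(struct fscache_cookie *cookie,
					    unsigned int size)
	{
		if (size < FSCACHE_MIN_GRANULE)		/* e.g. 64K */
			return -EINVAL;
		if (!is_power_of_2(size))
			return -EINVAL;
		cookie->granule_size = size;		/* invented field */
		return 0;
	}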



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
  2020-08-10 15:34   ` Steve French
  2020-08-10 15:48   ` David Howells
@ 2020-08-10 16:40   ` Christoph Hellwig
  2020-08-27 15:28   ` David Howells
  3 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2020-08-10 16:40 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, Alexander Viro, Matthew Wilcox, Christoph Hellwig,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

On Mon, Aug 10, 2020 at 04:16:59PM +0100, David Howells wrote:
> Hi Linus,
> 
> Can you drop the fscache rewrite pull for now?  We've seen an issue in NFS
> integration and need to rework the read helper a bit.  I made an assumption
> that fscache will always be able to request that the netfs perform a read of a
> certain minimum size - but with NFS you can break that by setting rsize too
> small.
> 
> We need to make the read helper able to make multiple netfs reads.  This can
> help ceph too.

FYI, a giant rewrite dropping support for existing consumers is always
rather awkward.  Is there any way you could pre-stage some infrastructure
changes, and then do a temporary fscache2, which could then be renamed
back to fscache once everyone switched over?


* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 16:35     ` David Wysochanski
@ 2020-08-10 17:06       ` Jeff Layton
  2020-08-17 19:07         ` Steven French
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2020-08-10 17:06 UTC (permalink / raw)
  To: David Wysochanski, David Howells
  Cc: Steve French, Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs, CIFS,
	ceph-devel, v9fs-developer, linux-fsdevel, LKML

On Mon, 2020-08-10 at 12:35 -0400, David Wysochanski wrote:
> On Mon, Aug 10, 2020 at 11:48 AM David Howells <dhowells@redhat.com> wrote:
> > Steve French <smfrench@gmail.com> wrote:
> > 
> > > cifs.ko also can set rsize quite small (even 1K for example, although
> > > that will be more than 10x slower than the default 4MB so hopefully no
> > > one is crazy enough to do that).
> > 
> > You can set rsize < PAGE_SIZE?
> > 
> > > I can't imagine an SMB3 server negotiating an rsize or wsize smaller than
> > > 64K in today's world (and typical is 1MB to 8MB) but the user can specify a
> > > much smaller rsize on mount.  If 64K is an adequate minimum, we could change
> > > the cifs mount option parsing to require a certain minimum rsize if fscache
> > > is selected.
> > 
> > I've borrowed the 256K granule size used by various AFS implementations for
> > the moment.  A 512-byte xattr can thus hold a bitmap covering 1G of file
> > space.
> > 
> > 
> 
> Is it possible to make the granule size configurable, then reject a
> registration if the size is too small or not a power of 2?  Then a
> netfs using the API could try to set equal to rsize, and then error
> out with a message if the registration was rejected.
> 

...or maybe we should just make fscache incompatible with an
rsize that isn't an even multiple of 256K? You need to set mount options
for both, typically, so it would be fairly trivial to check this at
mount time, I'd think.
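
Something as simple as this at mount time would do (a sketch; the context
field names are made up):

	/* Sketch: refuse to enable fscache if rsize isn't a whole number
	 * of 256K granules. */
	if (ctx->use_fscache && (ctx->rsize == 0 || ctx->rsize % SZ_256K)) {
		pr_err("fscache requires rsize to be a multiple of 256K\n");
		return -EINVAL;
	}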

-- 
Jeff Layton <jlayton@redhat.com>



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 17:06       ` Jeff Layton
@ 2020-08-17 19:07         ` Steven French
  0 siblings, 0 replies; 16+ messages in thread
From: Steven French @ 2020-08-17 19:07 UTC (permalink / raw)
  To: Jeff Layton, David Wysochanski, David Howells
  Cc: Steve French, Linus Torvalds, Alexander Viro, Matthew Wilcox,
	Christoph Hellwig, Trond Myklebust, Anna Schumaker,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs, CIFS,
	ceph-devel, v9fs-developer, linux-fsdevel, LKML


On 8/10/20 12:06 PM, Jeff Layton wrote:
> On Mon, 2020-08-10 at 12:35 -0400, David Wysochanski wrote:
>> On Mon, Aug 10, 2020 at 11:48 AM David Howells <dhowells@redhat.com> wrote:
>>> Steve French <smfrench@gmail.com> wrote:
>>>
>>>> cifs.ko also can set rsize quite small (even 1K for example, although
>>>> that will be more than 10x slower than the default 4MB so hopefully no
>>>> one is crazy enough to do that).
>>> You can set rsize < PAGE_SIZE?
>>>
>>>> I can't imagine an SMB3 server negotiating an rsize or wsize smaller than
>>>> 64K in today's world (and typical is 1MB to 8MB) but the user can specify a
>>>> much smaller rsize on mount.  If 64K is an adequate minimum, we could change
>>>> the cifs mount option parsing to require a certain minimum rsize if fscache
>>>> is selected.
>>> I've borrowed the 256K granule size used by various AFS implementations for
>>> the moment.  A 512-byte xattr can thus hold a bitmap covering 1G of file
>>> space.
>>>
>>>
>> Is it possible to make the granule size configurable, then reject a
>> registration if the size is too small or not a power of 2?  Then a
>> netfs using the API could try to set equal to rsize, and then error
>> out with a message if the registration was rejected.
>>
> ...or maybe we should just make fscache incompatible with an
> rsize that isn't an even multiple of 256k? You need to set mount options
> for both, typically, so it would be fairly trivial to check this at
> mount time, I'd think.


Yes - if fscache is specified on mount, it would be easy to round rsize
up (or down) to whatever boundary you prefer in fscache, at least for
cifs.ko (perhaps simply in the mount.cifs helper, so that a warning could
be returned to the user).  The default of 4MB (or 1MB for mounts to some
older servers) should be fine.  Similarly, if the user requested the
default but the server negotiated an unusual size that is not a multiple
of 256K, we could try to round it down if possible (or fail the mount if
it is not possible to round it down to 256K).



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-10 15:16 ` [GIT PULL] fscache rewrite -- please drop for now David Howells
                     ` (2 preceding siblings ...)
  2020-08-10 16:40   ` Christoph Hellwig
@ 2020-08-27 15:28   ` David Howells
  2020-08-27 16:18     ` Dominique Martinet
  2020-08-27 17:14     ` David Howells
  3 siblings, 2 replies; 16+ messages in thread
From: David Howells @ 2020-08-27 15:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Alexander Viro, Matthew Wilcox, Jeff Layton,
	Dave Wysochanski, Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Christoph Hellwig <hch@lst.de> wrote:

> FYI, a giant rewrite dropping support for existing consumers is always
> rather awkward.  Is there any way you could pre-stage some infrastructure
> changes, and then do a temporary fscache2, which could then be renamed
> back to fscache once everyone switched over?

That's a bit tricky.  There are three points that would have to be shared: the
userspace miscdev interface, the backing filesystem and the single index tree.

It's probably easier to just have a go at converting 9P and cifs.  Making the
old and new APIs share would be a fairly hefty undertaking in its own right.

David



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-27 15:28   ` David Howells
@ 2020-08-27 16:18     ` Dominique Martinet
  2020-08-27 17:14     ` David Howells
  1 sibling, 0 replies; 16+ messages in thread
From: Dominique Martinet @ 2020-08-27 16:18 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Alexander Viro, Matthew Wilcox, Jeff Layton,
	Dave Wysochanski, Trond Myklebust, Anna Schumaker, Steve French,
	Eric Van Hensbergen, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

David Howells wrote on Thu, Aug 27, 2020:
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > FYI, a giant rewrite dropping support for existing consumers is always
> > rather awkward.  Is there any way you could pre-stage some infrastructure
> > changes, and then do a temporary fscache2, which could then be renamed
> > back to fscache once everyone switched over?
> 
> That's a bit tricky.  There are three points that would have to be shared: the
> userspace miscdev interface, the backing filesystem and the single index tree.
> 
> It's probably easier to just have a go at converting 9P and cifs.  Making the
> old and new APIs share would be a fairly hefty undertaking in its own right.

While I agree something incremental is probably better, I have some free
time over the next few weeks, so I will take a shot at 9p; it's definitely
going to be easier.


Should I submit patches to you or wait until Linus merges it next cycle
and send them directly?

I see Jeff's ceph patches are still in his tree's ceph-fscache-iter
branch and I don't see them anywhere in your tree.

-- 
Dominique



* Re: [GIT PULL] fscache rewrite -- please drop for now
  2020-08-27 15:28   ` David Howells
  2020-08-27 16:18     ` Dominique Martinet
@ 2020-08-27 17:14     ` David Howells
  1 sibling, 0 replies; 16+ messages in thread
From: David Howells @ 2020-08-27 17:14 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: dhowells, Christoph Hellwig, Alexander Viro, Matthew Wilcox,
	Jeff Layton, Dave Wysochanski, Trond Myklebust, Anna Schumaker,
	Steve French, Eric Van Hensbergen, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	linux-kernel

Dominique Martinet <asmadeus@codewreck.org> wrote:

> Should I submit patches to you or wait until Linus merges it next cycle
> and send them directly?
> 
> I see Jeff's ceph patches are still in his tree's ceph-fscache-iter
> branch and I don't see them anywhere in your tree.

I really want them all to go in the same window, but there may be a
requirement for some filesystem-specific sets (e.g. NFS) to go via the
maintainer tree.

Btw, at the moment, I'm looking at making the fscache read helper support the
new ->readahead() op.

David


