From: David Howells <dhowells@redhat.com>
To: torvalds@linux-foundation.org
Cc: dhowells@redhat.com, Alexander Viro <viro@zeniv.linux.org.uk>,
Matthew Wilcox <willy@infradead.org>,
Christoph Hellwig <hch@lst.de>, Jeff Layton <jlayton@redhat.com>,
Dave Wysochanski <dwysocha@redhat.com>,
Trond Myklebust <trondmy@hammerspace.com>,
Anna Schumaker <anna.schumaker@netapp.com>,
Steve French <sfrench@samba.org>,
Eric Van Hensbergen <ericvh@gmail.com>,
linux-cachefs@redhat.com, linux-afs@lists.infradead.org,
linux-nfs@vger.kernel.org, linux-cifs@vger.kernel.org,
ceph-devel@vger.kernel.org, v9fs-developer@lists.sourceforge.net,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Upcoming: fscache rewrite
Date: Thu, 30 Jul 2020 12:51:16 +0100
Message-ID: <447452.1596109876@warthog.procyon.org.uk>
Hi Linus, Trond/Anna, Steve, Eric,
I have an fscache rewrite that I'm tempted to put in for the next merge
window:
https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/
It improves the code by:
(*) Ripping out the stuff that uses page cache snooping and kernel_write()
and using kiocb instead. This gives multiple wins: it uses async DIO rather
than snooping for updated pages and then copying them, and it has less VM overhead.
(*) Simplifying object management, getting rid of the state machine that was
managing things and using a much simpler thread pool instead.
(*) Having object invalidation create a tmpfile and divert new activity to it
so that it doesn't have to synchronise with in-flight async DIO.
(*) Using a bitmap stored in an xattr rather than using bmap to find out if
a block is present in the cache. Probing the backing filesystem's
metadata to find out is not reliable in modern extent-based filesystems
as they may insert or remove blocks of zeros. Even SEEK_HOLE/SEEK_DATA
are problematic since they don't distinguish transparently inserted
bridging blocks from data that was actually written.
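To make the bitmap idea concrete, here is a minimal userspace sketch of tracking block presence with one bit per block and serialising the result as a byte string that could be stored in an xattr. The 256K granule, the class name and the byte layout are assumptions made for illustration; this is not the on-disk format used by the patches.

```python
# Illustrative sketch only. BLOCK_SIZE, PresenceBitmap and the byte layout
# are assumptions, not the format the fscache patches actually use.
BLOCK_SIZE = 256 * 1024  # assumed cache granule


class PresenceBitmap:
    """One bit per cache block; a set bit means the block holds cached data."""

    def __init__(self, file_size: int, raw: bytes = b""):
        self.nblocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        nbytes = (self.nblocks + 7) // 8
        # Pad or truncate a previously stored bitmap to the current size.
        self.bits = bytearray(raw[:nbytes].ljust(nbytes, b"\0"))

    def mark_present(self, offset: int, length: int) -> None:
        """Set the bit for every block fully covered by [offset, offset + length).

        Partially covered blocks (including a short tail block) are left
        unmarked to keep the sketch simple.
        """
        first = (offset + BLOCK_SIZE - 1) // BLOCK_SIZE
        last = (offset + length) // BLOCK_SIZE  # exclusive
        for blk in range(first, min(last, self.nblocks)):
            self.bits[blk // 8] |= 1 << (blk % 8)

    def is_present(self, offset: int) -> bool:
        """Answer from the bitmap alone, without probing the backing file."""
        blk = offset // BLOCK_SIZE
        if blk >= self.nblocks:
            return False
        return bool(self.bits[blk // 8] & (1 << (blk % 8)))

    def to_xattr_value(self) -> bytes:
        """Bytes suitable for storing as an xattr value."""
        return bytes(self.bits)
```

The point of the design is that the answer to "is this block cached?" comes from state the cache itself wrote, so it does not change if the backing filesystem inserts or removes zero blocks behind its back.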
I've provided a read helper that handles ->readpage, ->readpages, and
preparatory writes in ->write_begin. Willy is looking at using this as a way
to roll his new ->readahead op out into filesystems. A good chunk of this
will move into MM code.
The code is simpler, and this is nice too:
67 files changed, 5947 insertions(+), 8294 deletions(-)
not including documentation changes, which I still need to convert to rst
format. That removes a whole bunch more lines.
But there are reasons you might not want to take it yet:
(1) It starts off by disabling fscache support in all the filesystems that
use it: afs, nfs, cifs, ceph and 9p. I've taken care of afs, Dave
Wysochanski has patches for nfs:
https://lore.kernel.org/linux-nfs/1596031949-26793-1-git-send-email-dwysocha@redhat.com/
but they haven't been reviewed by Trond or Anna yet, and Jeff Layton has
patches for ceph:
https://marc.info/?l=ceph-devel&m=159541538914631&w=2
and I've briefly discussed cifs with Steve, but nothing has started there
yet. 9p I've not looked at yet.
Now, if we're okay with going a kernel release with caching disabled in 4 of
the 5 filesystems and then pushing the changes for individual filesystems
through their respective trees, it might be easier.
Unfortunately, I wasn't able to get together with Trond and Anna at LSF
to discuss this.
(2) The patched afs fs passed xfstests -g quick (unlike the upstream code
that oopses pretty quickly with caching enabled). Dave and Jeff's nfs
and ceph code is getting close, but not quite there yet.
(3) Al has objections to the ITER_MAPPING iov_iter type that I added:
https://lore.kernel.org/linux-fsdevel/20200719014436.GG2786714@ZenIV.linux.org.uk/
but note that iov_iter_for_each_range() is not actually used by anything.
However, Willy likes it and would prefer to make it ITER_XARRAY instead
as he might be able to use it in other places, though there's an issue
where I'm calling find_get_pages_contig() which takes a mapping (though
all it does is then get the xarray out of it).
Instead I would have to use ITER_BVEC, which has quite a high overhead,
though it would mean that the RCU read lock wouldn't be necessary. This
would require 1K of memory for every 256K block the cache wants to read;
for any read >1M, I'd have to use vmalloc() instead.
I'd also prefer not to use ITER_BVEC because the offset and length are
superfluous here. If ITER_MAPPING is not good, would it be possible to
have an ITER_PAGEARRAY that just takes a page array instead? Or, even,
create a transient xarray?
(4) The way object culling is managed needs overhauling too, but that's a
whole 'nother patchset. We could wait till that's done too, but its lack
doesn't prevent what we have now being used.
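The ITER_BVEC memory figures quoted in (3) can be reproduced with a little arithmetic. The sketch below assumes 4 KiB pages and a 16-byte struct bio_vec (an 8-byte page pointer plus two 32-bit fields, as on typical 64-bit builds); neither number is stated in the mail itself, and the function name is made up for illustration.

```python
# Assumptions (not stated in the mail): 4 KiB pages and a 16-byte
# struct bio_vec, as on typical 64-bit builds.
PAGE_SIZE = 4096
BVEC_SIZE = 16


def bvec_table_bytes(read_bytes: int) -> int:
    """Bytes of bio_vec entries needed to describe a read, one entry per page."""
    npages = (read_bytes + PAGE_SIZE - 1) // PAGE_SIZE
    return npages * BVEC_SIZE


print(bvec_table_bytes(256 * 1024))   # 1024
print(bvec_table_bytes(1024 * 1024))  # 4096
```

Under those assumptions a 256K block needs 64 entries, i.e. 1K, and a 1M read needs exactly one page of entries. Anything larger makes the table exceed a single page, which is presumably where the vmalloc() threshold comes from.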
Thoughts?
David