From: David Howells <email@example.com>
To: firstname.lastname@example.org, email@example.com
Cc: firstname.lastname@example.org, email@example.com,
    firstname.lastname@example.org, Matthew Wilcox <email@example.com>,
    Dominique Martinet <firstname.lastname@example.org>,
    Steve French <email@example.com>,
    Trond Myklebust <firstname.lastname@example.org>,
    Anna Schumaker <email@example.com>,
    Christoph Hellwig <firstname.lastname@example.org>, email@example.com,
    Linus Torvalds <firstname.lastname@example.org>, email@example.com,
    firstname.lastname@example.org, email@example.com,
    firstname.lastname@example.org, email@example.com,
    firstname.lastname@example.org
Subject: Redesigning and modernising fscache
Date: Thu, 14 Jan 2021 10:45:06 +0000
Message-ID: <email@example.com>

Hi,

I've been working on modernising fscache, primarily with help from Jeff
Layton and Dave Wysochanski with porting Ceph and NFS to it, and with Willy
helpfully reinventing the VM I/O interface beneath us;-).

However, there've been some objections to the approach I've taken to
implementing this. The way I've done it is to disable the use of fscache by
the five network filesystems that use it, remove much of the old code, put
in the reimplementation, then cut the filesystems over. I.e.
rip-and-replace. It leaves unported filesystems unable to use it - but
three of the five are done (afs, ceph, nfs), and I've supplied
partially-done patches for the other two (9p, cifs).

It's been suggested that it's too hard to review this way and that either I
should go for a gradual phasing in or build the new one in parallel. The
first is difficult because I want to change how almost everything in there
works - but the parts are tied together; the second is difficult because
there are areas that would *have* to overlap (the UAPI device file, the
cache storage, the cache size limits and at least some state for managing
these), so there would have to be interaction between the two variants.
One refinement of the latter would be to make the two implementations
mutually exclusive: you can build one or the other, but not both.

However, given that I want to replace the on-disk format in cachefiles at
some point, and change what the userspace component does, maybe I should
create a new, separate UAPI interface and do the on-disk format change at
the same time. In which case, it makes sense to build a parallel variant.

Anyway, a bit of background into the why. There are a number of things
that need to be fixed in fscache/cachefiles:

 (1) The use of bmap to test whether the backing fs contains a cache block.
     This is not reliable in a modern extent-based filesystem as it can
     insert and remove bridging blocks of zeros at will. Having discussed
     this with Christoph Hellwig and Dave Chinner, I think that the cache
     really needs to keep track of this for itself.

 (2) The use of pagecache waitlist snooping to find out if a backing
     filesystem page has been updated yet. I have the feeling that this is
     not 100% reliable, judging by some hard-to-track-down bugs that seem
     to relate to it. I really would rather be told directly by the
     backing fs that the op was complete. Switching over to kiocbs means
     that can be done.

 (3) Having to go through the pagecache attached to the backing file,
     copying data from it or using vfs_write() to write into it. This
     doubles the amount of pagecache required and adds a bunch of copies
     for good measure.

     When I wrote the cachefiles caching backend, using direct I/O from
     within the kernel wasn't possible - but, now that kiocbs are
     available, I can actually do async DIO from the backing files to/from
     the netfs pages, cutting out copies in both directions, and using the
     kiocb completion function to tell me when it's done.

 (4) fscache's structs have a number of pointers back into the netfs, which
     makes it tricky if the netfs instance goes away whilst the cache is
     active. I really want no pointers back - apart from very transient
     I/O completion callbacks.
     I can store the metadata I need in the cookie.

Modernising this affords the opportunity to make huge simplifications in
the code (shaving off over 1000 lines, maybe as many as 3000).

One thing I've done is to make a helper library that handles a number of
features on behalf of a netfs if it wants to use the library:

 (*) Local caching.

 (*) Segmentation and shaping of read operations.

     This takes a ->readahead() request from the VM and translates it into
     one or more reads against the cache and the netfs, allowing both to
     adjust/expand the size of the individual subops according to internal
     alignments.

     Allowing the cache to expand a read request to put it on a larger
     granularity allows the cache to use less metadata to represent what it
     contains.

     It also provides a place to retry operations (something that's
     required if a read against the cache fails and we need to send it to
     the server instead).

 (*) Transparent huge pages (Willy).

 (*) A place to put fscrypt support (Jeff).

We have the first three working - with some limitations - for afs, nfs and
ceph, and I've produced partial patches for 9p and cifs. afs, nfs and ceph
are able to handle xfstests with a cache now - which is something that the
old fscache code will just explode with.

So, as stated, much of that code is written and working. However, if I do
a complete replacement all the way out to userspace, there are further
changes I'm thinking of making:

 (*) Get rid of the ability to remove a cache that's in use. This accounts
     for a *lot* of the complexity in fscache: all the synchronisation
     required to effect the removal of a live cache at any time whilst it's
     actually being used.

 (*) Change cachefiles so that it uses an index file and a single data file
     and performs culling by marking the index rather than deleting data
     files. Culling would then be moved into the kernel. cachefilesd is
     then unnecessary, except to load the config and keep the cache open.
     Moving the culling into an index would also make manual invalidation
     easier.

 (*) Rather than using cachefilesd to hold the cache open, do something
     akin to swapon/swapoff to add and remove the cache. Attempting to
     remove an in-use cache would either fail with EBUSY or mark the cache
     to be removed when it becomes unused and not allow further new users.

 (*) Declare the size of the cache up front rather than declaring that it
     has to maintain a certain amount of free space, reducing the cache to
     make more space if the level drops.

 (*) Merge cachefiles into fscache. Give up the ability to have alternate
     cache backends. That would allow a bit more reduction in the
     complexity and reduce the number of function pointers gone through.

David
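
For illustration only, the segmentation and shaping of read operations
described above, combined with the cache tracking its own contents instead
of asking bmap, might be modelled in userspace roughly as follows. This is
a sketch under stated assumptions, not code from the patches: the granule
size, the one-bit-per-granule map, and every name here (GRANULE_SIZE,
struct subreq, shape_read() and so on) are hypothetical, and real code
would also have to clamp to the file size and bound-check the map.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache granule size; the message does not state one. */
#define GRANULE_SIZE 262144ULL

/* Where a sub-read should be sent. */
enum subreq_source { READ_FROM_CACHE, READ_FROM_SERVER };

/* One slice of a shaped read request. */
struct subreq {
	enum subreq_source source;
	uint64_t start;		/* byte offset, granule-aligned */
	uint64_t len;		/* byte length, a multiple of GRANULE_SIZE */
};

/*
 * Content map kept by the cache itself: one bit per granule.  The caller
 * must supply a map large enough to cover every granule that is queried.
 */
static bool granule_present(const uint8_t *map, uint64_t granule)
{
	return (map[granule / 8] & (1u << (granule % 8))) != 0;
}

/*
 * Expand the byte range [start, start + len) outwards to granule
 * boundaries, then slice it into runs of granules that are either all in
 * the cache or all missing.  Adjacent granules with the same source are
 * merged into one sub-read.
 *
 * Returns the number of sub-reads written to out[], or -1 if out[] is too
 * small to hold them.
 */
int shape_read(const uint8_t *map, uint64_t start, uint64_t len,
	       struct subreq *out, size_t max_out)
{
	uint64_t first, last, g;
	size_t n = 0;

	if (len == 0)
		return 0;

	first = start / GRANULE_SIZE;
	last = (start + len - 1) / GRANULE_SIZE;

	for (g = first; g <= last; g++) {
		enum subreq_source src = granule_present(map, g) ?
			READ_FROM_CACHE : READ_FROM_SERVER;

		/* Extend the previous sub-read if the source is unchanged. */
		if (n > 0 && out[n - 1].source == src) {
			out[n - 1].len += GRANULE_SIZE;
			continue;
		}

		if (n == max_out)
			return -1;

		out[n].source = src;
		out[n].start = g * GRANULE_SIZE;
		out[n].len = GRANULE_SIZE;
		n++;
	}

	return (int)n;
}
```

With a map in which only granule 1 is marked present, a read that touches
granules 0 to 2 comes back as three sub-reads: server, cache, server. A
sub-read against the cache that fails could simply be reissued with its
source switched to the server, which is the retry point the library is
meant to provide.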