Re: The future of readahead

From: David Howells <dhowells@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: dhowells@redhat.com, linux-fsdevel@vger.kernel.org,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Mike Marshall <hubcap@omnibond.com>
Subject: Re: The future of readahead
Date: Thu, 27 Aug 2020 18:02:18 +0100
Message-ID: <1441311.1598547738@warthog.procyon.org.uk>
In-Reply-To: <20200826193116.GU17456@casper.infradead.org>

Matthew Wilcox <willy@infradead.org> wrote:

> So solving #2 and #3 looks like a new interface for filesystems to call:
> 
> void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);
> or possibly
> void readahead_expand(struct readahead_control *rac, pgoff_t start,
> 		unsigned int count);
> 
> It might not actually expand the readahead attempt at all -- for example,
> if there's already a page in the page cache, or if it can't allocate
> memory.  But this puts the responsibility for allocating pages in the VFS,
> where it belongs.

This is exactly what the fscache read helper in my fscache rewrite is doing,
except that I'm doing it in fs/fscache/read_helper.c.

Have a look here:

	https://lore.kernel.org/linux-fsdevel/159465810864.1376674.10267227421160756746.stgit@warthog.procyon.org.uk/

and look for the fscache_read_helper() function.

Note that it's slightly complicated because it handles ->readpage(),
->readpages() and ->write_begin()[*].

[*] I want to be able to bring the granule into the cache for modification.
    Ideally I'd be able to see that the entire granule is going to get written
    over and skip the read, kind of like write_begin for a whole granule
    rather than a page.

Shaping the readahead request has the following issues:

 (1) The request may span multiple granules.

 (2) Those granules may be a mixture of cached and uncached.

 (3) The granule size may vary.

 (4) Granules fall on power-of-2 boundaries (for example 256K boundaries)
     within the file, but the request may not start on a boundary and may not
     end on one.

To deal with this, fscache_read_helper() calls out to the cache backend
(fscache_shape_request()) and the netfs (req->ops->reshape()) to adjust the
read it's going to make.  Shaping the request may mean moving the start
earlier as well as expanding or contracting the size.  The only thing that's
guaranteed is that the first page of the request will be retained.
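
To illustrate the boundary handling in (4), here is a minimal userspace sketch
of rounding a request out to granule boundaries.  It is not the code from the
patch series: the names (struct page_range, shape_to_granules()) are made up
for the example, a fixed granule size is assumed, and it only shows the
expanding case, not contraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pgoff_t. */
typedef uint64_t pgoff_t;

/* Hypothetical description of a run of pages in a file. */
struct page_range {
	pgoff_t start;	/* index of the first page */
	pgoff_t count;	/* number of pages */
};

/*
 * Round a proposed page range outwards to granule boundaries.
 * granule_pages must be a power of two (e.g. 64 for 256K granules made of
 * 4K pages).  The start can only move down and the end can only move up,
 * so every page of the original request, including the first, is retained.
 */
static struct page_range shape_to_granules(struct page_range req,
					   pgoff_t granule_pages)
{
	struct page_range shaped;
	pgoff_t mask = granule_pages - 1;
	pgoff_t end = req.start + req.count;	/* exclusive end */

	assert(granule_pages != 0 && (granule_pages & mask) == 0);
	assert(req.count > 0);

	shaped.start = req.start & ~mask;	/* round start down */
	end = (end + mask) & ~mask;		/* round end up */
	shaped.count = end - shaped.start;
	return shaped;
}
```

With 64-page granules, a request for pages 70-79 becomes pages 64-127, and a
request for pages 60-69, which straddles a boundary, becomes pages 0-127.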

I also don't let a request cross a cached/uncached boundary, but rather cut
the request off there and return.  The filesystem can then generate a new
request and call back in.  (Note that I have to be able to keep track of the
filesystem's metadata so that I can reissue the request to the netfs in the
event that the cache suffers some sort of error.)
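
The cut-off behaviour can be sketched roughly as follows.  This is a userspace
illustration under assumptions, not the helper's real logic: the cache state
is modelled as a plain array of booleans, one per granule, and both function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * cached[i] says whether granule i of the request is present in the cache
 * (a made-up representation for this sketch).  Return how many granules,
 * counting from the first, share the cached/uncached state of the first
 * one.  The request is cut off there.
 */
static size_t cut_at_cache_boundary(const bool *cached, size_t nr_granules)
{
	size_t n = 1;

	assert(nr_granules > 0);
	while (n < nr_granules && cached[n] == cached[0])
		n++;
	return n;
}

/*
 * Model the filesystem calling back in with a new request after each cut,
 * and count how many requests it takes to cover the whole range.
 */
static size_t count_subrequests(const bool *cached, size_t nr_granules)
{
	size_t done = 0, nr = 0;

	while (done < nr_granules) {
		done += cut_at_cache_boundary(cached + done,
					      nr_granules - done);
		nr++;
	}
	return nr;
}
```

For a range whose granules are cached, cached, uncached, uncached, uncached,
cached, the first request is cut after two granules and three requests are
needed in total.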

What I was originally envisioning for the new ->readahead() interface was to
add a second aop that lets the VM access the shaping before it has started
pinning any pages.

The shaping parameters I think we need are:

	- The inode, for i_size and fscache cookie
	- The proposed page range

and what you would get back could be:

	- Shaped page range
	- Minimum I/O granularity[1]
	- Minimum preferred granularity[2]
	- Flag indicating if the pages can just be zero-filled[3]

[1] The filesystem doesn't want to read in smaller chunks than this.

[2] The cache doesn't want to read in smaller chunks than this, though in the
    cache's case, a partially read block is just abandoned for the moment.
    This number would allow the readahead algorithm to shorten the request if
    it can't allocate a page.

[3] If I know that the local i_size is much bigger than the i_size on the
    server, there's no need to download/read those pages and readahead can
    just clear them.  This is more applicable to write_begin() normally.
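
As a rough sketch of how those parameters might be laid out, and of how [2]
could be used to shorten a request, consider the following.  Every name here
is hypothetical, the fscache cookie is left out, and trimming to whole
multiples of the preferred granularity is only one possible reading of [2].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pgoff_t. */
typedef uint64_t pgoff_t;

/* What the VM would pass in (the fscache cookie is omitted here). */
struct ra_shape_in {
	uint64_t i_size;	/* file size in bytes, from the inode */
	pgoff_t start;		/* proposed first page */
	pgoff_t count;		/* proposed number of pages */
};

/* What the filesystem and cache would hand back. */
struct ra_shape_out {
	pgoff_t start;			/* shaped first page */
	pgoff_t count;			/* shaped number of pages */
	unsigned int min_io_pages;	/* [1] minimum I/O granularity */
	unsigned int min_cache_pages;	/* [2] minimum preferred granularity */
	bool zero_fill;			/* [3] pages can just be cleared */
};

/*
 * If only 'allocated' contiguous pages could be obtained starting at
 * shape->start, trim the request to a whole number of preferred-granularity
 * units so that no partially filled unit is left behind.
 */
static pgoff_t shorten_request(const struct ra_shape_out *shape,
			       pgoff_t allocated)
{
	pgoff_t unit = shape->min_cache_pages;

	assert(unit > 0);
	if (allocated >= shape->count)
		return shape->count;
	return allocated - (allocated % unit);
}
```

For a shaped request of 128 pages with a preferred granularity of 64 pages, an
allocation failure after 100 pages would shorten the request to 64 pages.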

Now a chunk of this is in struct readahead_control, so it might be reasonable
to add the other bits there too.

Note that one thing I really would like to avoid having to do is to expand a
request forward, particularly if the main page of interest is precreated and
locked by the VM before calling the filesystem.  I would much rather the VM
created the pages, starting from the lowest-numbered.

Anyway, that's my 2p.
David