From: Matthew Wilcox <willy@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Kent Overstreet <kent.overstreet@gmail.com>,
	David Howells <dhowells@redhat.com>,
	Mike Marshall <hubcap@omnibond.com>
Subject: The future of readahead
Date: Wed, 26 Aug 2020 20:31:16 +0100
Message-ID: <20200826193116.GU17456@casper.infradead.org>

Both Kent and David have talked to me this past week about improving the
filesystem interface to readahead, and since I don't have time to write
the code, here's the design.

1. Kent doesn't like it that we do an XArray lookup for each page.
The proposed solution adds a (small) array of page pointers (or a
pagevec) to the struct readahead_control.  It may make sense to move
__readahead_batch() and readahead_page() out of line at that point.
This should be backed up with performance numbers.
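
To make #1 concrete, here's roughly the shape I have in mind.  This is a
sketch, not a patch: the struct and function names below are made up, and
the real version would modify struct readahead_control itself and move
the helpers out of line.

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/xarray.h>

struct readahead_control_batched {
	struct address_space *mapping;
	pgoff_t _index;			/* next page to hand out */
	unsigned int _nr_pages;		/* pages left in this request */
	struct pagevec _pvec;		/* cached page pointers */
	unsigned int _pvec_pos;		/* next slot of _pvec to consume */
};

/* One walk of the XArray refills the whole batch. */
static void readahead_refill_batch(struct readahead_control_batched *rac)
{
	XA_STATE(xas, &rac->mapping->i_pages, rac->_index);
	struct page *page;
	unsigned int i = 0;

	/* Pages in the readahead window are already present and locked. */
	rcu_read_lock();
	xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) {
		rac->_pvec.pages[i++] = page;
		if (i == PAGEVEC_SIZE)
			break;
	}
	rcu_read_unlock();

	rac->_pvec.nr = i;
	rac->_pvec_pos = 0;
}

/* Drop-in for readahead_page(): pop from the batch, refill when empty.
 * Assumes the control structure was zero-initialised before first use. */
struct page *readahead_page_batched(struct readahead_control_batched *rac)
{
	struct page *page;

	if (!rac->_nr_pages)
		return NULL;
	if (rac->_pvec_pos >= rac->_pvec.nr)
		readahead_refill_batch(rac);

	page = rac->_pvec.pages[rac->_pvec_pos++];
	rac->_index++;
	rac->_nr_pages--;
	return page;
}

The point is that this turns one XArray walk per page into one walk per
PAGEVEC_SIZE pages.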

2. David wants to be sure that readahead is aligned to a granule
size (eg 256kB) to support fscache.  When we last talked about it,
I suggested encoding the granule size in the struct address_space.
I no longer think this approach should be pursued, because #3 below
leads to an interface that covers this case as well.

3. Kent wants to be able to expand readahead to encompass an entire fs
extent (if, eg, that extent is compressed or encrypted).  We don't know
the extent boundaries at the right point; the filesystem can't pass that
information through the generic_file_buffered_read() or filemap_fault()
interfaces to the readahead code.  So the right approach here is for the
filesystem to ask the readahead code to expand the readahead batch.

So solving #2 and #3 looks like a new interface for filesystems to call:

void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);
or possibly
void readahead_expand(struct readahead_control *rac, pgoff_t start,
		unsigned int count);

It might not actually expand the readahead attempt at all -- for example,
if there's already a page in the page cache, or if it can't allocate
memory.  But this puts the responsibility for allocating pages in the VFS,
where it belongs.
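
To show how a filesystem would use it, here's a sketch of a ->readahead()
method rounding the window out to a 256kB granule.  readahead_expand() is
only the interface proposed above, and "myfs" and GRANULE_SIZE are
made-up names:

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>

/* The interface proposed above; it doesn't exist in any tree yet. */
void readahead_expand(struct readahead_control *rac, loff_t start, u64 len);

#define GRANULE_SIZE	(256 * 1024)	/* eg an fscache granule */

static void myfs_readahead(struct readahead_control *rac)
{
	loff_t start = readahead_pos(rac);
	loff_t len = readahead_length(rac);
	loff_t new_start = round_down(start, GRANULE_SIZE);
	loff_t new_len = round_up(start + len, GRANULE_SIZE) - new_start;
	struct page *page;

	/*
	 * Ask the VFS to widen the window to granule boundaries.  It may
	 * decline (a page is already cached, or allocation fails), so just
	 * walk whatever window we end up with.
	 */
	readahead_expand(rac, new_start, new_len);

	while ((page = readahead_page(rac))) {
		/* queue 'page' for read I/O; completion must unlock it */
	}
}

The same call works for #3; the filesystem would look up its extent and
pass the extent boundaries instead of granule-aligned ones.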

4. Mike wants to be able to do 4MB I/Os [1].  That should be covered by
the solution above.  Mike, just to clarify: do you need 4MB pages, or
can you work with some mixture of page sizes, up to 1024 x 4kB pages?
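
To illustrate what I mean by a mixture: readahead_page() hands back
whatever the page cache allocated, so the per-page loop doesn't care
whether a 4MB window is one huge page or 1024 x 4kB pages.  Sketch only;
"myfs" is made up:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/printk.h>

static void myfs_readahead_mixed(struct readahead_control *rac)
{
	struct page *page;
	size_t total = 0;

	while ((page = readahead_page(rac))) {
		total += page_size(page);	/* whatever size this page is */
		/* queue 'page' for read I/O; completion must unlock it */
	}

	pr_debug("readahead window was %zu bytes\n", total);
}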

5. I'm allocating larger pages in the readahead code as part of the THP
patch set [2].

[1] https://lore.kernel.org/linux-fsdevel/CAOg9mSSrJp2dqQTNDgucLoeQcE_E_aYPxnRe5xphhdSPYw7QtQ@mail.gmail.com/
[2] http://git.infradead.org/users/willy/pagecache.git/commitdiff/c00bd4082c7bc32a17b0baa29af6974286978e1f
