On Feb 9, 2017, at 4:34 PM, Matthew Wilcox wrote:
>
> On Thu, Jan 26, 2017 at 02:57:53PM +0300, Kirill A. Shutemov wrote:
>> Most page cache allocation happens via readahead (sync or async), so if
>> we want to have a significant number of huge pages in the page cache we
>> need to find a way to allocate them from readahead.
>>
>> Unfortunately, huge pages don't fit into the current readahead design:
>> the 128 max readahead window, assumptions about page size, and
>> PageReadahead() to track hit/miss.
>>
>> I haven't found a way to get it right yet.
>>
>> This patch just allocates a huge page if allowed, but doesn't really
>> provide any readahead if a huge page is allocated. We read out 2M at a
>> time and I would expect spikes in latency without readahead.
>>
>> Therefore HACK.
>>
>> Having said that, I don't think it should prevent huge page support from
>> being applied. The future will show whether lacking readahead is a big
>> deal with huge pages in the page cache.
>>
>> Any suggestions are welcome.
>
> Well ... what if we made readahead 2 hugepages in size for inodes which
> are using huge pages? That's only 8x our current readahead window, and
> if you're asking for hugepages, you're accepting that IOs are going to
> be larger, and you probably have the kind of storage system which can
> handle doing larger IOs.

It would be nice if the bdi had a parameter for the maximum readahead
size. Currently, readahead is capped at 2MB chunks by
force_page_cache_readahead() even if bdi->ra_pages and bdi->io_pages are
much larger.

It should be up to the filesystem to decide how large the readahead
chunks are, rather than imposing some policy in the MM code. For
high-speed (network) storage access it is better to have at least 4MB
read chunks; for RAID storage it is desirable to have stripe-aligned
readahead to avoid read inflation when verifying the parity. Any fixed
size will eventually be inadequate as disks and filesystems change, so
it may as well be a per-bdi tunable that can be set by the filesystem as
needed, or possibly via a mount option.

Cheers, Andreas
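
As a concrete illustration of the bdi knobs discussed above, here is a
minimal sketch (not an existing kernel patch) of a filesystem raising its
readahead window at mount time. "example_fs" and the 4MB figure are made
up for the example; only bdi->ra_pages is touched, since bdi->io_pages is
normally derived from the block layer's request queue limits.

#include <linux/fs.h>
#include <linux/backing-dev.h>

/*
 * Illustrative sketch only: a filesystem raising its default readahead
 * window at mount time.  "example_fs" and the 4MB size are hypothetical.
 */
static void example_fs_tune_readahead(struct super_block *sb)
{
	struct backing_dev_info *bdi = sb->s_bdi;

	/* advertise a 4MB readahead window instead of the global default */
	bdi->ra_pages = (4 * 1024 * 1024) >> PAGE_SHIFT;

	/*
	 * Note: neither ra_pages nor io_pages lifts the hard-coded 2MB
	 * chunking inside force_page_cache_readahead(); that is the gap a
	 * per-bdi "maximum readahead chunk" tunable would close.
	 */
}

Even with a large ra_pages, forced readahead is still issued in 2MB
pieces, which is why a per-bdi maximum (or a mount option feeding one)
is being suggested above.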