From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>, Matthew Wilcox <willy@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH v1] mm/filemap: Allow arch to request folio size for exec memory
Date: Mon, 15 Jan 2024 09:33:35 +0000
Message-ID: <398fdb16-b8c5-4d02-bb5d-d4c9b8f9bf89@arm.com>
In-Reply-To: <CAGsJ_4wZzjprAs42LMw8s8C_iz4v7m6fiO7-7nBS2BxkU9u8QA@mail.gmail.com>

On 13/01/2024 00:11, Barry Song wrote:
> On Sat, Jan 13, 2024 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, Jan 13, 2024 at 11:54:23AM +1300, Barry Song wrote:
>>>>> Perhaps an alternative would be to double ra->size and set ra->async_size to
>>>>> (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
>>>>> give us an async portion so readahead can still happen.
>>>>
>>>> this might be worth trying, as the PMD path does exactly this: the async
>>>> portion can decrease the latency of subsequent page faults.
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>         /* Use the readahead code, even if readahead is disabled */
>>>>         if (vm_flags & VM_HUGEPAGE) {
>>>>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>>                 ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>>>>                 ra->size = HPAGE_PMD_NR;
>>>>                 /*
>>>>                  * Fetch two PMD folios, so we get the chance to actually
>>>>                  * readahead, unless we've been told not to.
>>>>                  */
>>>>                 if (!(vm_flags & VM_RAND_READ))
>>>>                         ra->size *= 2;
>>>>                 ra->async_size = HPAGE_PMD_NR;
>>>>                 page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>>>                 return fpin;
>>>>         }
>>>> #endif
>>>>
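(For concreteness, the 64K variant of the suggestion quoted above would look
roughly like the below. Just a sketch: order 4 is hard-coded here, where the
real patch would get the order from a per-arch hook.)

        /*
         * Sketch: mirror the PMD special case, but for 64K blocks.
         * Order-4 == 64K with 4K base pages; in practice an arch hook
         * would provide this rather than a hard-coded constant.
         */
        if (vm_flags & VM_EXEC) {
                const unsigned int order = 4;
                const unsigned long nr = 1UL << order;

                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                ractl._index &= ~(nr - 1);
                /* Double the size so readahead gets an async portion. */
                ra->size = nr * 2;
                ra->async_size = nr;
                page_cache_ra_order(&ractl, ra, order);
                return fpin;
        }
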
>>>
>>> BTW, rather than simply always reading backwards, we did something very
>>> "ugly" to simulate "read-around" for CONT-PTE exec before[1]:
>>>
>>> if a page fault happens in the first half of a CONT-PTE block, we read
>>> that 64KiB and the previous 64KiB; otherwise, we read it and the next 64KiB.

I actually tried something very similar to this while prototyping. I found it
was about 10% less effective at getting text into 64K folios than the approach
I posted. I didn't investigate why, as I concluded that text is unlikely to
benefit from readahead anyway.
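(For reference, what I tried was roughly along the lines of this sketch of
Barry's scheme; the details differed, and the 64K block size is hard-coded
here where it would really come from the arch:)

        /*
         * Sketch: read-around in 64K blocks for exec mappings. A fault in
         * the first half of a block also reads the previous block;
         * otherwise read this block plus the next one.
         */
        if (vm_flags & VM_EXEC) {
                const unsigned long nr = 16;    /* 64K in 4K pages */
                unsigned long block = ractl._index & ~(nr - 1);

                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                if (ractl._index - block < nr / 2 && block >= nr)
                        ractl._index = block - nr;  /* previous + this block */
                else
                        ractl._index = block;       /* this block + next */
                ra->size = nr * 2;
                ra->async_size = 0;
                page_cache_ra_order(&ractl, ra, 4); /* order-4 == 64K */
                return fpin;
        }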

>>
>> I don't think that makes sense.  The CPU executes instructions forwards,
>> not "around".  I honestly think we should treat text as "random access"
>> because function A calls function B and functions A and B might well be
>> very far apart from each other.  The only time I see you actually
>> getting "readahead" hits is if a function is split across two pages (for
>> whatever size of page), but that's a false hit!  The function is not,
>> generally, 64kB long, so doing readahead is no more likely to bring in
>> the next page of text that we want than reading any other random page.
>>
> 
> it seems you are in favor of Ryan's modification even for filesystems
> which don't support large mappings?
> 
>> Unless somebody finds the GNU Rope source code from 1998, or recreates it:
>> https://lwn.net/1998/1029/als/rope.html
>> Then we might actually have some locality.
>>
>> Did you actually benchmark what you did?  Is there really some locality
>> between the code at offset 256-288kB in the file and then in the range
>> 192kB-256kB?
> 
> I really didn't have benchmark data; at that point I instinctively didn't
> want to break the read-around logic, so I wrote the code that way.
> The info you provided makes me re-think whether the read-around code is
> necessary, thanks!

As a quick experiment, I modified my thpmaps script to collect data *only* for
executable mappings. This is run *without* my change:

| File-backed exec folios |    Speedometer | Kernel Compile |
|=========================|================|================|
|file-thp-aligned-16kB    |            56% |            46% |
|file-thp-aligned-32kB    |             2% |             3% |
|file-thp-aligned-64kB    |             4% |             5% |
|file-thp-unaligned-16kB  |             0% |             3% |
|file-thp-unaligned-128kB |             2% |             0% |
|file-thp-partial         |             0% |             0% |

It's implied that the rest of the memory (up to 100%) is in small (single-page)
folios. I think the only reason we would see small folios is if we would
otherwise run off the end of the file?

If so, then I think any text in folios > 16K is a rough proxy for how effective
readahead is for text: not very.

Intuitively, I agree with Matthew that readahead doesn't make much sense for
text, and this rough data seems to agree.


> 
> I was using filesystems without large-mapping support but worked around
> the problem by:
> 1. preparing 16*n normal pages
> 2. inserting the normal pages into the xarray
> 3. letting the filesystem read 16 normal pages
> 4. after all I/O completed, transforming the 16 pages into an mTHP and
> reinserting the mTHP into the xarray

I had a go at something like this too, but was doing it in the dynamic loader,
having it call MADV_COLLAPSE to generate PMD-sized THPs for the text. I
actually found this to be even faster for the use cases I was measuring. But of
course it uses more memory due to the 2M size, and I expect it slows down app
load time because it potentially reads in a lot more text than is actually
faulted. Ultimately I think the better strategy is to make the filesystems
large-folio capable.
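
(The loader-side part of that experiment was tiny; roughly this sketch, with
the text segment's address and length coming from wherever the loader mapped
it. MADV_COLLAPSE needs Linux 6.1+ and is best-effort:)

        #include <sys/mman.h>

        /*
         * Sketch: after mapping a DSO's text segment, ask the kernel to
         * collapse it into PMD-sized THPs. The kernel only collapses
         * PMD-sized blocks it can fully cover, and failure just leaves
         * the mapping on small pages, so errors are ignored.
         */
        static void collapse_text(void *text, size_t len)
        {
                (void)madvise(text, len, MADV_COLLAPSE);
        }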

> 
> that was very painful and in the end made no improvement, probably due to
> various sync overheads. So I gave up and didn't dig into more data.
> 
> Thanks
> Barry



