From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 12:43:25 +0100
Message-ID: <13969045-4e47-ae5d-73f4-dad40fe631be@arm.com>
In-Reply-To: <faa31f50-4d2a-c71f-945f-398789cbbf66@redhat.com>

On 17/04/2023 11:54, David Hildenbrand wrote:
> On 14.04.23 15:02, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable-order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable-order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>>   - Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> it's measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2]
>> https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3]
>> https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4]
>> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>   - write fault on unallocated address: do_anonymous_page()
>>   - write fault on zero page: wp_page_copy()
>>   - write fault on non-exclusive CoW page: wp_page_copy()
>>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred folio order to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
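>>
>> As a rough sketch of the order calculation for the first 2 cases, the
>> shape is something like this (illustrative only - the helper name is made
>> up and the neighbouring-PTE checks are omitted):
>>
>>     /*
>>      * Sketch: pick the largest order whose naturally aligned range
>>      * around the fault address fits inside the VMA.
>>      */
>>     static int calc_anon_folio_order(struct vm_fault *vmf, int max_order)
>>     {
>>         struct vm_area_struct *vma = vmf->vma;
>>         int order;
>>
>>         for (order = max_order; order > 0; order--) {
>>             unsigned long size = PAGE_SIZE << order;
>>             unsigned long start = ALIGN_DOWN(vmf->address, size);
>>
>>             if (start >= vma->vm_start && start + size <= vma->vm_end)
>>                 break;
>>         }
>>
>>         return order; /* order-0 means fall back to a base page */
>>     }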
> 
> Just a note (that you maybe already know) that we have to be a bit careful in
> the wp_copy path with replacing sub-pages that are marked exclusive.

Ahh, no, I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.

(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)

My current patch always prefers reuse over copy, and expands the reuse to the
biggest set of pages that are all exclusive (determined either from the
presence of the anon_exclusive flag or from the refcount) and covered by the
same folio (plus a few other bounds constraints - see
calc_anon_folio_range_reuse()).

If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? If so, this could be
fixed by adding that check to the function.
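
Concretely, I imagine the fix looking something like the fragment below
inside the scan, run under the PT lock (hypothetical - the real shape of
calc_anon_folio_order_copy() may differ, and vma, addr, pte, nr and i are
assumed from the surrounding code):

    /*
     * Stop expanding the copy region as soon as a sub-page is still
     * mapped exclusive: an exclusive page may be pinned concurrently,
     * so it must not be replaced by a copy.
     */
    for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
        struct page *page = vm_normal_page(vma, addr, ptep_get(pte + i));

        if (!page || PageAnonExclusive(page))
            break; /* the copy region ends here */
    }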

> 
> Currently, we always only replace a single shared anon (sub)page by a fresh
> exclusive base-page during a write-fault/unsharing. As the sub-page is already
> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.

When you say "maybe shared", is that determined by the absence of the
"exclusive" flag?

> 
> If you now decide to replace more subpages, you have to be careful that none of
> them are still exclusive -- because they could get pinned concurrently and
> replacing them would result in memory corruptions.
> 
> There are scenarios (most prominently MADV_WIPEONFORK, but also a failed
> partial fork()) that could result in something like that.

Are there any test cases that stress the kernel in this area that I could use to
validate my fix?

> 
> Further, we have to be a bit careful regarding replacing ranges that are backed
> by different anon pages (for example, due to fork() deciding to copy some
> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and, as part of
that scan, make sure that I only cover a single folio. So I think I conform
here - the scan would give up once it gets to a hole.
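
In code terms, the guard in that scan is conceptually (hypothetical helper,
not the literal patch code):

    /*
     * A PTE is only included in the scan window if it maps a page of
     * the same folio; a hole or a different folio stops the expansion.
     */
    static bool pte_maps_folio(struct vm_area_struct *vma, unsigned long addr,
                               pte_t pte, struct folio *folio)
    {
        struct page *page;

        if (!pte_present(pte))
            return false; /* hole */

        page = vm_normal_page(vma, addr, pte);
        return page && page_folio(page) == folio;
    }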

> 
> 
> So what should be safe is replacing all sub-pages of a folio that are marked
> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
> worth the complexity. For THP we have so far been happy to *not* optimize
> this, implying that maybe we shouldn't worry about optimizing the fork() case
> that heavily for now.

I don't have the exact numbers to hand, but I'm pretty sure I remember that
enabling large copies contributed a measurable amount to the performance
improvement. (Certainly the zero-page copy case is a big contributor.) I don't
have access to the HW at the moment but can rerun later with and without to
double check.

> 
> 
> One optimization one could think of instead (that I raised previously in another
> context) is the detection of exclusivity after fork()+exit in the child (IOW,
> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
> Scanning the surrounding page table, if it makes sense (e.g., page_count() <=
> folio_nr_pages()), to test whether all page references are from the current
> process, would allow reusing the folio (setting PG_anon_exclusive on the
> sub-pages). The smaller the folio order, the cheaper this surrounding-PTE
> scan is.
> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
> optimization.

Yes, I have already implemented this in my series; see patch 10.
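
For reference, the check has roughly the following shape (a simplified
sketch, not the literal patch 10 code; it is deliberately conservative, e.g.
a transient swap cache reference would make it decline the reuse):

    /*
     * Under the PT lock: decide whether every reference to a PTE-mapped
     * anon folio comes from this process's own mappings, in which case
     * the sub-pages can be marked exclusive again and reused.
     */
    static bool folio_all_refs_local(struct folio *folio, pte_t *first_pte,
                                     unsigned long nr)
    {
        unsigned long i, mapped = 0;

        for (i = 0; i < nr; i++) {
            pte_t pte = ptep_get(first_pte + i);

            if (pte_present(pte) &&
                pte_pfn(pte) >= folio_pfn(folio) &&
                pte_pfn(pte) < folio_pfn(folio) + folio_nr_pages(folio))
                mapped++;
        }

        return folio_ref_count(folio) == mapped;
    }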

Thanks,
Ryan



Thread overview: 44+ messages
2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-04-17  8:49   ` Yin, Fengwei
2023-04-17 10:11     ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
2023-04-14 14:09   ` Kirill A. Shutemov
2023-04-14 14:38     ` Ryan Roberts
2023-04-14 15:37       ` Kirill A. Shutemov
2023-04-14 16:06         ` Ryan Roberts
2023-04-14 16:18           ` Matthew Wilcox
2023-04-14 16:31             ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
2023-04-17  8:25   ` Yin, Fengwei
2023-04-17 10:13     ` Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
2023-04-17 10:19   ` Ryan Roberts
2023-04-17  8:19 ` Yin, Fengwei
2023-04-17 10:28   ` Ryan Roberts
2023-04-17 10:54 ` David Hildenbrand
2023-04-17 11:43   ` Ryan Roberts [this message]
2023-04-17 14:05     ` David Hildenbrand
2023-04-17 15:38       ` Ryan Roberts
2023-04-17 15:44         ` David Hildenbrand
2023-04-17 16:15           ` Ryan Roberts
2023-04-26 10:41           ` Ryan Roberts
2023-05-17 13:58             ` David Hildenbrand
2023-05-18 11:23               ` Ryan Roberts
2023-04-19 10:12       ` Ryan Roberts
2023-04-19 10:51         ` David Hildenbrand
2023-04-19 11:13           ` Ryan Roberts
