Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

From: David Hildenbrand <david@redhat.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 12:54:35 +0200
Message-ID: <faa31f50-4d2a-c71f-945f-398789cbbf66@redhat.com>
In-Reply-To: <20230414130303.2345383-1-ryan.roberts@arm.com>

On 14.04.23 15:02, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
>   - Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> it's measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>   - write fault on unallocated address: do_anonymous_page()
>   - write fault on zero page: wp_page_copy()
>   - write fault on non-exclusive CoW page: wp_page_copy()
>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.

Just a note (that you may already know): we have to be a bit careful
in the wp_page_copy() path when replacing sub-pages that are marked
exclusive.

Currently, we only ever replace a single shared anon (sub)page by a
fresh exclusive base page during a write fault/unsharing. As the
sub-page is already marked "maybe shared", it cannot get pinned
concurrently and everybody is happy.

If you now decide to replace more sub-pages, you have to be careful that
none of them are still exclusive -- because they could get pinned
concurrently and replacing them would result in memory corruption.

There are scenarios (most prominently MADV_WIPEONFORK, but also a failed
partial fork()) that could result in something like that.

Further, we have to be a bit careful regarding replacing ranges that are 
backed by different anon pages (for example, due to fork() deciding to 
copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

So what should be safe is replacing all sub-pages of a folio that are 
marked "maybe shared" by a new folio under PT lock. However, I wonder if 
it's really worth the complexity. For THP we were happy so far to *not*
optimize this, implying that maybe we shouldn't worry about optimizing
the fork() case that heavily for now.
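
To make the constraints above concrete, here is a rough userspace model
of the check (not kernel code, just a sketch). struct model_pte and
can_replace_range() are made-up names for illustration; in the kernel,
the "exclusive" marker is PG_anon_exclusive, and the whole check plus
the replacement would have to happen with the PT lock held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one PTE; not a kernel structure. */
struct model_pte {
	bool present;        /* the PTE maps a page */
	int folio_id;        /* identifies the folio backing this PTE */
	bool anon_exclusive; /* models PG_anon_exclusive on the sub-page */
};

/*
 * Return true if every PTE in [start, start + nr) may be replaced by a
 * fresh folio. All of them must be present, backed by the same folio as
 * the faulting PTE, and marked "maybe shared" (i.e. not exclusive).
 * The caller is assumed to hold the PT lock across check and replace.
 */
static bool can_replace_range(const struct model_pte *ptes, size_t start,
			      size_t nr, size_t fault_idx)
{
	size_t i;
	int folio_id;

	assert(ptes != NULL);

	/* The faulting PTE must lie inside the candidate range. */
	if (nr == 0 || fault_idx < start || fault_idx >= start + nr)
		return false;

	folio_id = ptes[fault_idx].folio_id;

	for (i = start; i < start + nr; i++) {
		if (!ptes[i].present)
			return false;
		/* Range backed by a different anon page (e.g. fork() copied it). */
		if (ptes[i].folio_id != folio_id)
			return false;
		/* Exclusive sub-pages could get pinned concurrently. */
		if (ptes[i].anon_exclusive)
			return false;
	}
	return true;
}
```

If the check fails for the full range, the simple fallback in this model
is what we do today: replace only the single faulting sub-page.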

One optimization one could think of instead (that I raised previously
in another context) is the detection of exclusivity after fork()+exit in
the child (IOW, only the parent continues to exist). Once
PG_anon_exclusive was cleared for all sub-pages of the THP-mapped folio
during fork(), we'd always decide to copy instead of reuse (because
page_count() > 1, as the folio is PTE mapped). Scanning the surrounding
page table, if it makes sense (e.g., page_count() <= folio_nr_pages()),
to test if all page references are from the current process would allow
for reusing the folio (setting PG_anon_exclusive) for the sub-pages. The
smaller the folio order, the cheaper this "scan surrounding PTEs" step
is. For THP, which are usually PMD-mapped even after fork()+exit, we
didn't add this optimization.
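
A rough userspace sketch of that heuristic, in case it helps the
discussion. Everything here is a simplified model with made-up names
(struct model_folio, can_reuse_folio()), not kernel code. It assumes
each PTE mapping holds exactly one folio reference and that there are no
other transient references (swapcache, GUP pins, ...), which the real
code would have to deal with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not kernel structures. */
struct model_folio {
	int id;
	unsigned int nr_pages; /* models folio_nr_pages() */
	unsigned int refcount; /* models page_count() of the folio */
};

struct model_pte {
	bool present;
	int folio_id;
};

/*
 * Decide whether the folio could be reused (and its sub-pages marked
 * exclusive again) instead of copied.
 *
 * Cheap pre-check: with more references than sub-pages, at least one
 * reference cannot come from this page table, so skip the scan.
 * Otherwise, count the PTEs in the scanned window that map the folio;
 * if they account for every reference, all references are ours.
 */
static bool can_reuse_folio(const struct model_folio *folio,
			    const struct model_pte *ptes, size_t nr_ptes)
{
	unsigned int mapped = 0;
	size_t i;

	assert(folio != NULL && ptes != NULL);

	if (folio->refcount > folio->nr_pages)
		return false;

	for (i = 0; i < nr_ptes; i++) {
		if (ptes[i].present && ptes[i].folio_id == folio->id)
			mapped++;
	}

	return mapped == folio->refcount;
}
```

The scan window is bounded by the folio size, so for an order-4 folio
this model looks at 16 PTEs at most.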

-- 
Thanks,

David / dhildenb