linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 12:54:35 +0200	[thread overview]
Message-ID: <faa31f50-4d2a-c71f-945f-398789cbbf66@redhat.com> (raw)
In-Reply-To: <20230414130303.2345383-1-ryan.roberts@arm.com>

On 14.04.23 15:02, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
>   - Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> its measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>   - write fault on unallocated address: do_anonymous_page()
>   - write fault on zero page: wp_page_copy()
>   - write fault on non-exclusive CoW page: wp_page_copy()
>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.

Just a note (that you maybe already know) that we have to be a bit 
careful in the wp_copy path with replacing sub-pages that are marked 
exclusive.

Currently, we always only replace a single shared anon (sub)page by a 
fresh exclusive base-page during a write-fault/unsharing. As the 
sub-page is already marked "maybe shared", it cannot get pinned 
concurrently and everybody is happy.

If you now decide to replace more subpages, you have to be careful that 
none of them are still exclusive -- because they could get pinned 
concurrently and replacing them would result in memory corruptions.

There are scenarios (most prominently: MADV_WIPEONFORK), but also failed 
partial fork() that could result in something like that.

Further, we have to be a bit careful regarding replacing ranges that are 
backed by different anon pages (for example, due to fork() deciding to 
copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages).


So what should be safe is replacing all sub-pages of a folio that are 
marked "maybe shared" by a new folio under PT lock. However, I wonder if 
it's really worth the complexity. For THP we were happy so far to *not* 
optimize this, implying that maybe we shouldn't worry about optimizing 
the fork() case for now that heavily.


One optimization once could think of instead (that I raised previously 
in other context) is the detection of exclusivity after fork()+exit in 
the child (IOW, only the parent continues to exist). Once 
PG_anon_exclusive was cleared for all sub-pages of the THP-mapped folio 
during fork(), we'd always decide to copy instead of reuse (because 
page_count() > 1, as the folio is PTE mapped). Scanning the surrounding 
page table if it makes sense (e.g., page_count() <= folio_nr_pages()), 
to test if all page references are from the current process would allow 
for reusing the folio (setting PG_anon_exclusive) for the sub-pages. The 
smaller the folio order, the cheaper this "scan surrounding PTEs" scan 
is. For THP, which are usually PMD-mapped even after fork()+exit, we 
didn't add this optimization.

-- 
Thanks,

David / dhildenb



  parent reply	other threads:[~2023-04-17 10:54 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-04-17  8:49   ` Yin, Fengwei
2023-04-17 10:11     ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
2023-04-14 14:09   ` Kirill A. Shutemov
2023-04-14 14:38     ` Ryan Roberts
2023-04-14 15:37       ` Kirill A. Shutemov
2023-04-14 16:06         ` Ryan Roberts
2023-04-14 16:18           ` Matthew Wilcox
2023-04-14 16:31             ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
2023-04-17  8:25   ` Yin, Fengwei
2023-04-17 10:13     ` Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
2023-04-17 10:19   ` Ryan Roberts
2023-04-17  8:19 ` Yin, Fengwei
2023-04-17 10:28   ` Ryan Roberts
2023-04-17 10:54 ` David Hildenbrand [this message]
2023-04-17 11:43   ` Ryan Roberts
2023-04-17 14:05     ` David Hildenbrand
2023-04-17 15:38       ` Ryan Roberts
2023-04-17 15:44         ` David Hildenbrand
2023-04-17 16:15           ` Ryan Roberts
2023-04-26 10:41           ` Ryan Roberts
2023-05-17 13:58             ` David Hildenbrand
2023-05-18 11:23               ` Ryan Roberts
2023-04-19 10:12       ` Ryan Roberts
2023-04-19 10:51         ` David Hildenbrand
2023-04-19 11:13           ` Ryan Roberts

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=faa31f50-4d2a-c71f-945f-398789cbbf66@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=fengwei.yin@intel.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).