From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 12:43:25 +0100
Message-ID: <13969045-4e47-ae5d-73f4-dad40fe631be@arm.com>
In-Reply-To: <faa31f50-4d2a-c71f-945f-398789cbbf66@redhat.com>

On 17/04/2023 11:54, David Hildenbrand wrote:
> On 14.04.23 15:02, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable-order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable-order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>>   - Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> it's measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2]
>> https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3]
>> https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4]
>> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>   - write fault on unallocated address: do_anonymous_page()
>>   - write fault on zero page: wp_page_copy()
>>   - write fault on non-exclusive CoW page: wp_page_copy()
>>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred folio order to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
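>>
>> As a rough sketch of the order calculation for the first 2 cases, the
>> shape is something like this (illustrative only - the helper name is made
>> up and the neighbouring-PTE checks are omitted):
>>
>>     /*
>>      * Sketch: pick the largest order whose naturally aligned range
>>      * around the fault address fits inside the VMA.
>>      */
>>     static int calc_anon_folio_order(struct vm_fault *vmf, int max_order)
>>     {
>>         struct vm_area_struct *vma = vmf->vma;
>>         int order;
>>
>>         for (order = max_order; order > 0; order--) {
>>             unsigned long size = PAGE_SIZE << order;
>>             unsigned long start = ALIGN_DOWN(vmf->address, size);
>>
>>             if (start >= vma->vm_start && start + size <= vma->vm_end)
>>                 break;
>>         }
>>
>>         return order; /* order-0 means fall back to a base page */
>>     }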
> 
> Just a note (that you maybe already know) that we have to be a bit careful in
> the wp_copy path with replacing sub-pages that are marked exclusive.

Ahh, no, I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.

(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)

My current patch always prefers reuse over copy, and expands the reuse to the
biggest set of pages that are all exclusive (determined either from the
presence of the anon_exclusive flag or from the refcount) and covered by the
same folio (plus a few other bounds constraints - see
calc_anon_folio_range_reuse()).

If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? If so, this could be
fixed by adding that check to the function.
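
Concretely, I imagine the fix looking something like the fragment below
inside the scan, run under the PT lock (hypothetical - the real shape of
calc_anon_folio_order_copy() may differ, and vma, addr, pte, nr and i are
assumed from the surrounding code):

    /*
     * Stop expanding the copy region as soon as a sub-page is still
     * mapped exclusive: an exclusive page may be pinned concurrently,
     * so it must not be replaced by a copy.
     */
    for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
        struct page *page = vm_normal_page(vma, addr, ptep_get(pte + i));

        if (!page || PageAnonExclusive(page))
            break; /* the copy region ends here */
    }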

> 
> Currently, we always only replace a single shared anon (sub)page by a fresh
> exclusive base-page during a write-fault/unsharing. As the sub-page is already
> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.

When you say "maybe shared", is that determined by the absence of the
"exclusive" flag?

> 
> If you now decide to replace more subpages, you have to be careful that none of
> them are still exclusive -- because they could get pinned concurrently and
> replacing them would result in memory corruptions.
> 
> There are scenarios (most prominently MADV_WIPEONFORK, but also a failed
> partial fork()) that could result in something like that.

Are there any test cases that stress the kernel in this area that I could use to
validate my fix?

> 
> Further, we have to be a bit careful regarding replacing ranges that are backed
> by different anon pages (for example, due to fork() deciding to copy some
> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and, as part of
that scan, make sure that I only cover a single folio. So I think I conform
here - the scan would give up once it gets to a hole.
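
In code terms, the guard in that scan is conceptually (hypothetical helper,
not the literal patch code):

    /*
     * A PTE is only included in the scan window if it maps a page of
     * the same folio; a hole or a different folio stops the expansion.
     */
    static bool pte_maps_folio(struct vm_area_struct *vma, unsigned long addr,
                               pte_t pte, struct folio *folio)
    {
        struct page *page;

        if (!pte_present(pte))
            return false; /* hole */

        page = vm_normal_page(vma, addr, pte);
        return page && page_folio(page) == folio;
    }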

> 
> 
> So what should be safe is replacing all sub-pages of a folio that are marked
> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
> worth the complexity. For THP we have so far been happy to *not* optimize
> this, implying that maybe we shouldn't worry about optimizing the fork() case
> that heavily for now.

I don't have the exact numbers to hand, but I'm pretty sure I remember that
enabling large copies contributed a measurable amount to the performance
improvement. (Certainly the zero-page copy case is a big contributor.) I don't
have access to the HW at the moment but can rerun later with and without to
double check.

> 
> 
> One optimization one could think of instead (that I raised previously in another
> context) is the detection of exclusivity after fork()+exit in the child (IOW,
> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
> Scanning the surrounding page table, if it makes sense (e.g., page_count() <=
> folio_nr_pages()), to test whether all page references are from the current
> process, would allow reusing the folio (setting PG_anon_exclusive on the
> sub-pages). The smaller the folio order, the cheaper this surrounding-PTE
> scan is.
> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
> optimization.

Yes, I have already implemented this in my series; see patch 10.
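
For reference, the check has roughly the following shape (a simplified
sketch, not the literal patch 10 code; it is deliberately conservative, e.g.
a transient swap cache reference would make it decline the reuse):

    /*
     * Under the PT lock: decide whether every reference to a PTE-mapped
     * anon folio comes from this process's own mappings, in which case
     * the sub-pages can be marked exclusive again and reused.
     */
    static bool folio_all_refs_local(struct folio *folio, pte_t *first_pte,
                                     unsigned long nr)
    {
        unsigned long i, mapped = 0;

        for (i = 0; i < nr; i++) {
            pte_t pte = ptep_get(first_pte + i);

            if (pte_present(pte) &&
                pte_pfn(pte) >= folio_pfn(folio) &&
                pte_pfn(pte) < folio_pfn(folio) + folio_nr_pages(folio))
                mapped++;
        }

        return folio_ref_count(folio) == mapped;
    }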

Thanks,
Ryan



Thread overview: 44+ messages
2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-04-17  8:49   ` Yin, Fengwei
2023-04-17 10:11     ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
2023-04-14 14:09   ` Kirill A. Shutemov
2023-04-14 14:38     ` Ryan Roberts
2023-04-14 15:37       ` Kirill A. Shutemov
2023-04-14 16:06         ` Ryan Roberts
2023-04-14 16:18           ` Matthew Wilcox
2023-04-14 16:31             ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
2023-04-17  8:25   ` Yin, Fengwei
2023-04-17 10:13     ` Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
2023-04-17 10:19   ` Ryan Roberts
2023-04-17  8:19 ` Yin, Fengwei
2023-04-17 10:28   ` Ryan Roberts
2023-04-17 10:54 ` David Hildenbrand
2023-04-17 11:43   ` Ryan Roberts [this message]
2023-04-17 14:05     ` David Hildenbrand
2023-04-17 15:38       ` Ryan Roberts
2023-04-17 15:44         ` David Hildenbrand
2023-04-17 16:15           ` Ryan Roberts
2023-04-26 10:41           ` Ryan Roberts
2023-05-17 13:58             ` David Hildenbrand
2023-05-18 11:23               ` Ryan Roberts
2023-04-19 10:12       ` Ryan Roberts
2023-04-19 10:51         ` David Hildenbrand
2023-04-19 11:13           ` Ryan Roberts
