From: Ryan Roberts <ryan.roberts@arm.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 11:28:20 +0100
Message-ID: <9765f58f-8004-af0f-07da-5e2528c66779@arm.com>
In-Reply-To: <bb344902-aced-c550-5742-d8a1534a200b@intel.com>
On 17/04/2023 09:19, Yin, Fengwei wrote:
>
>
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1] was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>> - Since SW (the kernel) is dealing with larger chunks of memory than base
>> pages, there are efficiency savings to be had; fewer page faults, batched PTE
>> and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>> overhead. This should benefit all architectures.
>> - Since we are now mapping physically contiguous chunks of memory, we can take
>> advantage of HW TLB compression techniques. A reduction in TLB pressure
>> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>> TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> it's measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>> - write fault on unallocated address: do_anonymous_page()
>> - write fault on zero page: wp_page_copy()
>> - write fault on non-exclusive CoW page: wp_page_copy()
>> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back to
>> lower orders all the way to 0.
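
For illustration, the fallback described above can be sketched in user-space C.
This is a minimal sketch and not code from the series: alloc_with_fallback(),
try_alloc_fn and alloc_up_to() are hypothetical names, and the callback merely
stands in for the real page allocator.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical allocator callback: returns true if a folio of the given
 * order could be allocated. It stands in for the real page allocator.
 */
typedef bool (*try_alloc_fn)(unsigned int order, void *ctx);

/*
 * Try the preferred order first, then each lower order down to 0.
 * Returns the order that succeeded, or -1 if even order-0 failed.
 */
static int alloc_with_fallback(unsigned int preferred_order,
			       try_alloc_fn try_alloc, void *ctx)
{
	for (int order = (int)preferred_order; order >= 0; order--) {
		if (try_alloc((unsigned int)order, ctx))
			return order;
	}
	return -1;
}

/*
 * Example policy for experimenting: only orders less than or equal to
 * the limit pointed to by ctx succeed (simulates fragmentation).
 */
static bool alloc_up_to(unsigned int order, void *ctx)
{
	return order <= *(const unsigned int *)ctx;
}
```

With a limit of 2, a request for order-4 ends up with an order-2 folio; with a
limit of 0 it degrades to a single base page, which is the pre-series behavior.
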
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP. If we
>> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in the
>> way things are done. So I'd appreciate any help in scaling up performance
>> testing, review and general advice on how best to guide a change like this into
>> the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen for a
>> couple of reasons:
>> - From the SW performance perspective, I see a knee around here where
>> increasing it doesn't lead to much more performance gain.
>> - Intuitively I assume that higher orders become increasingly difficult to
>> allocate.
>> - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>> "the contiguous bit" works on order-4 for 4KB base pages (although it's
>> order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>> any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
>>
>> However, I also have a (mild) concern about increased memory consumption. If an
>> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
>> we would end up allocating 16x as much memory as we used to. One potential
>> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
>> max allocation order for consecutive faults that extend a contiguous range, and
>> decrement when discontiguous. Alternatively/additionally, we could use the VMA
>> size as an indicator. I'd be interested in your thoughts/opinions.
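
To make the numbers concrete: with 4KB base pages an order-4 folio is
1 << 4 = 16 pages = 64KB, so touching one page every 64KB costs 64KB per fault
instead of 4KB. The per-VMA tracking idea could look roughly like the sketch
below. It is only an illustration under stated assumptions: struct
vma_order_hint and order_for_fault() are hypothetical, 4KB base pages are
assumed, and folio alignment, VMA/PMD bounds and neighboring PTE state are all
ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define BASE_PAGE_SHIFT	12	/* assumes 4KB base pages */
#define BASE_PAGE_SIZE	((uint64_t)1 << BASE_PAGE_SHIFT)
#define MAX_ANON_ORDER	4	/* max order currently used by the series */

/* Hypothetical per-VMA state; not a structure from the patch set. */
struct vma_order_hint {
	uint64_t next_addr;	/* first address past the last allocation */
	unsigned int order;	/* current max allocation order */
	bool valid;		/* false until the first fault is seen */
};

/*
 * Return the order to use for a fault at addr and update the hint.
 * The order grows by one when the fault lands on the page immediately
 * after the previous allocation, and shrinks by one otherwise.
 */
static unsigned int order_for_fault(struct vma_order_hint *h, uint64_t addr)
{
	uint64_t page = addr & ~(BASE_PAGE_SIZE - 1);

	if (!h->valid) {
		h->order = 0;		/* first fault: start small */
		h->valid = true;
	} else if (page == h->next_addr) {
		if (h->order < MAX_ANON_ORDER)
			h->order++;	/* fault extends the contiguous range */
	} else if (h->order > 0) {
		h->order--;		/* discontiguous fault: back off */
	}

	/* Simplification: the allocation starts at the faulting page. */
	h->next_addr = page + (BASE_PAGE_SIZE << h->order);
	return h->order;
}
```

Under this policy a sequential toucher ramps up to order-4 within five faults,
while the sparse every-64KB pattern never gets past order-0.
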
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of time in
>> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related to contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single spinlock
>> shared among all 160 CPUs. I've solved part of the problem with the last patch
>> in the series (which cuts down the need to take the lock), but at folio free
>> time (free_transhuge_page()), the lock is still taken and I think this could be
>> a problem. Now that most anonymous pages are large folios, this lock is taken a
>> lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
>> bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to non-0
>> orders, which might improve performance in some cases. madvise is also mistaking
>> exclusive large folios for non-exclusive ones at the moment (due to the "small
>> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
>> frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
>> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
>> before each run.
>>
>> make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
>> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
>> for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the lock
>> contention issue I highlighted above, but haven't bottomed it out yet. It's also
>> not yet clear to me why user-time increases by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task, using the
>> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
>> fluctuation. So I'm only calling out tests that show more than a 5%
>> improvement or more than a 5% regression. Results are the average of 3 runs.
>> Only 2 tests regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      |    ops/s |     ops/s | BIGGER=better  |
>> |----------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |           6.8% |
>> | malloc1.csv          |    96214 |     50890 |         -47.1% |
>> | mmap1.csv            |   410253 |    375746 |          -8.4% |
>> | page_fault1.csv      |   624061 |   3185678 |         410.5% |
>> | page_fault2.csv      |   416483 |    557448 |          33.8% |
>> | page_fault3.csv      |   724566 |   1152726 |          59.1% |
>> | read1.csv            |  1806908 |   1905752 |           5.5% |
>> | read2.csv            |   587722 |   1942062 |         230.4% |
>> | tlb_flush1.csv       |   143910 |    152097 |           5.7% |
>> | tlb_flush2.csv       |   266763 |    322320 |          20.8% |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
>> object in a loop and never touches the allocated memory. I think the malloc
>> implementation is maintaining a header just before the allocated object, which
>> causes a single page fault. Previously that page fault allocated 1 page. Now it
>> is allocating 16 pages. This cost would be repaid if the test code wrote to the
>> allocated object. Alternatively the folio allocation order policy described
>> above would also solve this.
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on a
>> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
>> workload, then calculated "memory used" high and low watermarks using both
>> MemFree and MemAvailable. If there is a better way of measuring system memory
>> consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      |     (MB) |      (MB) | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      |     (MB) |      (MB) | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or using
>> 1% more. For the low watermark, both methods agree that we are using about 2%
>> more. I plan to investigate whether the proposed folio allocation order policy
>> can reduce this to zero.
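
The measurement method described above (poll /proc/meminfo, subtract MemFree or
MemAvailable from the 4GB total, track low and high watermarks) can be sketched
as below. This is a rough sketch, not the script that produced the tables:
meminfo_kb(), watermark_update(), sample_used_kb() and struct watermarks are
hypothetical names, and the caller is assumed to supply the 0.5 second polling
loop.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "<key>:" at the start of a line in text (the contents of
 * /proc/meminfo) and return its value in kB, or -1 if not found.
 */
static long meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return strtol(line + klen + 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Low and high watermarks of "memory used", in kB. */
struct watermarks {
	long low_kb;
	long high_kb;
	int samples;
};

/* Fold one sample into the running low/high watermarks. */
static void watermark_update(struct watermarks *w, long used_kb)
{
	if (w->samples == 0 || used_kb < w->low_kb)
		w->low_kb = used_kb;
	if (w->samples == 0 || used_kb > w->high_kb)
		w->high_kb = used_kb;
	w->samples++;
}

/*
 * Read /proc/meminfo once and return total_kb minus the value of key
 * ("MemFree" or "MemAvailable"), or -1 on error.
 */
static long sample_used_kb(long total_kb, const char *key)
{
	char buf[8192];
	size_t n;
	long val;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	val = meminfo_kb(buf, key);
	return val < 0 ? -1 : total_kb - val;
}
```

A polling loop would call sample_used_kb(4L * 1024 * 1024, "MemFree") every
0.5 seconds while the workload runs and pass each non-negative result to
watermark_update(); a second struct watermarks would track the MemAvailable
variant.
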
>
> Besides the memory consumption, the large folio could impact the tail latency
> of page allocation also (extra zeroing memory, more operations on slow path of
> page allocation).
I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations to try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a block of 64K than it is to zero 16x
4K. And currently I'm using __GFP_NORETRY for high order folio allocations,
which I understood to mean we shouldn't suffer high allocation latency due to
reclaim, etc. Although, I know we are discussing whether this is correct in the
thread for patch 3.
>
> Again, my understanding is that the tail latency of page allocation is more
> important for anonymous pages than for page cache pages, because anonymous
> pages are allocated/freed more frequently.
I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init? Perhaps
that's wishful thinking?
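
As a concrete example of the "preallocate and lock during init" pattern I have
in mind, here is a minimal user-space sketch. The helper names
lock_all_memory() and alloc_prefaulted() are made up for illustration; the
only real interfaces used are mlockall(2), sysconf(3) and malloc(3). Note that
mlockall() can fail if RLIMIT_MEMLOCK is too low and the process lacks
CAP_IPC_LOCK, so the caller needs to check the return value.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Lock all current and future mappings into RAM.
 * Returns 0 on success, -1 on failure (errno is set by mlockall()).
 */
static int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Allocate len bytes and write one byte per page so that every page is
 * faulted in during init rather than on the latency-sensitive path.
 * Returns the buffer (to be released with free()), or NULL on failure.
 * If pages_touched is non-NULL it receives the number of pages written.
 */
static unsigned char *alloc_prefaulted(size_t len, size_t *pages_touched)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	size_t n = 0;

	if (psz <= 0 || len == 0)
		return NULL;

	buf = malloc(len);
	if (!buf)
		return NULL;

	for (size_t off = 0; off < len; off += (size_t)psz) {
		/* volatile store so the compiler cannot drop the write */
		((volatile unsigned char *)buf)[off] = 0;
		n++;
	}

	if (pages_touched)
		*pages_touched = n;
	return buf;
}
```

An application would call lock_all_memory() once at startup and then allocate
its working buffers with alloc_prefaulted(), so that no page faults are taken
on those buffers afterwards.
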
>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>> mm: Expose clear_huge_page() unconditionally
>> mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>> mm: Introduce try_vma_alloc_movable_folio()
>> mm: Implement folio_add_new_anon_rmap_range()
>> mm: Routines to determine max anon folio allocation order
>> mm: Allocate large folios for anonymous memory
>> mm: Allow deferred splitting of arbitrary large anon folios
>> mm: Implement folio_move_anon_rmap_range()
>> mm: Update wp_page_reuse() to operate on range of pages
>> mm: Reuse large folios for anonymous memory
>> mm: Split __wp_page_copy_user() into 2 variants
>> mm: ptep_clear_flush_range_notify() macro for batch operation
>> mm: Implement folio_remove_rmap_range()
>> mm: Copy large folios for anonymous memory
>> mm: Convert zero page to large folios on write
>> mm: mmap: Align unhinted maps to highest anon folio order
>> mm: Batch-zap large anonymous folio PTE mappings
>>
>> arch/alpha/include/asm/page.h | 5 +-
>> arch/arm64/include/asm/page.h | 3 +-
>> arch/arm64/mm/fault.c | 7 +-
>> arch/ia64/include/asm/page.h | 5 +-
>> arch/m68k/include/asm/page_no.h | 7 +-
>> arch/s390/include/asm/page.h | 5 +-
>> arch/x86/include/asm/page.h | 5 +-
>> include/linux/highmem.h | 23 +-
>> include/linux/mm.h | 8 +-
>> include/linux/mmu_notifier.h | 31 ++
>> include/linux/rmap.h | 6 +
>> mm/memory.c | 877 ++++++++++++++++++++++++++++----
>> mm/mmap.c | 4 +-
>> mm/rmap.c | 147 +++++-
>> 14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>