Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

From: Ryan Roberts <ryan.roberts@arm.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Mon, 17 Apr 2023 11:28:20 +0100
Message-ID: <9765f58f-8004-af0f-07da-5e2528c66779@arm.com>
In-Reply-To: <bb344902-aced-c550-5742-d8a1534a200b@intel.com>

On 17/04/2023 09:19, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1] was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>    overhead. This should benefit all architectures.
>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> it's measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>  - write fault on unallocated address: do_anonymous_page()
>>  - write fault on zero page: wp_page_copy()
>>  - write fault on non-exclusive CoW page: wp_page_copy()
>>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back to
>> lower orders all the way to 0.
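To make the order selection and fallback described above concrete, here is a minimal user-space sketch. It is not the code from the series: preferred_order(), alloc_with_fallback() and alloc_up_to() are hypothetical names, 4KB base pages and a 2MB PMD span are assumed, blocks are assumed to be naturally aligned, and the check of neighboring PTE state is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4KB base pages */
#define PMD_SHIFT  21   /* assumption: one PMD entry spans 2MB */

/*
 * Hypothetical helper: largest order (<= max_order) whose naturally aligned
 * block around addr lies entirely inside [vma_start, vma_end) and within a
 * single PMD. Returns 0 if only a single base page fits.
 */
static int preferred_order(uint64_t addr, uint64_t vma_start,
                           uint64_t vma_end, int max_order)
{
    for (int order = max_order; order > 0; order--) {
        uint64_t size = (uint64_t)1 << (PAGE_SHIFT + order);
        uint64_t start = addr & ~(size - 1);
        uint64_t end = start + size;
        bool in_vma = start >= vma_start && end <= vma_end;
        /* Always true for naturally aligned blocks smaller than a PMD;
         * kept only to mirror the constraint listed in the text. */
        bool same_pmd = (start >> PMD_SHIFT) == ((end - 1) >> PMD_SHIFT);

        if (in_vma && same_pmd)
            return order;
    }
    return 0;
}

/* Stand-in for an allocator attempt; returns true if the order was satisfied. */
typedef bool (*try_alloc_fn)(int order, void *ctx);

/* Try the preferred order first, then every lower order down to 0.
 * Returns the order that succeeded, or -1 if even order-0 failed. */
static int alloc_with_fallback(int order, try_alloc_fn try_alloc, void *ctx)
{
    for (; order >= 0; order--)
        if (try_alloc(order, ctx))
            return order;
    return -1;
}

/* Toy allocator for experiments: any order above *ctx fails. */
static bool alloc_up_to(int order, void *ctx)
{
    return order <= *(int *)ctx;
}
```

For example, a fault at 0x13000 in a VMA starting at 0x12000 cannot use order-4 (the aligned 64KB block would start below the VMA), so the helper settles on order-1; with a toy allocator that only satisfies order-2 and below, a preferred order-4 request ends up as order-2.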
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP. If we
>> take a fault in an MADV_HUGEPAGE region, we still get PMD-sized mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in the
>> way things are done. So I'd appreciate any help in scaling up performance
>> testing, review and general advice on how best to guide a change like this into
>> the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen for a
>> couple of reasons:
>>  - From the SW performance perspective, I see a knee around here where
>>    increasing it doesn't lead to much more performance gain.
>>  - Intuitively I assume that higher orders become increasingly difficult to
>>    allocate.
>>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>>    any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
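The orders quoted above follow from simple arithmetic: order = log2(block size / base page size). A small sketch of that calculation is below; order_for_block() is a hypothetical name, and the block sizes used in the example (64KB for 4KB pages, 2MB for 16KB and 64KB pages) are the ones implied by the orders in the text rather than values read from any header.

```c
#include <stdint.h>

/* Smallest order such that (page_bytes << order) covers block_bytes. */
static int order_for_block(uint64_t block_bytes, uint64_t page_bytes)
{
    int order = 0;

    while ((page_bytes << order) < block_bytes)
        order++;
    return order;
}
```

With this, order_for_block(64KB, 4KB) is 4, order_for_block(2MB, 16KB) is 7 and order_for_block(2MB, 64KB) is 5, matching the contiguous bit figures above; an order-2 HPA block with 4KB pages likewise corresponds to 16KB.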
>>
>> However, I also have a (mild) concern about increased memory consumption. If an
>> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
>> we would end up allocating 16x as much memory as we used to. One potential
>> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
>> max allocation order for consecutive faults that extend a contiguous range, and
>> decrement when discontiguous. Alternatively/additionally, we could use the VMA
>> size as an indicator. I'd be interested in your thoughts/opinions.
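One way the per-VMA idea above could look, purely as an illustration: the struct, field and function names are hypothetical, 4KB base pages are assumed, and alignment of the allocated block is ignored for simplicity. The order grows by one each time a fault lands exactly where the previous allocation ended, and shrinks by one otherwise.

```c
#include <stdint.h>

#define PAGE_SHIFT    12  /* assumption: 4KB base pages */
#define ORDER_CAP      4  /* the hardcoded max order discussed above */

/* Hypothetical per-VMA tracking state. */
struct vma_order_state {
    uint64_t next_expected;  /* address just past the last faulted-in range */
    int max_order;           /* current per-VMA allocation order limit */
};

/*
 * Update the per-VMA limit for a fault at addr and return the order to use.
 * A fault that extends the previous range bumps the limit (up to ORDER_CAP);
 * a discontiguous fault decrements it (down to 0).
 */
static int vma_fault_order(struct vma_order_state *s, uint64_t addr)
{
    uint64_t page = addr & ~(((uint64_t)1 << PAGE_SHIFT) - 1);

    if (page == s->next_expected) {
        if (s->max_order < ORDER_CAP)
            s->max_order++;
    } else if (s->max_order > 0) {
        s->max_order--;
    }

    /* Simplification: assume the folio is placed starting at the fault page. */
    s->next_expected = page + ((uint64_t)1 << (PAGE_SHIFT + s->max_order));
    return s->max_order;
}
```

Under this scheme the pathological "touch one page every 64KB" pattern never finds a fault at next_expected, so it stays at order-0, while a sequential writer ramps up to the cap within a handful of faults.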
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of time in
>> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related to contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single spinlock
>> shared among all 160 CPUs. I've solved part of the problem with the last patch
>> in the series (which cuts down the need to take the lock), but at folio free
>> time (free_transhuge_page()), the lock is still taken and I think this could be
>> a problem. Now that most anonymous pages are large folios, this lock is taken a
>> lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so I haven't applied it yet.
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
>> bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to non-0
>> orders, which might improve performance in some cases. madvise is also mistaking
>> exclusive large folios for non-exclusive ones at the moment (due to the "small
>> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
>> frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
>> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
>> before each run.
>>
>> make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
>> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
>> for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the lock
>> contention issue I highlighted above, but haven't bottomed it out yet. It's also
>> not yet clear to me why user-time increases by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task, using the
>> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
>> fluctuation. So I'm only calling out tests that show more than a 5% improvement
>> or more than a 5% regression. Results are the average of 3 runs. Only 2 tests
>> regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      | ops/s    | ops/s     | BIGGER=better  |
>> |----------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
>> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
>> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
>> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
>> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
>> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
>> | read1.csv            |  1806908 |   1905752 |          5.5%  |
>> | read2.csv            |   587722 |   1942062 |        230.4%  |
>> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
>> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free of a 128M
>> object in a loop and never touches the allocated memory. I think the malloc
>> implementation is maintaining a header just before the allocated object, which
>> causes a single page fault. Previously that page fault allocated 1 page. Now it
>> is allocating 16 pages. This cost would be repaid if the test code wrote to the
>> allocated object. Alternatively the folio allocation order policy described
>> above would also solve this.
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on a
>> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
>> workload, then calculated "memory used" high and low watermarks using both
>> MemFree and MemAvailable. If there is a better way of measuring system memory
>> consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or using
>> 1% more. For the low watermark, both methods agree that we are using about 2%
>> more. I plan to investigate whether the proposed folio allocation order policy
>> can reduce this to zero.
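For anyone wanting to reproduce the measurement above, here is a rough sketch of the bookkeeping, not the script that produced the tables. meminfo_kb(), struct watermarks and watermark_update() are hypothetical names; the parser only assumes the usual "Key:   value kB" layout of /proc/meminfo.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value (in kB) of `key`, e.g. "MemFree", from a buffer holding
 * /proc/meminfo-style text, or -1 if the key is absent or malformed.
 */
static long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);

    for (const char *line = text; line && *line; ) {
        /* Require the ':' so "MemFree" does not match a longer key. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            return sscanf(line + klen + 1, "%ld", &kb) == 1 ? kb : -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;  /* step past the newline to the next line */
    }
    return -1;
}

/* Running low/high watermarks of "memory used", in kB. */
struct watermarks {
    long low_kb;
    long high_kb;
    int samples;
};

/* Fold one sample into the watermarks: used = total - free (or available). */
static void watermark_update(struct watermarks *w, long total_kb, long free_kb)
{
    long used = total_kb - free_kb;

    if (w->samples == 0 || used < w->low_kb)
        w->low_kb = used;
    if (w->samples == 0 || used > w->high_kb)
        w->high_kb = used;
    w->samples++;
}
```

A polling loop would read /proc/meminfo into a buffer every 0.5 seconds for the duration of the workload, then call watermark_update() once with the MemFree value and once (on a second struct) with the MemAvailable value, which gives the two variants of mem-used shown in the tables.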
> 
> Besides the memory consumption, large folios could also impact the tail latency
> of page allocation (extra memory zeroing, more operations on the slow path of
> page allocation).

I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations to try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a block of 64K than it is to zero 16x
4K. And currently I'm using __GFP_NORETRY for high order folio allocations,
which I understood to mean we shouldn't suffer high allocation latency due to
reclaim, etc. Although, I know we are discussing whether this is correct in the
thread for patch 3.

> 
> Again, my understanding is that the tail latency of page allocation matters more
> for anonymous pages than for page cache pages, because anonymous pages are
> allocated/freed more frequently.

I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init? Perhaps
that's wishful thinking?
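By "preallocate and lock" I mean something along these lines in user space. This is only a sketch of the pattern: prefault_buffer() and lock_all_memory() are hypothetical helper names, while mlockall() with MCL_CURRENT | MCL_FUTURE is the standard POSIX/Linux call. Note that mlockall() can fail if RLIMIT_MEMLOCK is too low or the process lacks the privilege.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Allocate a working buffer and write to every byte so that all of its pages
 * are faulted in during init rather than on the latency-sensitive path.
 */
static void *prefault_buffer(size_t bytes)
{
    void *buf = malloc(bytes);

    if (buf)
        memset(buf, 0, bytes);
    return buf;
}

/*
 * Pin everything currently mapped, and everything mapped in future, into RAM.
 * Returns 0 on success or a negative errno value on failure.
 */
static int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    return 0;
}
```

An application would call prefault_buffer() for its pools at startup and then lock_all_memory() before entering its steady state, treating a non-zero return as a reason to warn or bail out rather than silently running unlocked.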

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>>   mm: Expose clear_huge_page() unconditionally
>>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>>   mm: Introduce try_vma_alloc_movable_folio()
>>   mm: Implement folio_add_new_anon_rmap_range()
>>   mm: Routines to determine max anon folio allocation order
>>   mm: Allocate large folios for anonymous memory
>>   mm: Allow deferred splitting of arbitrary large anon folios
>>   mm: Implement folio_move_anon_rmap_range()
>>   mm: Update wp_page_reuse() to operate on range of pages
>>   mm: Reuse large folios for anonymous memory
>>   mm: Split __wp_page_copy_user() into 2 variants
>>   mm: ptep_clear_flush_range_notify() macro for batch operation
>>   mm: Implement folio_remove_rmap_range()
>>   mm: Copy large folios for anonymous memory
>>   mm: Convert zero page to large folios on write
>>   mm: mmap: Align unhinted maps to highest anon folio order
>>   mm: Batch-zap large anonymous folio PTE mappings
>>
>>  arch/alpha/include/asm/page.h   |   5 +-
>>  arch/arm64/include/asm/page.h   |   3 +-
>>  arch/arm64/mm/fault.c           |   7 +-
>>  arch/ia64/include/asm/page.h    |   5 +-
>>  arch/m68k/include/asm/page_no.h |   7 +-
>>  arch/s390/include/asm/page.h    |   5 +-
>>  arch/x86/include/asm/page.h     |   5 +-
>>  include/linux/highmem.h         |  23 +-
>>  include/linux/mm.h              |   8 +-
>>  include/linux/mmu_notifier.h    |  31 ++
>>  include/linux/rmap.h            |   6 +
>>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>>  mm/mmap.c                       |   4 +-
>>  mm/rmap.c                       | 147 +++++-
>>  14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>