[RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
Date: Fri, 14 Apr 2023 14:02:46 +0100
Message-ID: <20230414130303.2345383-1-ryan.roberts@arm.com>

Hi All,

This is a second RFC and my first proper attempt at implementing variable order,
large folios for anonymous memory. The first RFC [1] was a partial
implementation and a plea for help in debugging an issue I was hitting; thanks
to Yin Fengwei and Matthew Wilcox for their advice in solving that!

The objective of variable order anonymous folios is to improve performance by
allocating larger chunks of memory during anonymous page faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had; fewer page faults, batched PTE
   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].

This patch set deals with the SW side of things only but sets us up nicely for
taking advantage of the HW improvements in the near future.

I'm not yet benchmarking a wide variety of use cases, but those that I have
looked at are positive; I see kernel compilation time improved by up to 10%,
which I expect to improve further once I add in the arm64 "contiguous bit".
Memory consumption is somewhere between 1% less and 2% more, depending on how
it's measured. More on perf and memory below.

The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
conflict resolution). I have a tree at [4].

[1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
[3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2

Approach
========

There are 4 fault paths that have been modified:
 - write fault on unallocated address: do_anonymous_page()
 - write fault on zero page: wp_page_copy()
 - write fault on non-exclusive CoW page: wp_page_copy()
 - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()

In the first 2 cases, we will determine the preferred order folio to allocate,
limited by a max order (currently order-4; see below), VMA and PMD bounds, and
state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
folio as the source, subject to constraints that may arise if the source has
been mremapped or partially munmapped. And in the 4th case, we reuse as much of
the folio as we can, subject to the same mremap/munmap constraints.

If allocation of our preferred folio order fails, we gracefully fall back to
lower orders all the way to 0.
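
To make the order selection above concrete, here is a minimal userspace sketch.
It is not the code from the series: struct vma_model, preferred_order() and
alloc_with_fallback() are hypothetical names, the folio is assumed to be
naturally aligned in virtual address space, and the avail_orders bitmask simply
stands in for "an allocation of this order would succeed". It assumes 4KB base
pages and 2MB PMDs.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT     12                  /* assumption: 4KB base pages */
#define PAGE_SIZE      (1UL << PAGE_SHIFT)
#define PMD_SIZE       (1UL << 21)         /* assumption: 2MB per PMD */
#define ANON_MAX_ORDER 4                   /* hardcoded max order from the text */

/* Hypothetical stand-in for a VMA: covers [start, end), page aligned. */
struct vma_model {
	unsigned long start;
	unsigned long end;
	const bool *pte_present;           /* one flag per page, index 0 == start */
};

/* True if every PTE in [start, end) is still unpopulated. */
static bool ptes_none(const struct vma_model *vma, unsigned long start,
		      unsigned long end)
{
	for (unsigned long a = start; a < end; a += PAGE_SIZE)
		if (vma->pte_present[(a - vma->start) >> PAGE_SHIFT])
			return false;
	return true;
}

/*
 * Highest order <= ANON_MAX_ORDER whose naturally aligned block around addr
 * stays inside the VMA and the PMD, and covers only unpopulated PTEs.
 */
static int preferred_order(const struct vma_model *vma, unsigned long addr)
{
	unsigned long pmd_start = addr & ~(PMD_SIZE - 1);

	assert(addr >= vma->start && addr < vma->end);

	for (int order = ANON_MAX_ORDER; order > 0; order--) {
		unsigned long size = PAGE_SIZE << order;
		unsigned long start = addr & ~(size - 1);
		unsigned long end = start + size;

		if (start < vma->start || end > vma->end)
			continue;          /* crosses a VMA bound */
		if (start < pmd_start || end > pmd_start + PMD_SIZE)
			continue;          /* crosses a PMD bound */
		if (!ptes_none(vma, start, end))
			continue;          /* a neighboring PTE is populated */
		return order;
	}
	return 0;
}

/*
 * Fall back from the preferred order towards 0. Bit N of avail_orders is set
 * if an order-N allocation would succeed (hypothetical allocator model).
 * Returns the order obtained, or -1 if even order-0 fails.
 */
static int alloc_with_fallback(int preferred, unsigned int avail_orders)
{
	for (int order = preferred; order >= 0; order--)
		if (avail_orders & (1U << order))
			return order;
	return -1;
}
```

With natural alignment and a max order below the PMD order, the PMD check can
never fail; it is kept only to mirror the list of limits in the text.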

Note that none of this affects the behavior of traditional PMD-sized THP. If we
take a fault in an MADV_HUGEPAGE region, we still get PMD-sized mappings.

Open Questions
==============

How to Move Forwards
--------------------

While the series is a small-ish code change, it represents a big shift in the
way things are done. So I'd appreciate any help in scaling up performance
testing, review and general advice on how best to guide a change like this into
the kernel.

Folio Allocation Order Policy
-----------------------------

The current code is hardcoded to use a maximum order of 4. This was chosen for a
couple of reasons:
 - From the SW performance perspective, I see a knee around here where
   increasing it doesn't lead to much more performance gain.
 - Intuitively I assume that higher orders become increasingly difficult to
   allocate.
 - From the HW performance perspective, arm64's HPA works on order-2 blocks and
   "the contiguous bit" works on order-4 for 4KB base pages (although it's
   order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
   any higher.

I suggest that ultimately setting the max order should be left to the
architecture. arm64 would take advantage of this and set it to the order
required for the contiguous bit for the configured base page size.
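
As a rough illustration of what an architecture-provided max order could look
like, the sketch below encodes the contiguous-bit orders quoted above.
contpte_order() and folio_bytes() are hypothetical names and not part of this
series or of the arm64 code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Order of the block covered by arm64's contiguous bit for each base page
 * size, using the orders quoted in the text. Returns -1 for other sizes.
 */
static int contpte_order(size_t base_page_size)
{
	switch (base_page_size) {
	case 4 * 1024:
		return 4;       /* 16 x 4KB pages   = 64KB */
	case 16 * 1024:
		return 7;       /* 128 x 16KB pages = 2MB  */
	case 64 * 1024:
		return 5;       /* 32 x 64KB pages  = 2MB  */
	default:
		return -1;
	}
}

/* Size in bytes of a folio of the given order. */
static size_t folio_bytes(size_t base_page_size, int order)
{
	assert(order >= 0);
	return base_page_size << order;
}
```

So an arch-selected max order would mean 64KB folios with 4KB base pages, but
2MB folios with 16KB or 64KB base pages.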

However, I also have a (mild) concern about increased memory consumption. If an
app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
we would end up allocating 16x as much memory as we used to. One potential
approach I see here is to track fault addresses per-VMA, and increase a per-VMA
max allocation order for consecutive faults that extend a contiguous range, and
decrement when discontiguous. Alternatively/additionally, we could use the VMA
size as an indicator. I'd be interested in your thoughts/opinions.
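
One possible shape for the per-VMA tracking idea is sketched below. This is
only an illustration of the heuristic described above, not something the series
implements: struct vma_fault_hint and fault_hint_update() are hypothetical, 4KB
base pages are assumed, and folio alignment is ignored for brevity.

```c
#include <assert.h>

#define PAGE_SHIFT     12   /* assumption: 4KB base pages */
#define ANON_MAX_ORDER 4    /* hardcoded max order from the text */

/* Hypothetical per-VMA state for the adaptive policy. */
struct vma_fault_hint {
	unsigned long next_addr;  /* address just past the last faulted range */
	int max_order;            /* current per-VMA cap, 0..ANON_MAX_ORDER */
};

/*
 * Record a fault at addr and return the order cap to use for it. A fault
 * that extends the previous range raises the cap; any other fault lowers it.
 */
static int fault_hint_update(struct vma_fault_hint *h, unsigned long addr)
{
	unsigned long page = addr & ~((1UL << PAGE_SHIFT) - 1);

	assert(h->max_order >= 0 && h->max_order <= ANON_MAX_ORDER);

	if (page == h->next_addr) {
		if (h->max_order < ANON_MAX_ORDER)
			h->max_order++;
	} else if (h->max_order > 0) {
		h->max_order--;
	}

	/* Assume the fault is then served with a folio of the capped order. */
	h->next_addr = page + (1UL << (PAGE_SHIFT + h->max_order));
	return h->max_order;
}
```

With this model, a sequential toucher ramps up to order-4 after a handful of
faults, while the pathological "touch once every 64KB" pattern stays at order-0
and so avoids the 16x amplification.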

Deferred Split Queue Lock Contention
------------------------------------

The results below show that we are spending a much greater proportion of time in
the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.

I think this is (at least partially) related to contention on the deferred
split queue lock. This is a per-memcg spinlock, which means a single spinlock
shared among all 160 CPUs. I've solved part of the problem with the last patch
in the series (which cuts down the need to take the lock), but at folio free
time (free_transhuge_page()), the lock is still taken and I think this could be
a problem. Now that most anonymous pages are large folios, this lock is taken a
lot more.

I think we could probably avoid taking the lock unless !list_empty(), but I
haven't convinced myself it's definitely safe, so haven't applied it yet.
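
To show the shape of that idea, here is a single-threaded userspace model. It
deliberately says nothing about whether the unlocked check is safe under
concurrency, which is the open question. All names are hypothetical, and a
counter stands in for acquisitions of the deferred split queue lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular list node, in the style of the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static bool list_is_empty(const struct list_node *n) { return n->next == n; }

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical queue; lock_acquisitions models taking the shared spinlock. */
struct split_queue {
	struct list_node head;
	unsigned long len;
	unsigned long lock_acquisitions;
};

/* Hypothetical folio; 'deferred' is self-linked while not on the queue. */
struct folio_model {
	struct list_node deferred;
};

static void queue_folio(struct split_queue *q, struct folio_model *f)
{
	q->lock_acquisitions++;                 /* lock */
	if (list_is_empty(&f->deferred)) {
		list_add_tail_node(&f->deferred, &q->head);
		q->len++;
	}
	/* unlock */
}

static void free_folio(struct split_queue *q, struct folio_model *f)
{
	if (list_is_empty(&f->deferred))
		return;                         /* never queued: skip the lock */

	q->lock_acquisitions++;                 /* lock */
	if (!list_is_empty(&f->deferred)) {     /* recheck under the lock */
		assert(q->len > 0);
		list_del_init_node(&f->deferred);
		q->len--;
	}
	/* unlock */
}
```

In this model, freeing a folio that was never queued for deferred split costs
no lock acquisition at all, which is the common case being targeted.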

Roadmap
=======

Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
bit" on arm64 to validate predictions about HW speedups.

I also think there are some opportunities with madvise to split folios to non-0
orders, which might improve performance in some cases. madvise is also mistaking
exclusive large folios for non-exclusive ones at the moment (due to the "small
pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
frees the folio.

Results
=======

Performance
-----------

Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
before each run.

make defconfig && time make -jN Image

First with -j8:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |          373.0 |          342.8 |          -8.1% |
| user-time |         2333.9 |         2275.3 |          -2.5% |
| sys-time  |          510.7 |          340.9 |         -33.3% |

The above shows an 8.1% improvement in real-time execution and a 33.3% saving in
kernel execution time. The next 2 tables show a breakdown of the cycles spent in
the kernel for the 8 job config:

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| data abort           |     683B |      316B |         -53.8% |
| instruction abort    |      93B |       76B |         -18.4% |
| syscall              |     887B |      767B |         -13.6% |

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| arm64_sys_openat     |     194B |      188B |          -3.3% |
| arm64_sys_exit_group |     192B |      124B |         -35.7% |
| arm64_sys_read       |     124B |      108B |         -12.7% |
| arm64_sys_execve     |      75B |       67B |         -11.0% |
| arm64_sys_mmap       |      51B |       50B |          -3.0% |
| arm64_sys_mprotect   |      15B |       13B |         -12.0% |
| arm64_sys_write      |      43B |       42B |          -2.9% |
| arm64_sys_munmap     |      15B |       12B |         -17.0% |
| arm64_sys_newfstatat |      46B |       41B |          -9.7% |
| arm64_sys_clone      |      26B |       24B |         -10.0% |

And now with -j160:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           48.2 |         -10.2% |
| user-time |         2705.8 |         2842.1 |           5.0% |
| sys-time  |         1370.4 |         1064.3 |         -22.3% |

Above shows a 10.2% improvement in real time execution. But ~3x more time is
spent in the kernel than for the -j8 config. I think this is related to the lock
contention issue I highlighted above, but haven't bottomed it out yet. It's also
not yet clear to me why user-time increases by 5%.

I've also run all the will-it-scale microbenchmarks for a single task, using the
process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
fluctuation. So I'm just calling out tests whose results improve by more than 5%
or regress by more than 5%. Results are the average of 3 runs. Only 2 tests
regressed:

| benchmark            | baseline | anonfolio | percent change |
|                      | ops/s    | ops/s     | BIGGER=better  |
|----------------------|---------:|----------:|---------------:|
| context_switch1.csv  |   328744 |    351150 |          6.8%  |
| malloc1.csv          |    96214 |     50890 |        -47.1%  |
| mmap1.csv            |   410253 |    375746 |         -8.4%  |
| page_fault1.csv      |   624061 |   3185678 |        410.5%  |
| page_fault2.csv      |   416483 |    557448 |         33.8%  |
| page_fault3.csv      |   724566 |   1152726 |         59.1%  |
| read1.csv            |  1806908 |   1905752 |          5.5%  |
| read2.csv            |   587722 |   1942062 |        230.4%  |
| tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
| tlb_flush2.csv       |   266763 |    322320 |         20.8%  |

I believe malloc1 is an unrealistic test, since it does malloc/free for a 128M
object in a loop and never touches the allocated memory. I think the malloc
implementation is maintaining a header just before the allocated object, which
causes a single page fault. Previously that page fault allocated 1 page. Now it
is allocating 16 pages. This cost would be repaid if the test code wrote to the
allocated object. Alternatively the folio allocation order policy described
above would also solve this.

It is not clear to me why mmap1 has slowed down. This remains a todo.

Memory
------

I measured memory consumption while doing a kernel compile with 8 jobs on a
system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
workload, then calculated "memory used" high and low watermarks using both
MemFree and MemAvailable. If there is a better way of measuring system memory
consumption, please let me know!

mem-used = 4GB - /proc/meminfo:MemFree

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      825 |       842 |           2.1% |
| mem-used-high        |     2697 |      2672 |          -0.9% |

mem-used = 4GB - /proc/meminfo:MemAvailable

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      518 |       530 |           2.3% |
| mem-used-high        |     1522 |      1537 |           1.0% |

For the high watermark, the methods disagree; we are either saving 1% or using
1% more. For the low watermark, both methods agree that we are using about 2%
more. I plan to investigate whether the proposed folio allocation order policy
can reduce this to zero.
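
For anyone wanting to reproduce the measurement, the watermark tracking
described above can be sketched roughly as follows. This is not the script I
used; meminfo_kb() and struct watermarks are hypothetical names, and the
polling loop itself (read /proc/meminfo, sleep 0.5 seconds, repeat) is left
out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "<key>: <value> kB" in /proc/meminfo-formatted text and return the
 * value in kB, or -1 if the key is missing or malformed.
 */
static long meminfo_kb(const char *text, const char *key)
{
	size_t klen;
	const char *line = text;

	assert(text != NULL && key != NULL);
	klen = strlen(key);

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long kb;

			if (sscanf(line + klen + 1, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		line = strchr(line, '\n');      /* advance to the next line */
		if (line)
			line++;
	}
	return -1;
}

/* Running low and high watermarks of "memory used", in kB. */
struct watermarks {
	long low_kb;
	long high_kb;
	int samples;
};

/* Fold one sample into the watermarks: used = total - free (or available). */
static void watermark_update(struct watermarks *w, long total_kb, long free_kb)
{
	long used = total_kb - free_kb;

	if (w->samples == 0 || used < w->low_kb)
		w->low_kb = used;
	if (w->samples == 0 || used > w->high_kb)
		w->high_kb = used;
	w->samples++;
}
```

Each poll would call meminfo_kb() once for MemFree and once for MemAvailable,
and feed each value into its own struct watermarks with total_kb set to 4GB.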

Thanks for making it this far!
Ryan

Ryan Roberts (17):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Routines to determine max anon folio allocation order
  mm: Allocate large folios for anonymous memory
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Implement folio_move_anon_rmap_range()
  mm: Update wp_page_reuse() to operate on range of pages
  mm: Reuse large folios for anonymous memory
  mm: Split __wp_page_copy_user() into 2 variants
  mm: ptep_clear_flush_range_notify() macro for batch operation
  mm: Implement folio_remove_rmap_range()
  mm: Copy large folios for anonymous memory
  mm: Convert zero page to large folios on write
  mm: mmap: Align unhinted maps to highest anon folio order
  mm: Batch-zap large anonymous folio PTE mappings

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +-
 include/linux/mm.h              |   8 +-
 include/linux/mmu_notifier.h    |  31 ++
 include/linux/rmap.h            |   6 +
 mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
 mm/mmap.c                       |   4 +-
 mm/rmap.c                       | 147 +++++-
 14 files changed, 1000 insertions(+), 133 deletions(-)

--
2.25.1