linux-kernel.vger.kernel.org archive mirror
From: Johannes Weiner <hannes@cmpxchg.org>
To: linux-mm@kvack.org
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [RFC PATCH 00/26] mm: reliable huge page allocator
Date: Tue, 18 Apr 2023 15:12:47 -0400
Message-ID: <20230418191313.268131-1-hannes@cmpxchg.org>

As memory capacity continues to grow, 4k TLB coverage has not been
able to keep up. On Meta's 64G webservers, close to 20% of execution
cycles are observed to be handling TLB misses when using 4k pages
only. Huge pages are shifting from being a nice-to-have optimization
for HPC workloads to becoming a necessity for common applications.

However, while trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that often prevents larger
requests from being met quickly, or met at all, at runtime. Since we
have to provision hardware capacity for worst case performance,
unreliable huge page coverage isn't of much help.

Drilling into the allocator, we find that existing defrag efforts,
such as mobility grouping and watermark boosting, help, but are
insufficient by themselves. We still observe a high number of blocks
being routinely shared by allocations of different migratetypes. This
in turn results in inefficient or ineffective reclaim/compaction runs.

In a broad sample of Meta servers, we find that unmovable allocations
make up less than 7% of total memory on average, yet occupy 34% of the
2M blocks in the system. We also find that this effect isn't
correlated with high uptimes, and that servers can get heavily
fragmented within the first hour of running a workload.
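
To put the 7% vs. 34% figures in perspective, here is a rough
back-of-the-envelope sketch. It assumes a 64G machine (the webserver
size mentioned above) and treats the fleet averages as if they applied
to that one machine, which is a simplification. The function name and
parameters are made up for illustration.

```python
import math

def block_spread(total_bytes, unmovable_frac, touched_block_frac,
                 block_bytes=2 << 20):
    """Compare ideal packing of unmovable memory with the observed spread.

    Returns (total 2M blocks, blocks needed if unmovable pages were
    packed tightly, blocks actually containing unmovable pages).
    """
    nr_blocks = total_bytes // block_bytes
    packed = math.ceil(unmovable_frac * nr_blocks)
    touched = round(touched_block_frac * nr_blocks)
    return nr_blocks, packed, touched

nr_blocks, packed, touched = block_spread(64 << 30, 0.07, 0.34)
print(f"{nr_blocks} blocks total, {packed} if packed, {touched} touched")
print(f"spread factor: {touched / packed:.1f}x")
```

Under these assumptions, unmovable memory touches several times more
blocks than it would need if it were grouped tightly.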

The following experiment shows that only 20min of build load under
moderate memory pressure already results in a significant number of
typemixed blocks (block analysis run after system is back to idle):

vanilla:
unmovable 50
movable 701
reclaimable 149
unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)

patched:
unmovable 65
movable 457
reclaimable 159
free 219
unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)

[ The remaining "mixed blocks" in the patched kernel are false
  positives: LRU pages without migrate callbacks (empty_aops), and
  i915 shmem that is pinned until reclaimed through shrinkers. ]
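
The script that produced the block analysis above is not included in
this posting. As a rough sketch of the kind of classification it
reports, the following counts blocks per migratetype and flags blocks
holding pages foreign to their type. The input format, the EXPECTED
mapping and all names are assumptions chosen to mirror the output
above, not the actual tool.

```python
from collections import Counter

# Page kinds that belong in a block of each migratetype. This mapping
# is an assumption inferred from the report lines above.
EXPECTED = {
    "unmovable": {"kmem", "other"},
    "movable": {"lru"},
    "reclaimable": {"slab"},
}

def analyze(blocks):
    """blocks: iterable of (migratetype, {page_kind: count}) tuples."""
    per_type = Counter()       # number of blocks of each migratetype
    mixed_blocks = Counter()   # blocks containing foreign pages
    foreign_pages = {mt: Counter() for mt in EXPECTED}
    for migratetype, pages in blocks:
        per_type[migratetype] += 1
        if migratetype not in EXPECTED:
            continue  # e.g. "free" blocks hold no pages
        foreign = {kind: n for kind, n in pages.items()
                   if n and kind not in EXPECTED[migratetype]}
        if foreign:
            mixed_blocks[migratetype] += 1
            foreign_pages[migratetype].update(foreign)
    return per_type, mixed_blocks, foreign_pages

sample = [
    ("movable", {"lru": 500, "slab": 3}),   # type-mixed
    ("movable", {"lru": 512}),
    ("unmovable", {"kmem": 10}),
    ("free", {}),
]
per_type, mixed, foreign = analyze(sample)
print(dict(per_type), dict(mixed), dict(foreign["movable"]))
```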

Root causes

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
block fragmentation - especially considering the history of lumpy
reclaim and its tendency to overreclaim. However, with compaction
available, these two goals are no longer in conflict: the scratch
space of free pages for compaction to work is only twice the size of
the allocation request; in most cases, only small amounts of
proactive, coordinated reclaim and compaction are required to prevent
a fallback which may fragment a pageblock indefinitely.

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke them. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request. As most requests
are smaller than a pageblock, this results in partial block freeing
and subsequent fallbacks and type mixing.

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation: 1. Partially used unmovable
blocks are filled up with fallback movable pages. 2. A subsequent
unmovable allocation, instead of grouping up, will then need to enter
reclaim, which most likely results in a partially freed movable block
that it falls back into. Over time, unmovable allocations are sparsely
scattered throughout the address space and poison many pageblocks.

Note that block fragmentation is driven by lower-order requests. It is
not reliably mitigated by the mere presence of higher-order requests.

Proposal

This series proposes to make THP allocations reliable by enforcing
pageblock hygiene, and aligning the allocator, reclaim and compaction
on the pageblock as the base unit for managing free memory. All orders
up to and including the pageblock are made first-class requests that
(outside of OOM situations) are expected to succeed without
exceptional investment by the allocating thread.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.
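
The claiming rules described above can be illustrated with a toy
userspace model. This is only a sketch of the semantics, not the
kernel implementation: the class names, the single-page granularity
and the linear search are simplifications made up for this example.

```python
PAGES_PER_BLOCK = 512  # a 2M pageblock of 4k pages

class Block:
    def __init__(self):
        self.migratetype = "free"  # neutral until the first allocation
        self.used = 0

class ToyAllocator:
    def __init__(self, nr_blocks):
        self.blocks = [Block() for _ in range(nr_blocks)]

    def alloc(self, migratetype):
        """Allocate one page; return the block index, or None."""
        # Keep grouping tight: prefer an open block of the same type.
        for i, b in enumerate(self.blocks):
            if b.migratetype == migratetype and b.used < PAGES_PER_BLOCK:
                b.used += 1
                return i
        # Otherwise claim a neutral block exclusively for this type.
        for i, b in enumerate(self.blocks):
            if b.migratetype == "free":
                b.migratetype = migratetype
                b.used = 1
                return i
        # No fallback into blocks of another type. In the proposal,
        # reclaim/compaction would have to produce a neutral block.
        return None

    def free(self, idx):
        b = self.blocks[idx]
        assert b.used > 0
        b.used -= 1
        if b.used == 0:
            b.migratetype = "free"  # neutral again once fully freed

a = ToyAllocator(nr_blocks=2)
u = a.alloc("unmovable")
m = a.alloc("movable")
print(u, m, a.alloc("reclaimable"))  # third request finds no block
```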

Reclaim and compaction are changed from partial block reclaim to
producing whole neutral page blocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other blocks entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.
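
To illustrate where that overhead comes from, here is a minimal sketch
contrasting a check against all free pages with a check against
neutral blocks only. The function names and the per-block free page
list are hypothetical; the real watermark code is considerably more
involved.

```python
PAGES_PER_BLOCK = 512  # a 2M pageblock of 4k pages

def watermark_ok_total(free_pages_per_block, wmark_pages):
    """Conventional view: every free page counts toward the watermark."""
    return sum(free_pages_per_block) >= wmark_pages

def watermark_ok_neutral(free_pages_per_block, wmark_pages):
    """Proposed view: only pages in entirely free blocks count."""
    neutral = sum(n for n in free_pages_per_block if n == PAGES_PER_BLOCK)
    return neutral >= wmark_pages

# One neutral block, two partially used typed blocks, one full block.
zone = [512, 300, 300, 0]
print(watermark_ok_total(zone, 1024))    # free pages in typed blocks count
print(watermark_ok_neutral(zone, 1024))  # they do not count here
```

In this example the zone has enough free pages overall, but reclaim
would still need to run until a second neutral block exists.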

These changes only apply to CONFIG_COMPACTION kernels. Without
compaction, fallbacks and partial block reclaim remain the best
trade-off between memory utilization and fragmentation.

Initial Test Results

The following is purely an allocation reliability test. Achieving full
THP benefits in practice is tied to other pending changes, such as the
THP shrinker to avoid memory pressure from excessive internal
fragmentation, and tweaks to the kernel's THP allocation strategy.

The test is a kernel build under moderate-to-high memory pressure,
with a concurrent process trying to repeatedly fault THPs (madvise):

                                              HUGEALLOC-VANILLA       HUGEALLOC-PATCHED
Real time                                   265.04 (    +0.00%)     268.12 (    +1.16%)
User time                                  1131.05 (    +0.00%)    1131.13 (    +0.01%)
System time                                 474.66 (    +0.00%)     478.97 (    +0.91%)
THP fault alloc                           17913.24 (    +0.00%)   19647.50 (    +9.68%)
THP fault fallback                         1947.12 (    +0.00%)     223.40 (   -88.48%)
THP fault fail rate %                         9.80 (    +0.00%)       1.12 (   -80.34%)
Direct compact stall                        282.44 (    +0.00%)     543.90 (   +92.25%)
Direct compact fail                         262.44 (    +0.00%)     239.90 (    -8.56%)
Direct compact success                       20.00 (    +0.00%)     304.00 ( +1352.38%)
Direct compact success rate %                 7.15 (    +0.00%)      57.10 (  +612.90%)
Compact daemon scanned migrate            21643.80 (    +0.00%)  387479.80 ( +1690.18%)
Compact daemon scanned free              188462.36 (    +0.00%) 2842824.10 ( +1408.42%)
Compact direct scanned migrate          1601294.84 (    +0.00%)  275670.70 (   -82.78%)
Compact direct scanned free             4476155.60 (    +0.00%) 2438835.00 (   -45.51%)
Compact migrate scanned daemon %              1.32 (    +0.00%)      59.18 ( +2499.00%)
Compact free scanned daemon %                 3.95 (    +0.00%)      54.31 ( +1018.20%)
Alloc stall                                2425.00 (    +0.00%)     992.00 (   -59.07%)
Pages kswapd scanned                     586756.68 (    +0.00%)  975390.20 (   +66.23%)
Pages kswapd reclaimed                   385468.20 (    +0.00%)  437767.50 (   +13.57%)
Pages direct scanned                     335199.56 (    +0.00%)  501824.20 (   +49.71%)
Pages direct reclaimed                   127953.72 (    +0.00%)  151880.70 (   +18.70%)
Pages scanned kswapd %                       64.43 (    +0.00%)      66.39 (    +2.99%)
Swap out                                  14083.88 (    +0.00%)   45034.60 (  +219.74%)
Swap in                                    3395.08 (    +0.00%)    7767.50 (  +128.75%)
File refaults                             93546.68 (    +0.00%)  129648.30 (   +38.59%)

The THP fault success rate is drastically improved. A bigger share of
the work is done by the background threads, as they now proactively
maintain MIGRATE_FREE block reserves. The increase in memory pressure
is shown by the uptick in swap activity.
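
For readers who want to cross-check the table: the "THP fault fail
rate %" row appears consistent with fallbacks divided by all THP
fault attempts. This is an inference from the numbers, not something
the report states, and other derived rows do not reproduce exactly
from the averaged columns. A small sketch:

```python
def thp_fail_rate(alloc, fallback):
    """Share of THP faults that fell back, in percent (assumed formula)."""
    return 100.0 * fallback / (alloc + fallback)

vanilla = thp_fail_rate(alloc=17913.24, fallback=1947.12)
patched = thp_fail_rate(alloc=19647.50, fallback=223.40)
print(f"vanilla {vanilla:.2f}%  patched {patched:.2f}%")
```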

Status

Initial test results look promising, but production testing has been
lagging behind the effort to generalize this code for upstream, and
putting all the pieces together to make THP work. I'll follow up as I
gather more data.

Sending this out now as an RFC to get input on the overall direction.

The patches are based on v6.2.

 Documentation/admin-guide/sysctl/vm.rst |  21 -
 block/bdev.c                            |   2 +-
 include/linux/compaction.h              | 100 +---
 include/linux/gfp.h                     |   2 -
 include/linux/mm.h                      |   1 -
 include/linux/mmzone.h                  |  30 +-
 include/linux/page-isolation.h          |  28 +-
 include/linux/pageblock-flags.h         |   4 +-
 include/linux/vmstat.h                  |   8 -
 include/trace/events/mmflags.h          |   4 +-
 kernel/sysctl.c                         |   8 -
 mm/compaction.c                         | 242 +++-----
 mm/internal.h                           |  14 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 930 +++++++++++++-----------------
 mm/page_isolation.c                     |  42 +-
 mm/vmscan.c                             | 251 ++------
 mm/vmstat.c                             |   6 +-
 18 files changed, 629 insertions(+), 1068 deletions(-)

Thread overview: 76+ messages
2023-04-18 19:12 Johannes Weiner [this message]
2023-04-18 19:12 ` [RFC PATCH 01/26] block: bdev: blockdev page cache is movable Johannes Weiner
2023-04-19  4:07   ` Matthew Wilcox
2023-04-21 12:25   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 02/26] mm: compaction: avoid GFP_NOFS deadlocks Johannes Weiner
2023-04-21 12:27   ` Mel Gorman
2023-04-21 14:17     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 03/26] mm: make pageblock_order 2M per default Johannes Weiner
2023-04-19  0:01   ` Kirill A. Shutemov
2023-04-19  2:55     ` Johannes Weiner
2023-04-19  3:44       ` Johannes Weiner
2023-04-19 11:10     ` David Hildenbrand
2023-04-19 10:36   ` Vlastimil Babka
2023-04-19 11:09     ` David Hildenbrand
2023-04-21 12:37   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 04/26] mm: page_isolation: write proper kerneldoc Johannes Weiner
2023-04-21 12:39   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 05/26] mm: page_alloc: per-migratetype pcplist for THPs Johannes Weiner
2023-04-21 12:47   ` Mel Gorman
2023-04-21 15:06     ` Johannes Weiner
2023-04-28 10:29       ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 06/26] mm: page_alloc: consolidate free page accounting Johannes Weiner
2023-04-21 12:54   ` Mel Gorman
2023-04-21 15:08     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 07/26] mm: page_alloc: move capture_control to the page allocator Johannes Weiner
2023-04-21 12:59   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 08/26] mm: page_alloc: claim blocks during compaction capturing Johannes Weiner
2023-04-21 13:12   ` Mel Gorman
2023-04-25 14:39     ` Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 09/26] mm: page_alloc: move expand() above compaction_capture() Johannes Weiner
2023-04-18 19:12 ` [RFC PATCH 10/26] mm: page_alloc: allow compaction capturing from larger blocks Johannes Weiner
2023-04-21 14:14   ` Mel Gorman
2023-04-25 15:40     ` Johannes Weiner
2023-04-28 10:41       ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 11/26] mm: page_alloc: introduce MIGRATE_FREE Johannes Weiner
2023-04-21 14:25   ` Mel Gorman
2023-04-18 19:12 ` [RFC PATCH 12/26] mm: page_alloc: per-migratetype free counts Johannes Weiner
2023-04-21 14:28   ` Mel Gorman
2023-04-21 15:35     ` Johannes Weiner
2023-04-21 16:03       ` Mel Gorman
2023-04-21 16:32         ` Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 13/26] mm: compaction: remove compaction result helpers Johannes Weiner
2023-04-21 14:32   ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 14/26] mm: compaction: simplify should_compact_retry() Johannes Weiner
2023-04-21 14:36   ` Mel Gorman
2023-04-25  2:15     ` Johannes Weiner
2023-04-25  0:56   ` Huang, Ying
2023-04-25  2:11     ` Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 15/26] mm: compaction: simplify free block check in suitable_migration_target() Johannes Weiner
2023-04-21 14:39   ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 16/26] mm: compaction: improve compaction_suitable() accuracy Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 17/26] mm: compaction: refactor __compaction_suitable() Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 18/26] mm: compaction: remove unnecessary is_via_compact_memory() checks Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 19/26] mm: compaction: drop redundant watermark check in compaction_zonelist_suitable() Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 20/26] mm: vmscan: use compaction_suitable() check in kswapd Johannes Weiner
2023-04-25  3:12   ` Huang, Ying
2023-04-25 14:26     ` Johannes Weiner
2023-04-26  1:30       ` Huang, Ying
2023-04-26 15:22         ` Johannes Weiner
2023-04-27  5:41           ` Huang, Ying
2023-04-18 19:13 ` [RFC PATCH 21/26] mm: compaction: align compaction goals with reclaim goals Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 22/26] mm: page_alloc: manage free memory in whole pageblocks Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 23/26] mm: page_alloc: kill highatomic Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 24/26] mm: page_alloc: kill watermark boosting Johannes Weiner
2023-04-18 19:13 ` [RFC PATCH 25/26] mm: page_alloc: disallow fallbacks when 2M defrag is enabled Johannes Weiner
2023-04-21 14:56   ` Mel Gorman
2023-04-21 15:24     ` Johannes Weiner
2023-04-21 15:55       ` Mel Gorman
2023-04-18 19:13 ` [RFC PATCH 26/26] mm: page_alloc: add sanity checks for migratetypes Johannes Weiner
2023-04-18 23:54 ` [RFC PATCH 00/26] mm: reliable huge page allocator Kirill A. Shutemov
2023-04-19  2:08   ` Johannes Weiner
2023-04-19 10:56     ` Vlastimil Babka
2023-04-19  4:11 ` Matthew Wilcox
2023-04-21 16:11   ` Mel Gorman
2023-04-21 17:14     ` Matthew Wilcox
2023-05-02 15:21       ` David Hildenbrand
