* [PATCH v2 00/12] mm: userspace hugepage collapse
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	Allocation semantics are the same as khugepaged, and depend on
	(1) the active sysfs settings
	/sys/kernel/mm/transparent_hugepage/enabled and
	/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
	the VMA flags of the memory range being collapsed.

	Collapse eligibility criteria differ from khugepaged's in that
	the sysfs files
	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
	are ignored.

	When a range spans multiple VMAs, the semantics of the collapse
	over each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.
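
	As a hedged illustration only (not part of this series), a
	userspace agent might drive a remote collapse roughly as in the
	sketch below.  MADV_COLLAPSE and the process_madvise(2) wrapper
	may be absent from libc headers, so the value is defined locally
	(matching the asm-generic value proposed by this series) and the
	raw syscall is used; the helper name is made up:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic value proposed by this series */
#endif

/* Illustrative helper: collapse one range in the pidfd's address space. */
static int collapse_remote(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	/* Raw syscall; CAP_SYS_ADMIN is required when not acting on self. */
	if (syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0) < 0)
		return -1;
	return 0;
}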

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
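
	A minimal self-collapse sketch using madvise(2) directly (the
	2MB hugepage size, the manual alignment, and the locally-defined
	MADV_COLLAPSE value are assumptions for illustration):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumes 2MB pmd-sized THPs */

int main(void)
{
	/* Over-map so a hugepage-aligned region is guaranteed to fit. */
	size_t len = 2 * HPAGE_SIZE;
	uint8_t *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint8_t *hpage;

	if (map == MAP_FAILED)
		return 1;
	hpage = (uint8_t *)(((uintptr_t)map + HPAGE_SIZE - 1) &
			    ~(HPAGE_SIZE - 1));

	/* Fault in native pages, then request a synchronous collapse. */
	memset(hpage, 1, HPAGE_SIZE);
	if (madvise(hpage, HPAGE_SIZE, MADV_COLLAPSE))
		return 1;	/* 0 is returned on "success" */

	munmap(map, len);
	return 0;
}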

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[3].

Sequence of Patches
--------------------------------

Patches 1-4 perform refactoring of collapse logic within khugepaged.c
and introduce the notion of a collapse context.

Patches 5-9 introduce MADV_COLLAPSE, do some renaming, add support so
that the MADV_COLLAPSE context has the eligibility and allocation
semantics referenced above, and add process_madvise(2) support.

Patches 10-12 add selftests to test the new functionality.

Applies against next-20220414.

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixed issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse
  functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from the RFC; most patches are either
  entirely new or substantially reworked.
* Collapse eligibility criteria have changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  now match those of khugepaged for the same VMA, instead of the gfp flags used
  at fault by the calling process for that VMA.
* Collapse semantics have changed: the semantics for multiple VMAs spanned by
  a single MADV_COLLAPSE call are now independent, whereas before the idea was
  to allow direct reclaim/compaction if any spanned VMA permitted it.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of hugepages
  in a batched fashion with the aim of doing multiple page table updates inside
  a single mmap_lock write.  This has been changed, and the implementation now
  collapses each hugepage-aligned/sized region iteratively.  This was motivated
  by an experiment which showed that, when multiple threads were concurrently
  faulting during a MADV_COLLAPSE operation, mean and tail latency to acquire
  mmap_lock in read for threads in the fault path was improved by using a batch
  size of 1 (batch sizes of 1, 8, 16, 32 were tested)[4].
* Added: If a collapse operation fails because a page isn't found on the LRU, do
  a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/
[3] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[4] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/

Zach O'Keefe (12):
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: make hugepage allocation context-specific
  mm/khugepaged: add struct collapse_result
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: remove khugepaged prefix from shared collapse functions
  mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  mm/khugepaged: add flag to ignore page young/referenced requirement
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add test to verify recollapse of THPs

 include/linux/huge_mm.h                 |  12 +
 include/trace/events/huge_memory.h      |   5 +-
 include/uapi/asm-generic/mman-common.h  |   2 +
 mm/internal.h                           |   1 +
 mm/khugepaged.c                         | 598 ++++++++++++++++--------
 mm/madvise.c                            |  11 +-
 mm/rmap.c                               |  15 +-
 tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
 8 files changed, 702 insertions(+), 359 deletions(-)

-- 
2.35.1.1178.g4f1659d476-goog




* [PATCH v2 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe, kernel test robot

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE, which needs to know what
hugepage-aligned/sized regions are already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/trace/events/huge_memory.h |  3 ++-
 mm/internal.h                      |  1 +
 mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
 mm/rmap.c                          | 15 +++++++++++++--
 4 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..9faa678e0a5b 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -33,7 +33,8 @@
 	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
 	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
 	EM( SCAN_TRUNCATED,		"truncated")			\
-	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
+	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
+	EMe(SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 
 #undef EM
 #undef EMe
diff --git a/mm/internal.h b/mm/internal.h
index 48eb2d24fcd2..24fca92bd51a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -173,6 +173,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cb43c3aee8b2..5e5404aa6579 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -51,6 +51,7 @@ enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_PMD_MAPPED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -987,6 +988,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd_raw(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde) || pmd_none(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1238,11 +1262,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index edfe61f95a7f..bf2a3a08d965 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
-- 
2.36.0.rc0.470.gd361397f0d-goog




* [PATCH v2 02/12] mm/khugepaged: add struct collapse_control
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Modularize hugepage collapse by introducing struct collapse_control.
This structure describes the properties of the requested collapse, and
also serves as a local scratchpad to use during the collapse itself.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 79 ++++++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5e5404aa6579..25f45ac7f6bd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() for this scan */
+	int last_target_node;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -796,9 +804,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -810,11 +816,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -829,28 +835,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++) {
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
+		}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 
@@ -888,7 +894,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1248,7 +1254,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage)
+			       struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1266,7 +1273,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1332,16 +1339,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1392,7 +1399,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
 				referenced, unmapped);
@@ -2044,7 +2051,8 @@ static void collapse_file(struct mm_struct *mm,
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2055,7 +2063,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2080,11 +2088,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -2117,7 +2125,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2126,7 +2134,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2137,7 +2146,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2213,12 +2223,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2274,7 +2284,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2298,7 +2308,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2337,12 +2347,15 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+	};
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&cc);
 		khugepaged_wait_work();
 	}
 
-- 
2.36.0.rc0.470.gd361397f0d-goog




* [PATCH v2 03/12] mm/khugepaged: make hugepage allocation context-specific
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe, kernel test robot

Add hugepage allocation context to struct collapse_control, allowing
different collapse contexts to allocate hugepages differently.  For
example, khugepaged decides to allocate differently in NUMA and UMA
configurations, and other collapse contexts shouldn't be coupled to this
decision.

Additionally, move the [pre]allocated hugepage pointer into
struct collapse_control.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 mm/khugepaged.c | 96 ++++++++++++++++++++++++-------------------------
 1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 25f45ac7f6bd..21c8436fa73c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,6 +92,10 @@ struct collapse_control {
 
 	/* Last target selected in khugepaged_find_target_node() for this scan */
 	int last_target_node;
+
+	struct page *hpage;
+	struct page* (*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
+				    int node);
 };
 
 /**
@@ -877,21 +881,21 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static struct page *khugepaged_alloc_page(struct collapse_control *cc,
+					  gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
 
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!*hpage)) {
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		*hpage = ERR_PTR(-ENOMEM);
+		cc->hpage = ERR_PTR(-ENOMEM);
 		return NULL;
 	}
 
-	prep_transhuge_page(*hpage);
+	prep_transhuge_page(cc->hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	return cc->hpage;
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -953,12 +957,12 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static struct page *khugepaged_alloc_page(struct collapse_control *cc,
+					  gfp_t gfp, int node)
 {
-	VM_BUG_ON(!*hpage);
+	VM_BUG_ON(!cc->hpage);
 
-	return  *hpage;
+	return cc->hpage;
 }
 #endif
 
@@ -1080,10 +1084,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			       struct collapse_control *cc, int referenced,
+			       int unmapped)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1096,6 +1099,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	struct mmu_notifier_range range;
 	gfp_t gfp;
 	const struct cpumask *cpumask;
+	int node;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1110,13 +1114,14 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	mmap_read_unlock(mm);
 
+	node = khugepaged_find_target_node(cc);
 	/* sched to specified node before huage page memory copy */
 	if (task_node(current) != node) {
 		cpumask = cpumask_of_node(node);
 		if (!cpumask_empty(cpumask))
 			set_cpus_allowed_ptr(current, cpumask);
 	}
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
+	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out_nolock;
@@ -1238,15 +1243,15 @@ static void collapse_huge_page(struct mm_struct *mm,
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 
-	*hpage = NULL;
+	cc->hpage = NULL;
 
 	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
@@ -1254,7 +1259,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1399,10 +1403,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		collapse_huge_page(mm, address, cc, referenced, unmapped);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1667,8 +1669,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @mm: process address space where collapse happens
  * @file: file that collapse on
  * @start: collapse start address
- * @hpage: new allocated huge page for collapse
- * @node: appointed node the new huge page allocate from
+ * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1686,8 +1687,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + unlock and free huge page;
  */
 static void collapse_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start,
-		struct page **hpage, int node)
+			  struct file *file, pgoff_t start,
+			  struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
 	gfp_t gfp;
@@ -1697,15 +1698,16 @@ static void collapse_file(struct mm_struct *mm,
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
 	bool is_shmem = shmem_file(file);
-	int nr;
+	int nr, node;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
 	/* Only allocate from the target node */
 	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	node = khugepaged_find_target_node(cc);
 
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
+	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out;
@@ -1998,7 +2000,7 @@ static void collapse_file(struct mm_struct *mm,
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
 		retract_page_tables(mapping, start);
-		*hpage = NULL;
+		cc->hpage = NULL;
 
 		khugepaged_pages_collapsed++;
 	} else {
@@ -2045,14 +2047,14 @@ static void collapse_file(struct mm_struct *mm,
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	/* TODO: tracepoints */
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage,
-		struct collapse_control *cc)
+				 struct file *file, pgoff_t start,
+				 struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2125,8 +2127,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
-			collapse_file(mm, file, start, hpage, node);
+			collapse_file(mm, file, start, cc);
 		}
 	}
 
@@ -2134,8 +2135,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage,
-		struct collapse_control *cc)
+				 struct file *file, pgoff_t start,
+				 struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2146,7 +2147,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2223,12 +2223,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
+				khugepaged_scan_file(mm, file, pgoff, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address,
-						hpage, cc);
+						khugepaged_scan.address, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2286,15 +2285,15 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 	bool wait = true;
 
+	cc->hpage = NULL;
 	lru_add_drain_all();
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
+		if (!khugepaged_prealloc_page(&cc->hpage, &wait))
 			break;
 
 		cond_resched();
@@ -2308,14 +2307,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage, cc);
+							    cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
 	}
 
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		put_page(cc->hpage);
 }
 
 static bool khugepaged_should_wakeup(void)
@@ -2349,6 +2348,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.last_target_node = NUMA_NO_NODE,
+		.alloc_hpage = &khugepaged_alloc_page,
 	};
 
 	set_freezable();
-- 
2.36.0.rc0.470.gd361397f0d-goog




* [PATCH v2 04/12] mm/khugepaged: add struct collapse_result
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Add struct collapse_result which aggregates data from a single
khugepaged_scan_pmd() or khugepaged_scan_file() request.  Change
khugepaged to take action based on this returned data instead of deep
within the collapsing functions themselves.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 187 ++++++++++++++++++++++++++----------------------
 1 file changed, 101 insertions(+), 86 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 21c8436fa73c..e330a95a0479 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -98,6 +98,14 @@ struct collapse_control {
 				    int node);
 };
 
+/* Gather information from one khugepaged_scan_[pmd|file]() request */
+struct collapse_result {
+	enum scan_result result;
+
+	/* Was mmap_lock dropped during request? */
+	bool dropped_mmap_lock;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -742,13 +750,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 						    referenced, writable, result);
-		return 1;
+		return SCAN_SUCCEED;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 					    referenced, writable, result);
-	return 0;
+	return result;
 }
 
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
@@ -1086,7 +1094,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 
 static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			       struct collapse_control *cc, int referenced,
-			       int unmapped)
+			       int unmapped, struct collapse_result *cr)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1094,7 +1102,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pgtable_t pgtable;
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 	gfp_t gfp;
@@ -1102,6 +1109,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	int node;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	cr->result = SCAN_FAIL;
 
 	/* Only allocate from the target node */
 	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
@@ -1113,6 +1121,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+	cr->dropped_mmap_lock = true;
 
 	node = khugepaged_find_target_node(cc);
 	/* sched to specified node before huage page memory copy */
@@ -1123,26 +1132,26 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	}
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		cr->result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out_nolock;
 	}
 
 	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+		cr->result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out_nolock;
 	}
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result) {
+	cr->result = hugepage_vma_revalidate(mm, address, &vma);
+	if (cr->result) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
-		result = SCAN_PMD_NULL;
+		cr->result = SCAN_PMD_NULL;
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -1165,8 +1174,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result)
+	cr->result = hugepage_vma_revalidate(mm, address, &vma);
+	if (cr->result)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	if (mm_find_pmd(mm, address) != pmd)
@@ -1193,11 +1202,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+	cr->result =  __collapse_huge_page_isolate(vma, address, pte,
+						   &compound_pagelist);
 	spin_unlock(pte_ptl);
 
-	if (unlikely(!isolated)) {
+	if (unlikely(cr->result != SCAN_SUCCEED)) {
 		pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
@@ -1209,7 +1218,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
-		result = SCAN_FAIL;
+		cr->result = SCAN_FAIL;
 		goto out_up_write;
 	}
 
@@ -1245,25 +1254,25 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	cc->hpage = NULL;
 
-	khugepaged_pages_collapsed++;
-	result = SCAN_SUCCEED;
+	cr->result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
 	if (!IS_ERR_OR_NULL(cc->hpage))
 		mem_cgroup_uncharge(page_folio(cc->hpage));
-	trace_mm_collapse_huge_page(mm, isolated, result);
+	trace_mm_collapse_huge_page(mm, cr->result == SCAN_SUCCEED, cr->result);
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct collapse_control *cc)
+static void khugepaged_scan_pmd(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				unsigned long address,
+				struct collapse_control *cc,
+				struct collapse_result *cr)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
+	int referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
@@ -1272,9 +1281,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	bool writable = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	cr->result = SCAN_FAIL;
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
-	if (result != SCAN_SUCCEED)
+	cr->result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (cr->result != SCAN_SUCCEED)
 		goto out;
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
@@ -1290,12 +1300,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 				 * comment below for pte_uffd_wp().
 				 */
 				if (pte_swp_uffd_wp(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
+					cr->result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
 				continue;
 			} else {
-				result = SCAN_EXCEED_SWAP_PTE;
+				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
@@ -1305,7 +1315,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			    ++none_or_zero <= khugepaged_max_ptes_none) {
 				continue;
 			} else {
-				result = SCAN_EXCEED_NONE_PTE;
+				cr->result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
@@ -1320,7 +1330,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			 * userfault messages that falls outside of
 			 * the registered range.  So, just be simple.
 			 */
-			result = SCAN_PTE_UFFD_WP;
+			cr->result = SCAN_PTE_UFFD_WP;
 			goto out_unmap;
 		}
 		if (pte_write(pteval))
@@ -1328,13 +1338,13 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page)) {
-			result = SCAN_PAGE_NULL;
+			cr->result = SCAN_PAGE_NULL;
 			goto out_unmap;
 		}
 
 		if (page_mapcount(page) > 1 &&
 				++shared > khugepaged_max_ptes_shared) {
-			result = SCAN_EXCEED_SHARED_PTE;
+			cr->result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
 		}
@@ -1349,20 +1359,20 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 */
 		node = page_to_nid(page);
 		if (khugepaged_scan_abort(node, cc)) {
-			result = SCAN_SCAN_ABORT;
+			cr->result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
 		cc->node_load[node]++;
 		if (!PageLRU(page)) {
-			result = SCAN_PAGE_LRU;
+			cr->result = SCAN_PAGE_LRU;
 			goto out_unmap;
 		}
 		if (PageLocked(page)) {
-			result = SCAN_PAGE_LOCK;
+			cr->result = SCAN_PAGE_LOCK;
 			goto out_unmap;
 		}
 		if (!PageAnon(page)) {
-			result = SCAN_PAGE_ANON;
+			cr->result = SCAN_PAGE_ANON;
 			goto out_unmap;
 		}
 
@@ -1384,7 +1394,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * will be done again later the risk seems low.
 		 */
 		if (!is_refcount_suitable(page)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
 		if (pte_young(pteval) ||
@@ -1393,23 +1403,20 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			referenced++;
 	}
 	if (!writable) {
-		result = SCAN_PAGE_RO;
+		cr->result = SCAN_PAGE_RO;
 	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
-		result = SCAN_LACK_REFERENCED_PAGE;
+		cr->result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
-		result = SCAN_SUCCEED;
-		ret = 1;
+		cr->result = SCAN_SUCCEED;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+	if (cr->result == SCAN_SUCCEED)
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, cc, referenced, unmapped);
-	}
+		collapse_huge_page(mm, address, cc, referenced, unmapped, cr);
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, result, unmapped);
-	return ret;
+				     none_or_zero, cr->result, unmapped);
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1670,6 +1677,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @file: file that collapse on
  * @start: collapse start address
  * @cc: collapse context and scratchpad
+ * @cr: aggregate result information of collapse
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1688,7 +1696,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  */
 static void collapse_file(struct mm_struct *mm,
 			  struct file *file, pgoff_t start,
-			  struct collapse_control *cc)
+			  struct collapse_control *cc,
+			  struct collapse_result *cr)
+
 {
 	struct address_space *mapping = file->f_mapping;
 	gfp_t gfp;
@@ -1696,25 +1706,27 @@ static void collapse_file(struct mm_struct *mm,
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
-	int nr_none = 0, result = SCAN_SUCCEED;
+	int nr_none = 0;
 	bool is_shmem = shmem_file(file);
 	int nr, node;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
+	cr->result = SCAN_SUCCEED;
+
 	/* Only allocate from the target node */
 	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
 	node = khugepaged_find_target_node(cc);
 
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		cr->result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out;
 	}
 
 	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+		cr->result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out;
 	}
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
@@ -1730,7 +1742,7 @@ static void collapse_file(struct mm_struct *mm,
 			break;
 		xas_unlock_irq(&xas);
 		if (!xas_nomem(&xas, GFP_KERNEL)) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out;
 		}
 	} while (1);
@@ -1761,13 +1773,13 @@ static void collapse_file(struct mm_struct *mm,
 				 */
 				if (index == start) {
 					if (!xas_next_entry(&xas, end - 1)) {
-						result = SCAN_TRUNCATED;
+						cr->result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
 					xas_set(&xas, index);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_locked;
 				}
 				xas_store(&xas, new_page);
@@ -1780,14 +1792,14 @@ static void collapse_file(struct mm_struct *mm,
 				/* swap in or instantiate fallocated page */
 				if (shmem_getpage(mapping->host, index, &page,
 						  SGP_NOALLOC)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);
 			} else {
-				result = SCAN_PAGE_LOCK;
+				cr->result = SCAN_PAGE_LOCK;
 				goto xa_locked;
 			}
 		} else {	/* !is_shmem */
@@ -1800,7 +1812,7 @@ static void collapse_file(struct mm_struct *mm,
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
 				if (unlikely(page == NULL)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
 			} else if (PageDirty(page)) {
@@ -1819,17 +1831,17 @@ static void collapse_file(struct mm_struct *mm,
 				 */
 				xas_unlock_irq(&xas);
 				filemap_flush(mapping);
-				result = SCAN_FAIL;
+				cr->result = SCAN_FAIL;
 				goto xa_unlocked;
 			} else if (PageWriteback(page)) {
 				xas_unlock_irq(&xas);
-				result = SCAN_FAIL;
+				cr->result = SCAN_FAIL;
 				goto xa_unlocked;
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);
 			} else {
-				result = SCAN_PAGE_LOCK;
+				cr->result = SCAN_PAGE_LOCK;
 				goto xa_locked;
 			}
 		}
@@ -1842,7 +1854,7 @@ static void collapse_file(struct mm_struct *mm,
 
 		/* make sure the page is up to date */
 		if (unlikely(!PageUptodate(page))) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out_unlock;
 		}
 
@@ -1851,12 +1863,12 @@ static void collapse_file(struct mm_struct *mm,
 		 * we locked the first page, then a THP might be there already.
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			cr->result = SCAN_PAGE_COMPOUND;
 			goto out_unlock;
 		}
 
 		if (page_mapping(page) != mapping) {
-			result = SCAN_TRUNCATED;
+			cr->result = SCAN_TRUNCATED;
 			goto out_unlock;
 		}
 
@@ -1867,18 +1879,18 @@ static void collapse_file(struct mm_struct *mm,
 			 * page is dirty because it hasn't been flushed
 			 * since first write.
 			 */
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out_unlock;
 		}
 
 		if (isolate_lru_page(page)) {
-			result = SCAN_DEL_PAGE_LRU;
+			cr->result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
 		}
 
 		if (page_has_private(page) &&
 		    !try_to_release_page(page, GFP_KERNEL)) {
-			result = SCAN_PAGE_HAS_PRIVATE;
+			cr->result = SCAN_PAGE_HAS_PRIVATE;
 			putback_lru_page(page);
 			goto out_unlock;
 		}
@@ -1899,7 +1911,7 @@ static void collapse_file(struct mm_struct *mm,
 		 *  - one from isolate_lru_page;
 		 */
 		if (!page_ref_freeze(page, 3)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			xas_unlock_irq(&xas);
 			putback_lru_page(page);
 			goto out_unlock;
@@ -1934,7 +1946,7 @@ static void collapse_file(struct mm_struct *mm,
 		 */
 		smp_mb();
 		if (inode_is_open_for_write(mapping->host)) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
 			filemap_nr_thps_dec(mapping);
 			goto xa_locked;
@@ -1961,7 +1973,7 @@ static void collapse_file(struct mm_struct *mm,
 	 */
 	try_to_unmap_flush();
 
-	if (result == SCAN_SUCCEED) {
+	if (cr->result == SCAN_SUCCEED) {
 		struct page *page, *tmp;
 
 		/*
@@ -2001,8 +2013,6 @@ static void collapse_file(struct mm_struct *mm,
 		 */
 		retract_page_tables(mapping, start);
 		cc->hpage = NULL;
-
-		khugepaged_pages_collapsed++;
 	} else {
 		struct page *page;
 
@@ -2054,15 +2064,16 @@ static void collapse_file(struct mm_struct *mm,
 
 static void khugepaged_scan_file(struct mm_struct *mm,
 				 struct file *file, pgoff_t start,
-				 struct collapse_control *cc)
+				 struct collapse_control *cc,
+				 struct collapse_result *cr)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, start);
 	int present, swap;
 	int node = NUMA_NO_NODE;
-	int result = SCAN_SUCCEED;
 
+	cr->result = SCAN_SUCCEED;
 	present = 0;
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
@@ -2073,7 +2084,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 		if (xa_is_value(page)) {
 			if (++swap > khugepaged_max_ptes_swap) {
-				result = SCAN_EXCEED_SWAP_PTE;
+				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
 			}
@@ -2085,25 +2096,25 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		 * into a PMD sized page
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			cr->result = SCAN_PAGE_COMPOUND;
 			break;
 		}
 
 		node = page_to_nid(page);
 		if (khugepaged_scan_abort(node, cc)) {
-			result = SCAN_SCAN_ABORT;
+			cr->result = SCAN_SCAN_ABORT;
 			break;
 		}
 		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
-			result = SCAN_PAGE_LRU;
+			cr->result = SCAN_PAGE_LRU;
 			break;
 		}
 
 		if (page_count(page) !=
 		    1 + page_mapcount(page) + page_has_private(page)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			break;
 		}
 
@@ -2122,12 +2133,12 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	}
 	rcu_read_unlock();
 
-	if (result == SCAN_SUCCEED) {
+	if (cr->result == SCAN_SUCCEED) {
 		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
-			result = SCAN_EXCEED_NONE_PTE;
+			cr->result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, cc);
+			collapse_file(mm, file, start, cc, cr);
 		}
 	}
 
@@ -2136,7 +2147,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
 				 struct file *file, pgoff_t start,
-				 struct collapse_control *cc)
+				 struct collapse_control *cc,
+				 struct collapse_result *cr)
 {
 	BUILD_BUG();
 }
@@ -2208,7 +2220,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			goto skip;
 
 		while (khugepaged_scan.address < hend) {
-			int ret;
+			struct collapse_result cr = {0};
 			cond_resched();
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
@@ -2222,17 +2234,20 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, cc);
+				cr.dropped_mmap_lock = true;
+				khugepaged_scan_file(mm, file, pgoff, cc, &cr);
 				fput(file);
 			} else {
-				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address, cc);
+				khugepaged_scan_pmd(mm, vma,
+						    khugepaged_scan.address,
+						    cc, &cr);
 			}
+			if (cr.result == SCAN_SUCCEED)
+				++khugepaged_pages_collapsed;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-			if (ret)
+			if (cr.dropped_mmap_lock)
 				/* we released mmap_lock so break loop */
 				goto breakouterloop_mmap_lock;
 			if (progress >= pages)
-- 
2.36.0.rc0.470.gd361397f0d-goog




* [PATCH v2 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
From: Zach O'Keefe @ 2022-04-14 18:06 UTC
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe, kernel test robot

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality include:

* immediately back executable text by hugepages.  Current support
  provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
  system.
* malloc implementations that manage memory in hugepage-sized chunks,
  but sometimes subrelease memory back to the system in native-sized
  chunks via MADV_DONTNEED, zapping the pmd.  Later, when the memory
  is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
  memory by THP to regain TLB performance (see the sketch after this
  list).
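
A hedged sketch of that allocator flow (helper names are illustrative;
assumes the asm-generic MADV_COLLAPSE value added by this patch):

#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Illustrative hooks for an allocator managing hugepage-aligned chunks. */
static void chunk_subrelease(void *page, size_t len)
{
	/* Return cold native-sized pages; this zaps the covering pmd. */
	madvise(page, len, MADV_DONTNEED);
}

static void chunk_rewarm(void *chunk, size_t hpage_size)
{
	/* Memory is hot again: synchronously re-back it with a THP. */
	madvise(chunk, hpage_size, MADV_COLLAPSE);
}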

Allocation semantics are the same as khugepaged, and depend on (1) the
active sysfs settings /sys/kernel/mm/transparent_hugepage/enabled and
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2) the VMA
flags of the memory range being collapsed.

Only privately-mapped anon memory is supported for now.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 ++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 149 +++++++++++++++++++++++--
 mm/madvise.c                           |   5 +
 8 files changed, 164 insertions(+), 12 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 816a9937f30e..ddad7c7af44e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -236,6 +236,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -392,6 +395,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e330a95a0479..c745829c3965 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -846,6 +846,23 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
+static struct page *alloc_hpage(struct collapse_control *cc, gfp_t gfp,
+				int node)
+{
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
+
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
+		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		cc->hpage = ERR_PTR(-ENOMEM);
+		return NULL;
+	}
+
+	prep_transhuge_page(cc->hpage);
+	count_vm_event(THP_COLLAPSE_ALLOC);
+	return cc->hpage;
+}
+
 #ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(struct collapse_control *cc)
 {
@@ -892,18 +909,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static struct page *khugepaged_alloc_page(struct collapse_control *cc,
 					  gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
-
-	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!cc->hpage)) {
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		cc->hpage = ERR_PTR(-ENOMEM);
-		return NULL;
-	}
-
-	prep_transhuge_page(cc->hpage);
-	count_vm_event(THP_COLLAPSE_ALLOC);
-	return cc->hpage;
+	return alloc_hpage(cc, gfp, node);
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -2469,3 +2475,122 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static void madvise_collapse_cleanup_page(struct page **hpage)
+{
+	if (!IS_ERR(*hpage) && *hpage)
+		put_page(*hpage);
+	*hpage = NULL;
+}
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+		.hpage = NULL,
+		.alloc_hpage = &alloc_hpage,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	struct collapse_result cr;
+	unsigned long hstart, hend, addr;
+	int thps = 0, nr_hpages = 0;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+
+	if (hstart >= hend || !transparent_hugepage_active(vma))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; ; ) {
+		mmap_assert_locked(mm);
+		cond_resched();
+		memset(&cr, 0, sizeof(cr));
+
+		if (unlikely(khugepaged_test_exit(mm)))
+			break;
+
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		khugepaged_scan_pmd(mm, vma, addr, &cc, &cr);
+		if (cr.dropped_mmap_lock)
+			*prev = NULL;  /* tell madvise we dropped mmap_lock */
+
+		switch (cr.result) {
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			break;
+		case SCAN_PAGE_LRU:
+			lru_add_drain_all();
+			goto retry;
+		default:
+			/* Other error, exit */
+			goto break_loop;
+		}
+		addr += HPAGE_PMD_SIZE;
+		if (addr >= hend)
+			break;
+retry:
+		if (cr.dropped_mmap_lock) {
+			mmap_read_lock(mm);
+			if (hugepage_vma_revalidate(mm, addr, &vma))
+				goto out;
+		}
+		madvise_collapse_cleanup_page(&cc.hpage);
+	}
+
+break_loop:
+	/* madvise_walk_vmas() expects us to hold mmap_lock on return */
+	if (cr.dropped_mmap_lock)
+		mmap_read_lock(mm);
+out:
+	mmap_assert_locked(mm);
+	madvise_collapse_cleanup_page(&cc.hpage);
+	mmdrop(mm);
+
+	return thps == nr_hpages ? 0 : madvise_collapse_errno(cr.result);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index ec03a76244b7..7ad53e5311cf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1051,6 +1052,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1144,6 +1147,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1333,6 +1337,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe, kernel test robot

The following functions/tracepoints are shared between khugepaged and
madvise collapse contexts.  Remove the khugepaged prefixes.

tracepoint:mm_khugepaged_scan_pmd -> tracepoint:mm_scan_pmd
khugepaged_test_exit() -> test_exit()
khugepaged_scan_abort() -> scan_abort()
khugepaged_scan_pmd() -> scan_pmd()
khugepaged_find_target_node() -> find_target_node()

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/trace/events/huge_memory.h |  2 +-
 mm/khugepaged.c                    | 70 ++++++++++++++----------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 9faa678e0a5b..09be0e2f76b1 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -48,7 +48,7 @@ SCAN_STATUS
 #define EM(a, b)	{a, b},
 #define EMe(a, b)	{a, b}
 
-TRACE_EVENT(mm_khugepaged_scan_pmd,
+TRACE_EVENT(mm_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
 		 int referenced, int none_or_zero, int status, int unmapped),
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c745829c3965..716ba465b356 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -90,7 +90,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() for this scan */
+	/* Last target selected in find_target_node() for this scan */
 	int last_target_node;
 
 	struct page *hpage;
@@ -453,7 +453,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -505,7 +505,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 		return;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return;
@@ -557,12 +557,11 @@ void __khugepaged_exit(struct mm_struct *mm)
 		mmdrop(mm);
 	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_lock.
+		 * This is required to serialize against test_exit() (which is
+		 * guaranteed to run under mmap sem read mode). Stop here
+		 * (after we return all pagetables will be destroyed) until
+		 * khugepaged has finished working on the pagetables under
+		 * the mmap_lock.
 		 */
 		mmap_write_lock(mm);
 		mmap_write_unlock(mm);
@@ -816,7 +815,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
+static bool scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -864,7 +863,7 @@ static struct page *alloc_hpage(struct collapse_control *cc, gfp_t gfp,
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -912,7 +911,7 @@ static struct page *khugepaged_alloc_page(struct collapse_control *cc,
 	return alloc_hpage(cc, gfp, node);
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -993,7 +992,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long hstart, hend;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -1037,7 +1036,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ * Only done if scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held,
  * but with mmap_lock held to protect against vma changes.
@@ -1129,7 +1128,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmap_read_unlock(mm);
 	cr->dropped_mmap_lock = true;
 
-	node = khugepaged_find_target_node(cc);
+	node = find_target_node(cc);
 	/* sched to specified node before huge page memory copy */
 	if (task_node(current) != node) {
 		cpumask = cpumask_of_node(node);
@@ -1270,11 +1269,9 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return;
 }
 
-static void khugepaged_scan_pmd(struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				unsigned long address,
-				struct collapse_control *cc,
-				struct collapse_result *cr)
+static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, struct collapse_control *cc,
+		     struct collapse_result *cr)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1364,7 +1361,7 @@ static void khugepaged_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (scan_abort(node, cc)) {
 			cr->result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1421,8 +1418,8 @@ static void khugepaged_scan_pmd(struct mm_struct *mm,
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, cc, referenced, unmapped, cr);
 out:
-	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, cr->result, unmapped);
+	trace_mm_scan_pmd(mm, page, writable, referenced, none_or_zero,
+			  cr->result, unmapped);
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1431,7 +1428,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1602,7 +1599,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1665,7 +1662,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
+			if (!test_exit(mm) && !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -1723,7 +1720,7 @@ static void collapse_file(struct mm_struct *mm,
 
 	/* Only allocate from the target node */
 	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-	node = khugepaged_find_target_node(cc);
+	node = find_target_node(cc);
 
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
@@ -2107,7 +2104,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (scan_abort(node, cc)) {
 			cr->result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2196,7 +2193,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	vma = NULL;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
-	if (likely(!khugepaged_test_exit(mm)))
+	if (likely(!test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
 
 	progress++;
@@ -2204,7 +2201,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2228,7 +2225,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		while (khugepaged_scan.address < hend) {
 			struct collapse_result cr = {0};
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(test_exit(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2244,9 +2241,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 				khugepaged_scan_file(mm, file, pgoff, cc, &cr);
 				fput(file);
 			} else {
-				khugepaged_scan_pmd(mm, vma,
-						    khugepaged_scan.address,
-						    cc, &cr);
+				scan_pmd(mm, vma, khugepaged_scan.address, cc,
+					 &cr);
 			}
 			if (cr.result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
@@ -2270,7 +2266,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2541,11 +2537,11 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		cond_resched();
 		memset(&cr, 0, sizeof(cr));
 
-		if (unlikely(khugepaged_test_exit(mm)))
+		if (unlikely(test_exit(mm)))
 			break;
 
 		memset(cc.node_load, 0, sizeof(cc.node_load));
-		khugepaged_scan_pmd(mm, vma, addr, &cc, &cr);
+		scan_pmd(mm, vma, addr, &cc, &cr);
 		if (cr.dropped_mmap_lock)
 			*prev = NULL;  /* tell madvise we dropped mmap_lock */
 
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Add an enforce_pte_scan_limits flag to struct collapse_control that
allows a collapse context to ignore the sysfs-controlled knobs
khugepaged_max_ptes_[none|swap|shared].  Set this flag in the khugepaged
collapse context to preserve existing khugepaged behavior, and clear it
in the madvise collapse context, since the user presumably has reason to
believe the collapse will be beneficial.
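
As a hedged illustration of the user-visible difference (not part of
this patch; it assumes a 2MiB PMD size and the MADV_COLLAPSE value
proposed earlier in this series): khugepaged would skip a region once
khugepaged_max_ptes_none is exceeded, while madvise collapse is
expected to proceed anyway.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value proposed by this series */
#endif

int main(void)
{
	size_t hpage_size = 2UL << 20;	/* assumed PMD size */
	/* hugepage-aligned address hint, as used by the selftests */
	char *p = mmap((void *)(1UL << 30), hpage_size,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	madvise(p, hpage_size, MADV_HUGEPAGE);
	p[0] = 1;	/* populate a single native page out of 512 */
	/*
	 * khugepaged would give up with SCAN_EXCEED_NONE_PTE under the
	 * default max_ptes_none; MADV_COLLAPSE ignores that limit.
	 */
	if (madvise(p, hpage_size, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");
	munmap(p, hpage_size);
	return 0;
}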

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 716ba465b356..2f95f60431aa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -87,6 +87,9 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	/* Respect khugepaged_max_ptes_[none|swap|shared] */
+	bool enforce_pte_scan_limits;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -631,6 +634,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -644,7 +648,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -664,8 +669,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits && page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -1207,7 +1212,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	cr->result =  __collapse_huge_page_isolate(vma, address, pte,
+	cr->result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 						   &compound_pagelist);
 	spin_unlock(pte_ptl);
 
@@ -1296,7 +1301,8 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->enforce_pte_scan_limits) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1315,7 +1321,8 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				cr->result = SCAN_EXCEED_NONE_PTE;
@@ -1345,8 +1352,9 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_unmap;
 		}
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits &&
+		    page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			cr->result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -2086,7 +2094,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->enforce_pte_scan_limits &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2137,7 +2146,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	rcu_read_unlock();
 
 	if (cr->result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->enforce_pte_scan_limits) {
 			cr->result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
@@ -2364,6 +2374,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
+		.enforce_pte_scan_limits = true,
 		.last_target_node = NUMA_NO_NODE,
 		.alloc_hpage = &khugepaged_alloc_page,
 	};
@@ -2505,6 +2516,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		     unsigned long start, unsigned long end)
 {
 	struct collapse_control cc = {
+		.enforce_pte_scan_limits = false,
 		.last_target_node = NUMA_NO_NODE,
 		.hpage = NULL,
 		.alloc_hpage = &alloc_hpage,
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Add an enforce_young flag to struct collapse_control that allows a
collapse context to ignore the requirement that some pages in the
region being collapsed be young or referenced.  Set this flag in the
khugepaged collapse context to preserve existing khugepaged behavior,
and clear it in the madvise collapse context, since the user presumably
has reason to believe the collapse will be beneficial.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2f95f60431aa..b9bf15faba26 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -90,6 +90,9 @@ struct collapse_control {
 	/* Respect khugepaged_max_ptes_[none|swap|shared] */
 	bool enforce_pte_scan_limits;
 
+	/* Require memory to be young */
+	bool enforce_young;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -737,9 +740,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			list_add_tail(&page->lru, compound_pagelist);
 next:
 		/* There should be enough young pte to collapse the page */
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 
 		if (pte_write(pteval))
@@ -748,7 +752,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	if (unlikely(!writable)) {
 		result = SCAN_PAGE_RO;
-	} else if (unlikely(!referenced)) {
+	} else if (unlikely(cc->enforce_young && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1408,14 +1412,16 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			cr->result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 	}
 	if (!writable) {
 		cr->result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+	} else if (cc->enforce_young && (!referenced || (unmapped && referenced
+							 < HPAGE_PMD_NR / 2))) {
 		cr->result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		cr->result = SCAN_SUCCEED;
@@ -2375,6 +2381,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.enforce_pte_scan_limits = true,
+		.enforce_young = true,
 		.last_target_node = NUMA_NO_NODE,
 		.alloc_hpage = &khugepaged_alloc_page,
 	};
@@ -2517,6 +2524,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 {
 	struct collapse_control cc = {
 		.enforce_pte_scan_limits = false,
+		.enforce_young = false,
 		.last_target_node = NUMA_NO_NODE,
 		.hpage = NULL,
 		.alloc_hpage = &alloc_hpage,
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.
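
A hedged usage sketch (not part of this patch): since glibc may not
provide a process_madvise() wrapper, the raw syscall can be used.  The
syscall number and the MADV_COLLAPSE value below are assumptions
(process_madvise(2) is syscall 440 on most architectures since v5.10;
MADV_COLLAPSE is the value proposed by this series), and the pidfd is
assumed to come from pidfd_open(2).

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value proposed by this series */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumed v5.10+ syscall number */
#endif

/*
 * Request collapse of [addr, addr + len) in the process referred to by
 * pidfd.  Needs CAP_SYS_ADMIN unless pidfd refers to the caller itself.
 */
static long collapse_range(int pidfd, void *addr, size_t len)
{
	struct iovec iov = {
		.iov_base = addr,
		.iov_len  = len,
	};

	return syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
}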

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7ad53e5311cf..a5c82fa7972b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1165,13 +1165,15 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1449,7 +1451,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 10/12] selftests/vm: modularize collapse selftests
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Modularize the collapse action of the khugepaged collapse selftests by
introducing a struct collapse_context, which specifies how to collapse
a given memory range and the expected semantics of the collapse.  This
can be reused later to test other collapse contexts.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 257 +++++++++++-------------
 1 file changed, 116 insertions(+), 141 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 155120b67a16..c59d832fee96 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -23,6 +23,12 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct collapse_context {
+	const char *name;
+	void (*collapse)(const char *msg, char *p, bool expect);
+	bool enforce_pte_scan_limits;
+};
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -528,53 +534,39 @@ static void alloc_at_fault(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(void)
+static void collapse_full(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, hpage_pmd_size);
-	if (wait_for_scan("Collapse fully populated PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse fully populated PTE table", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_empty(void)
+static void collapse_empty(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
-	if (wait_for_scan("Do not collapse empty PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Do not collapse empty PTE table", p, false);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(void)
+static void collapse_single_pte_entry(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, page_size);
-	if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE entry present", p,
+			  true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(void)
+static void collapse_max_ptes_none(struct collapse_context *context)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = default_settings;
@@ -586,28 +578,23 @@ static void collapse_max_ptes_none(void)
 	p = alloc_mapping();
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	if (wait_for_scan("Do not collapse with max_ptes_none exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_none exceeded", p,
+			  !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-	if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		context->collapse("Collapse with max_ptes_none PTEs empty", p,
+				  true);
+		validate_memory(p, 0,
+				(hpage_pmd_nr - max_ptes_none) * page_size);
+	}
 
 	munmap(p, hpage_pmd_size);
 	write_settings(&default_settings);
 }
 
-static void collapse_swapin_single_pte(void)
+static void collapse_swapin_single_pte(struct collapse_context *context)
 {
 	void *p;
 	p = alloc_mapping();
@@ -625,18 +612,14 @@ static void collapse_swapin_single_pte(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Collapse with swapping in single PTE entry", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse with swapping in single PTE entry",
+			  p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(void)
+static void collapse_max_ptes_swap(struct collapse_context *context)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
@@ -656,39 +639,34 @@ static void collapse_max_ptes_swap(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Do not collapse with max_ptes_swap exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_swap exceeded",
+			      p, !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
-	printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
-	if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
-		perror("madvise(MADV_PAGEOUT)");
-		exit(EXIT_FAILURE);
-	}
-	if (check_swap(p, max_ptes_swap * page_size)) {
-		success("OK");
-	} else {
-		fail("Fail");
-		goto out;
-	}
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, hpage_pmd_size);
+		printf("Swapout %d of %d pages...", max_ptes_swap,
+		       hpage_pmd_nr);
+		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+			perror("madvise(MADV_PAGEOUT)");
+			exit(EXIT_FAILURE);
+		}
+		if (check_swap(p, max_ptes_swap * page_size)) {
+			success("OK");
+		} else {
+			fail("Fail");
+			goto out;
+		}
 
-	if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, hpage_pmd_size);
+		context->collapse("Collapse with max_ptes_swap pages swapped out",
+				  p, true);
+		validate_memory(p, 0, hpage_pmd_size);
+	}
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(void)
+static void collapse_single_pte_entry_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -710,17 +688,13 @@ static void collapse_single_pte_entry_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table with single PTE mapping compound page", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE mapping compound page",
+			  p, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(void)
+static void collapse_full_of_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -742,17 +716,12 @@ static void collapse_full_of_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of compound pages", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(void)
+static void collapse_compound_extreme(struct collapse_context *context)
 {
 	void *p;
 	int i;
@@ -798,18 +767,14 @@ static void collapse_compound_extreme(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of different compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of different compound pages",
+			  p, true);
 
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork(void)
+static void collapse_fork(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -835,13 +800,8 @@ static void collapse_fork(void)
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
-
-		if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table with single page shared with parent process",
+				  p, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -860,7 +820,7 @@ static void collapse_fork(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(void)
+static void collapse_fork_compound(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -896,14 +856,10 @@ static void collapse_fork_compound(void)
 		fill_memory(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
-		if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table full of compound pages in child",
+				  p, true);
 		write_num("khugepaged/max_ptes_shared",
-				default_settings.khugepaged.max_ptes_shared);
+			  default_settings.khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -922,7 +878,7 @@ static void collapse_fork_compound(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared()
+static void collapse_max_ptes_shared(struct collapse_context *context)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
@@ -957,28 +913,22 @@ static void collapse_max_ptes_shared()
 		else
 			fail("Fail");
 
-		if (wait_for_scan("Do not collapse with max_ptes_shared exceeded", p))
-			fail("Timeout");
-		else if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
-		if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-
-		if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Maybe collapse with max_ptes_shared exceeded",
+				  p, !context->enforce_pte_scan_limits);
+
+		if (context->enforce_pte_scan_limits) {
+			printf("Trigger CoW on page %d of %d...",
+			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+				    page_size);
+			if (!check_huge(p))
+				success("OK");
+			else
+				fail("Fail");
+
+			context->collapse("Collapse with max_ptes_shared PTEs shared",
+					  p, true);
+		}
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -997,8 +947,27 @@ static void collapse_max_ptes_shared()
 	munmap(p, hpage_pmd_size);
 }
 
+static void khugepaged_collapse(const char *msg, char *p, bool expect)
+{
+	if (wait_for_scan(msg, p))
+		fail("Timeout");
+	else if (check_huge(p) == expect)
+		success("OK");
+	else
+		fail("Fail");
+}
+
 int main(void)
 {
+	struct collapse_context contexts[] = {
+		{
+			.name = "khugepaged",
+			.collapse = &khugepaged_collapse,
+			.enforce_pte_scan_limits = true,
+		},
+	};
+	int i;
+
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
@@ -1014,18 +983,24 @@ int main(void)
 	adjust_settings();
 
 	alloc_at_fault();
-	collapse_full();
-	collapse_empty();
-	collapse_single_pte_entry();
-	collapse_max_ptes_none();
-	collapse_swapin_single_pte();
-	collapse_max_ptes_swap();
-	collapse_single_pte_entry_compound();
-	collapse_full_of_compound();
-	collapse_compound_extreme();
-	collapse_fork();
-	collapse_fork_compound();
-	collapse_max_ptes_shared();
+
+	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
+		struct collapse_context *c = &contexts[i];
+
+		printf("\n*** Testing context: %s ***\n", c->name);
+		collapse_full(c);
+		collapse_empty(c);
+		collapse_single_pte_entry(c);
+		collapse_max_ptes_none(c);
+		collapse_swapin_single_pte(c);
+		collapse_max_ptes_swap(c);
+		collapse_single_pte_entry_compound(c);
+		collapse_full_of_compound(c);
+		collapse_compound_extreme(c);
+		collapse_fork(c);
+		collapse_fork_compound(c);
+		collapse_max_ptes_shared(c);
+	}
 
 	restore_settings(0);
 }
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-14 18:06 ` [PATCH v2 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
  2022-04-15  0:04 ` [PATCH v2 00/12] mm: userspace hugepage collapse Peter Xu
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Add MADV_COLLAPSE selftests.  Extend struct collapse_context to support
context initialization/cleanup.  This is used by the madvise collapse
context to "disable" and "enable" khugepaged, since it would otherwise
interfere with the tests.

The mechanism used to "disable" khugepaged is a hack: it sets
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to a
large value and feeds khugepaged enough suitable VMAs/pages to keep
khugepaged sleeping for the duration of the madvise collapse tests.
Since khugepaged is woken when this file is written, enough VMAs must be
queued to put khugepaged back to sleep when the tests write to this file
in write_settings().

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 133 ++++++++++++++++++++++--
 1 file changed, 125 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index c59d832fee96..e0ccc9443f78 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -14,17 +14,23 @@
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
 
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
 static unsigned long page_size;
 static int hpage_pmd_nr;
+static int num_khugepaged_wakeups;
 
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
 struct collapse_context {
 	const char *name;
+	bool (*init_context)(void);
+	bool (*cleanup_context)(void);
 	void (*collapse)(const char *msg, char *p, bool expect);
 	bool enforce_pte_scan_limits;
 };
@@ -264,6 +270,17 @@ static void write_num(const char *name, unsigned long num)
 	}
 }
 
+/*
+ * Use this macro instead of write_settings() inside tests; it should be
+ * called at most once per callsite.
+ *
+ * Hack to statically count the number of times khugepaged is woken up
+ * due to writes to
+ * /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs;
+ * the count is tracked via __COUNTER__.
+ */
+#define WRITE_SETTINGS(s) do { __COUNTER__; write_settings(s); } while (0)
+
 static void write_settings(struct settings *settings)
 {
 	struct khugepaged_settings *khugepaged = &settings->khugepaged;
@@ -332,7 +349,7 @@ static void adjust_settings(void)
 {
 
 	printf("Adjust settings...");
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 	success("OK");
 }
 
@@ -440,20 +457,25 @@ static bool check_swap(void *addr, unsigned long size)
 	return swap;
 }
 
-static void *alloc_mapping(void)
+static void *alloc_mapping_at(void *at, size_t size)
 {
 	void *p;
 
-	p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
-	if (p != BASE_ADDR) {
-		printf("Failed to allocate VMA at %p\n", BASE_ADDR);
+	p = mmap(at, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE,
+		 -1, 0);
+	if (p != at) {
+		printf("Failed to allocate VMA at %p\n", at);
 		exit(EXIT_FAILURE);
 	}
 
 	return p;
 }
 
+static void *alloc_mapping(void)
+{
+	return alloc_mapping_at(BASE_ADDR, hpage_pmd_size);
+}
+
 static void fill_memory(int *p, unsigned long start, unsigned long end)
 {
 	int i;
@@ -573,7 +595,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	void *p;
 
 	settings.khugepaged.max_ptes_none = max_ptes_none;
-	write_settings(&settings);
+	WRITE_SETTINGS(&settings);
 
 	p = alloc_mapping();
 
@@ -591,7 +613,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	}
 
 	munmap(p, hpage_pmd_size);
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 }
 
 static void collapse_swapin_single_pte(struct collapse_context *context)
@@ -947,6 +969,87 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse(const char *msg, char *p, bool expect)
+{
+	int ret;
+
+	printf("%s...", msg);
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (((bool)ret) == expect)
+		fail("Fail: Bad return value");
+	else if (check_huge(p) != expect)
+		fail("Fail: check_huge()");
+	else
+		success("OK");
+}
+
+static struct khugepaged_disable_state {
+	void *p;
+	size_t map_size;
+} khugepaged_disable_state;
+
+static bool disable_khugepaged(void)
+{
+	/*
+	 * Hack to "disable" khugepaged by setting
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs to some large
+	 * value, then feeding it enough suitable VMAs to scan and subsequently
+	 * sleep.
+	 *
+	 * khugepaged is woken up on writes to
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs, so care must
+	 * be taken to not inadvertently wake khugepaged in these tests.
+	 *
+	 * Feed khugepaged 1 hugepage-sized VMA to scan and sleep on, then
+	 * N more for each time khugepaged would be woken up.
+	 */
+	size_t map_size = (num_khugepaged_wakeups + 1) * hpage_pmd_size;
+	void *p;
+	bool ret = true;
+	int full_scans;
+	int timeout = 6;  /* 3 seconds */
+
+	default_settings.khugepaged.scan_sleep_millisecs = 1000 * 60 * 10;
+	default_settings.khugepaged.pages_to_scan = 1;
+	write_settings(&default_settings);
+
+	p = alloc_mapping_at(((char *)BASE_ADDR) + (1UL << 30), map_size);
+	fill_memory(p, 0, map_size);
+
+	full_scans = read_num("khugepaged/full_scans") + 2;
+
+	printf("disabling khugepaged...");
+	while (timeout--) {
+		if (read_num("khugepaged/full_scans") >= full_scans) {
+			fail("Fail");
+			ret = false;
+			break;
+		}
+		printf(".");
+		usleep(TICK);
+	}
+	success("OK");
+	khugepaged_disable_state.p = p;
+	khugepaged_disable_state.map_size = map_size;
+	return ret;
+}
+
+static bool enable_khugepaged(void)
+{
+	printf("enabling khugepaged...");
+	munmap(khugepaged_disable_state.p, khugepaged_disable_state.map_size);
+	write_settings(&saved_settings);
+	success("OK");
+	return true;
+}
+
 static void khugepaged_collapse(const char *msg, char *p, bool expect)
 {
 	if (wait_for_scan(msg, p))
@@ -962,9 +1065,18 @@ int main(void)
 	struct collapse_context contexts[] = {
 		{
 			.name = "khugepaged",
+			.init_context = NULL,
+			.cleanup_context = NULL,
 			.collapse = &khugepaged_collapse,
 			.enforce_pte_scan_limits = true,
 		},
+		{
+			.name = "madvise",
+			.init_context = &disable_khugepaged,
+			.cleanup_context = &enable_khugepaged,
+			.collapse = &madvise_collapse,
+			.enforce_pte_scan_limits = false,
+		},
 	};
 	int i;
 
@@ -973,6 +1085,7 @@ int main(void)
 	page_size = getpagesize();
 	hpage_pmd_size = read_num("hpage_pmd_size");
 	hpage_pmd_nr = hpage_pmd_size / page_size;
+	num_khugepaged_wakeups = __COUNTER__;
 
 	default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
 	default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
@@ -988,6 +1101,8 @@ int main(void)
 		struct collapse_context *c = &contexts[i];
 
 		printf("\n*** Testing context: %s ***\n", c->name);
+		if (c->init_context && !c->init_context())
+			continue;
 		collapse_full(c);
 		collapse_empty(c);
 		collapse_single_pte_entry(c);
@@ -1000,6 +1115,8 @@ int main(void)
 		collapse_fork(c);
 		collapse_fork_compound(c);
 		collapse_max_ptes_shared(c);
+		if (c->cleanup_context && !c->cleanup_context())
+			break;
 	}
 
 	restore_settings(0);
-- 
2.36.0.rc0.470.gd361397f0d-goog

* [PATCH v2 12/12] selftests/vm: add test to verify recollapse of THPs
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
@ 2022-04-14 18:06 ` Zach O'Keefe
  2022-04-15  0:04 ` [PATCH v2 00/12] mm: userspace hugepage collapse Peter Xu
  12 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-14 18:06 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Zach O'Keefe

Add a selftest, specific to the madvise collapse context, that verifies
MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is
already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 32 +++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index e0ccc9443f78..c36d04218083 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -969,6 +969,32 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse_existing_thps(void)
+{
+	void *p;
+	int err;
+
+	p = alloc_mapping();
+	fill_memory(p, 0, hpage_pmd_size);
+
+	printf("Collapse fully populated PTE table...");
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (err == 0 && check_huge(p)) {
+		success("OK");
+		printf("Re-collapse PMD-mapped hugepage");
+		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+		if (err == 0 && check_huge(p))
+			success("OK");
+		else
+			fail("Fail");
+	} else {
+		fail("Fail");
+	}
+	validate_memory(p, 0, hpage_pmd_size);
+	munmap(p, hpage_pmd_size);
+}
+
 static void madvise_collapse(const char *msg, char *p, bool expect)
 {
 	int ret;
@@ -1097,6 +1123,7 @@ int main(void)
 
 	alloc_at_fault();
 
+	/* Shared tests */
 	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
 		struct collapse_context *c = &contexts[i];
 
@@ -1119,5 +1146,10 @@ int main(void)
 			break;
 	}
 
+	/* madvise-specific tests */
+	disable_khugepaged();
+	madvise_collapse_existing_thps();
+	enable_khugepaged();
+
 	restore_settings(0);
 }
-- 
2.36.0.rc0.470.gd361397f0d-goog



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-04-14 18:06 ` [PATCH v2 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
@ 2022-04-15  0:04 ` Peter Xu
  2022-04-15  0:52   ` Zach O'Keefe
  12 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2022-04-15  0:04 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Hi, Zach,

On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> process_madvise(2)
> 
> 	Performs a synchronous collapse of the native pages
> 	mapped by the list of iovecs into transparent hugepages.
> 
> 	Allocation semantics are the same as khugepaged, and depend on
> 	(1) the active sysfs settings
> 	/sys/kernel/mm/transparent_hugepage/enabled and
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> 	the VMA flags of the memory range being collapsed.
> 
> 	Collapse eligibility criteria differs from khugepaged in that
> 	the sysfs files
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> 	are ignored.

The userspace khugepaged idea definitely makes sense to me, though I'm
curious how the line is drawn on the different behaviors here by explicitly
ignoring the max_ptes_* entries.

Let's assume the initiative is to duplicate a more data-aware khugepaged in
the userspace, then IMHO it makes more sense to start with all the policies
that apply to khugepaged already, including max_ptes_*.

I can understand the willingness to provide even stronger semantics here
than khugepaged since the userspace could have very clear knowledge of how
to provision the memories (better than a kernel scanner).  It's just that
IMHO it could be slightly confusing if the new interface only partially
applies the khugepaged rules.

No strong opinion here.  It could already have been a trade-off after the
discussion from the RFC with Michal, which I read.  Just curious about how
you made that design decision so feel free to read it as a pure question.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-15  0:04 ` [PATCH v2 00/12] mm: userspace hugepage collapse Peter Xu
@ 2022-04-15  0:52   ` Zach O'Keefe
  2022-04-15 13:39     ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-15  0:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Hey Peter,

Thanks for taking the time to review!

On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Zach,
>
> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > process_madvise(2)
> >
> >       Performs a synchronous collapse of the native pages
> >       mapped by the list of iovecs into transparent hugepages.
> >
> >       Allocation semantics are the same as khugepaged, and depend on
> >       (1) the active sysfs settings
> >       /sys/kernel/mm/transparent_hugepage/enabled and
> >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> >       the VMA flags of the memory range being collapsed.
> >
> >       Collapse eligibility criteria differs from khugepaged in that
> >       the sysfs files
> >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> >       are ignored.
>
> The userspace khugepaged idea definitely makes sense to me, though I'm
> curious how the line is drawn on the different behaviors here by explicitly
> ignoring the max_ptes_* entries.
>
> Let's assume the initiative is to duplicate a more data-aware khugepaged in
> the userspace, then IMHO it makes more sense to start with all the policies
> that apply to khugepaged already, including max_ptes_*.
>
> I can understand the willingness to provide even stronger semantics here
> than khugepaged since the userspace could have very clear knowledge of how
> to provision the memories (better than a kernel scanner).  It's just that
> IMHO it could be slightly confusing if the new interface only partially
> applies the khugepaged rules.
>
> No strong opinion here.  It could already have been a trade-off after the
> discussion from the RFC with Michal, which I read.  Just curious about how
> you made that design decision so feel free to read it as a pure question.
>

Understand your point here. The allocation and max_ptes_* semantics are
split between khugepaged-like and fault-like, respectively - which
could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
flag to control the former's behavior, but agreed to keep things
simple to start, and expand the interface if/when necessary. I opted
to ignore max_ptes_* as the default since I envisioned that early
adopters would "just want it to work". One such example would be
backing executable text by hugepages on program load when many pages
haven't been demand-paged in yet.
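
To make that concrete, a minimal sketch of the text-backing case (note
file-backed memory isn't collapsible with this series yet, so this is
where things could go once that lands; the MADV_COLLAPSE value and the
GNU ld text-segment symbols are assumptions of the sketch):

	#include <stdint.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value proposed by this series */
	#endif

	/* Conventional GNU ld symbols bounding the text segment. */
	extern char __executable_start[], etext[];

	static void collapse_text(void)
	{
		uintptr_t hpage = 2UL << 20;	/* assume 2MiB PMD-size THPs */
		uintptr_t start, end;

		/* Round inward to hugepage-aligned/sized boundaries. */
		start = ((uintptr_t)__executable_start + hpage - 1) & ~(hpage - 1);
		end = (uintptr_t)etext & ~(hpage - 1);
		if (end > start)
			/* Best effort: most text may not be faulted in yet. */
			madvise((void *)start, end - start, MADV_COLLAPSE);
	}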

What do you think?

Thanks,
Zach

> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-15  0:52   ` Zach O'Keefe
@ 2022-04-15 13:39     ` Peter Xu
  2022-04-15 20:04       ` Zach O'Keefe
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2022-04-15 13:39 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> Hey Peter,
> 
> Thanks for taking the time to review!
> 
> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Zach,
> >
> > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > process_madvise(2)
> > >
> > >       Performs a synchronous collapse of the native pages
> > >       mapped by the list of iovecs into transparent hugepages.
> > >
> > >       Allocation semantics are the same as khugepaged, and depend on
> > >       (1) the active sysfs settings
> > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > >       the VMA flags of the memory range being collapsed.
> > >
> > >       Collapse eligibility criteria differs from khugepaged in that
> > >       the sysfs files
> > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > >       are ignored.
> >
> > The userspace khugepaged idea definitely makes sense to me, though I'm
> > curious how the line is drawn on the different behaviors here by explicitly
> > ignoring the max_ptes_* entries.
> >
> > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > the userspace, then IMHO it makes more sense to start with all the policies
> > that apply to khugepaged already, including max_ptes_*.
> >
> > I can understand the willingness to provide even stronger semantics here
> > than khugepaged since the userspace could have very clear knowledge of how
> > to provision the memories (better than a kernel scanner).  It's just that
> > IMHO it could be slightly confusing if the new interface only partially
> > applies the khugepaged rules.
> >
> > No strong opinion here.  It could already have been a trade-off after the
> > discussion from the RFC with Michal, which I read.  Just curious about how
> > you made that design decision so feel free to read it as a pure question.
> >
> 
> Understand your point here. The allocation and max_ptes_* semantics are
> split between khugepaged-like and fault-like, respectively - which
> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> flag to control the former's behavior, but agreed to keep things
> simple to start, and expand the interface if/when necessary. I opted
> to ignore max_ptes_* as the default since I envisioned that early
> adopters would "just want it to work". One such example would be
> backing executable text by hugepages on program load when many pages
> haven't been demand-paged in yet.
> 
> What do you think?

I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
blurred.

To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
khugepaged on this range, and with current thread context".  IMHO any
feature bits can then supplement special needs, and I'll take the thp
backing executable example to be one of the (good?) reasons we'd need an
extra flag for ignoring the max_ptes_* knobs.

So personally, if I were you, maybe I'd start with that simple scheme
(even if it won't immediately serve a use case) but then add either the
defrag or ignore_max_ptes_* as feature bits later on, with clear use case
descriptions about why we need each of the feature flags.  IMHO numbers
would be even more helpful when there are specific use cases to show.

Or, perhaps you think all potential MADV_COLLAPSE users should literally
skip max_ptes_* limitations always?
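
To make the feature-bit idea concrete, something like the below --
everything named here is hypothetical (the flag doesn't exist, and
process_madvise(2) currently requires flags == 0):

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25		/* value proposed by this series */
	#endif
	/* Hypothetical opt-in to bypass the khugepaged max_ptes_* limits. */
	#define PMADV_F_COLLAPSE_IGNORE_LIMITS	(1U << 0)

	static ssize_t collapse_ignore_limits(int pidfd, void *addr, size_t len)
	{
		struct iovec iov = { .iov_base = addr, .iov_len = len };

		/* No glibc wrapper for process_madvise(2); go via syscall(2). */
		return syscall(SYS_process_madvise, pidfd, &iov, 1,
			       MADV_COLLAPSE, PMADV_F_COLLAPSE_IGNORE_LIMITS);
	}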

Anyway, I won't pretend I am an expert in this area. :) So please take that
with a grain of salt.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-15 13:39     ` Peter Xu
@ 2022-04-15 20:04       ` Zach O'Keefe
  2022-04-16 19:26         ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-15 20:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > Hey Peter,
> >
> > Thanks for taking the time to review!
> >
> > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Zach,
> > >
> > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > process_madvise(2)
> > > >
> > > >       Performs a synchronous collapse of the native pages
> > > >       mapped by the list of iovecs into transparent hugepages.
> > > >
> > > >       Allocation semantics are the same as khugepaged, and depend on
> > > >       (1) the active sysfs settings
> > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > >       the VMA flags of the memory range being collapsed.
> > > >
> > > >       Collapse eligibility criteria differs from khugepaged in that
> > > >       the sysfs files
> > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > >       are ignored.
> > >
> > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > curious how the line is drawn on the different behaviors here by explicitly
> > > ignoring the max_ptes_* entries.
> > >
> > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > the userspace, then IMHO it makes more sense to start with all the policies
> > > that apply to khugepaged already, including max_ptes_*.
> > >
> > > I can understand the willingness to provide even stronger semantics here
> > > than khugepaged since the userspace could have very clear knowledge of how
> > > to provision the memories (better than a kernel scanner).  It's just that
> > > IMHO it could be slightly confusing if the new interface only partially
> > > applies the khugepaged rules.
> > >
> > > No strong opinion here.  It could already have been a trade-off after the
> > > discussion from the RFC with Michal, which I read.  Just curious about how
> > > you made that design decision so feel free to read it as a pure question.
> > >
> >
> > Understand your point here. The allocation and max_ptes_* semantics are
> > split between khugepaged-like and fault-like, respectively - which
> > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > flag to control the former's behavior, but agreed to keep things
> > simple to start, and expand the interface if/when necessary. I opted
> > to ignore max_ptes_* as the default since I envisioned that early
> > adopters would "just want it to work". One such example would be
> > backing executable text by hugepages on program load when many pages
> > haven't been demand-paged in yet.
> >
> > What do you think?
>
> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> blurred.
>
> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> khugepaged on this range, and with current thread context".  IMHO any
> feature bits can then supplement special needs, and I'll take the thp
> backing executable example to be one of the (good?) reasons we'd need an
> extra flag for ignoring the max_ptes_* knobs.
>
> So personally, if I were you, maybe I'd start with that simple scheme
> (even if it won't immediately serve a use case) but then add either the
> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> descriptions about why we need each of the feature flags.  IMHO numbers
> would be even more helpful when there are specific use cases to show.
>
> Or, perhaps you think all potential MADV_COLLAPSE users should literally
> skip max_ptes_* limitations always?
>

Thanks for your time and valuable feedback here, Peter. I had a response typed
up, but after a few iterations became increasingly unsatisfied with my
own response.

I think this feature should be able to stand on its own without
consideration of a userspace khugepaged, as we have existing concrete
examples where it would be useful. In these cases, and I assume almost
all other use-cases outside userspace khugepaged, max_ptes_* should be
ignored as the fundamental assumption of MADV_COLLAPSE is that the
user knows better, and IMHO, khugepaged heuristics shouldn't tell
users they are wrong.

But this, as you mention, unsatisfactorily blurs the semantics of
MADV_COLLAPSE: "act like khugepaged here, but not here".

As such, WDYT about the reverse-side of the coin of what you proposed:
to not couple the default behavior of MADV_COLLAPSE with khugepaged at
all? I.e. Not tie the allocation semantics to
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
flags as necessary when/if a reimplementation of khugepaged in
userspace proves fruitful.

Thanks for your time and input,
Zach



> Anyway, I won't pretend I am an expert in this area. :) So please take that
> with a grain of salt.
>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-15 20:04       ` Zach O'Keefe
@ 2022-04-16 19:26         ` Peter Xu
  2022-04-17 17:23           ` Zach O'Keefe
  2022-04-19 14:33           ` David Hildenbrand
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Xu @ 2022-04-16 19:26 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Hi, Zach,

On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > > Hey Peter,
> > >
> > > Thanks for taking the time to review!
> > >
> > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, Zach,
> > > >
> > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > > process_madvise(2)
> > > > >
> > > > >       Performs a synchronous collapse of the native pages
> > > > >       mapped by the list of iovecs into transparent hugepages.
> > > > >
> > > > >       Allocation semantics are the same as khugepaged, and depend on
> > > > >       (1) the active sysfs settings
> > > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > > >       the VMA flags of the memory range being collapsed.
> > > > >
> > > > >       Collapse eligibility criteria differs from khugepaged in that
> > > > >       the sysfs files
> > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > > >       are ignored.
> > > >
> > > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > > curious how the line is drawn on the different behaviors here by explicitly
> > > > ignoring the max_ptes_* entries.
> > > >
> > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > > the userspace, then IMHO it makes more sense to start with all the policies
> > > > that apply to khugepaged already, including max_ptes_*.
> > > >
> > > > I can understand the willingness to provide even stronger semantics here
> > > > than khugepaged since the userspace could have very clear knowledge of how
> > > > to provision the memories (better than a kernel scanner).  It's just that
> > > > IMHO it could be slightly confusing if the new interface only partially
> > > > applies the khugepaged rules.
> > > >
> > > > No strong opinion here.  It could already have been a trade-off after the
> > > > discussion from the RFC with Michal, which I read.  Just curious about how
> > > > you made that design decision so feel free to read it as a pure question.
> > > >
> > >
> > > Understand your point here. The allocation and max_ptes_* semantics are
> > > split between khugepaged-like and fault-like, respectively - which
> > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > > flag to control the former's behavior, but agreed to keep things
> > > simple to start, and expand the interface if/when necessary. I opted
> > > to ignore max_ptes_* as the default since I envisioned that early
> > > adopters would "just want it to work". One such example would be
> > > backing executable text by hugepages on program load when many pages
> > > haven't been demand-paged in yet.
> > >
> > > What do you think?
> >
> > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> > blurred.
> >
> > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> > khugepaged on this range, and with current thread context".  IMHO any
> > feature bits can then supplement special needs, and I'll take the thp
> > backing executable example to be one of the (good?) reasons we'd need an
> > extra flag for ignoring the max_ptes_* knobs.
> >
> > So personally, if I were you, maybe I'd start with that simple scheme
> > (even if it won't immediately serve a use case) but then add either the
> > defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> > descriptions about why we need each of the feature flags.  IMHO numbers
> > would be even more helpful when there are specific use cases to show.
> >
> > Or, perhaps you think all potential MADV_COLLAPSE users should literally
> > skip max_ptes_* limitations always?
> >
> 
> Thanks for your time and valuable feedback here, Peter. I had a response typed
> up, but after a few iterations became increasingly unsatisfied with my
> own response.
> 
> I think this feature should be able to stand on its own without
> consideration of a userspace khugepaged, as we have existing concrete
> examples where it would be useful. In these cases, and I assume almost
> all other use-cases outside userspace khugepaged, max_ptes_* should be
> ignored as the fundamental assumption of MADV_COLLAPSE is that the
> user knows better, and IMHO, khugepaged heuristics shouldn't tell
> users they are wrong.

Valid point.  And actually right after I replied I thought similarly on
whether we need to connect the two interfaces at all.

It's just that it's very easy to think like that after reading the cover
letter since that's exactly what it is comparing to. :)

There's definitely a different view on the user/kernel level of things, so
it sounds reasonable to me that if we add a new interface, it by default has
stronger semantics; otherwise we may not bother, given MADV_HUGEPAGE's
existence.

So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
as you said.  And that's a real pure question I asked above, and I feel
like your answer is actually "yes", we should always ignore the max_ptes_*
fields until there's proof that it'll be helpful.

> 
> But this, as you mention, unsatisfactorily blurs the semantics of
> MADV_COLLAPSE: "act like khugepaged here, but not here".
> 
> As such, WDYT about the reverse-side of the coin of what you proposed:
> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> all? I.e. Not tie the allocation semantics to
> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> flags as necessary when/if a reimplementation of khugepaged in
> userspace proves fruitful.

Let's see whether others have thoughts, but what you proposed here makes
sense to me.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-16 19:26         ` Peter Xu
@ 2022-04-17 17:23           ` Zach O'Keefe
  2022-04-19 14:33           ` David Hildenbrand
  1 sibling, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-17 17:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Sat, Apr 16, 2022 at 12:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Zach,
>
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> > On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> > > > Hey Peter,
> > > >
> > > > Thanks for taking the time to review!
> > > >
> > > > On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Hi, Zach,
> > > > >
> > > > > On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> > > > > > process_madvise(2)
> > > > > >
> > > > > >       Performs a synchronous collapse of the native pages
> > > > > >       mapped by the list of iovecs into transparent hugepages.
> > > > > >
> > > > > >       Allocation semantics are the same as khugepaged, and depend on
> > > > > >       (1) the active sysfs settings
> > > > > >       /sys/kernel/mm/transparent_hugepage/enabled and
> > > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> > > > > >       the VMA flags of the memory range being collapsed.
> > > > > >
> > > > > >       Collapse eligibility criteria differs from khugepaged in that
> > > > > >       the sysfs files
> > > > > >       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> > > > > >       are ignored.
> > > > >
> > > > > The userspace khugepaged idea definitely makes sense to me, though I'm
> > > > > curious how the line is drawn on the different behaviors here by explicitly
> > > > > ignoring the max_ptes_* entries.
> > > > >
> > > > > Let's assume the initiative is to duplicate a more data-aware khugepaged in
> > > > > the userspace, then IMHO it makes more sense to start with all the policies
> > > > > that apply to khugepaged already, including max_ptes_*.
> > > > >
> > > > > I can understand the willingness to provide even stronger semantics here
> > > > > than khugepaged since the userspace could have very clear knowledge of how
> > > > > to provision the memories (better than a kernel scanner).  It's just that
> > > > > IMHO it could be slightly confusing if the new interface only partially
> > > > > applies the khugepaged rules.
> > > > >
> > > > > No strong opinion here.  It could already have been a trade-off after the
> > > > > discussion from the RFC with Michal, which I read.  Just curious about how
> > > > > you made that design decision so feel free to read it as a pure question.
> > > > >
> > > >
> > > > Understand your point here. The allocation and max_ptes_* semantics are
> > > > split between khugepaged-like and fault-like, respectively - which
> > > > could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> > > > flag to control the former's behavior, but agreed to keep things
> > > > simple to start, and expand the interface if/when necessary. I opted
> > > > to ignore max_ptes_* as the default since I envisioned that early
> > > > adopters would "just want it to work". One such example would be
> > > > backing executable text by hugepages on program load when many pages
> > > > haven't been demand-paged in yet.
> > > >
> > > > What do you think?
> > >
> > > I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> > > blurred.
> > >
> > > To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> > > khugepaged on this range, and with current thread context".  IMHO any
> > > feature bits can then supplement special needs, and I'll take the thp
> > > backing executable example to be one of the (good?) reasons we'd need an
> > > extra flag for ignoring the max_ptes_* knobs.
> > >
> > > So personally, if I were you, maybe I'd start with that simple scheme
> > > (even if it won't immediately serve a use case) but then add either the
> > > defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> > > descriptions about why we need each of the feature flags.  IMHO numbers
> > > would be even more helpful when there are specific use cases to show.
> > >
> > > Or, perhaps you think all potential MADV_COLLAPSE users should literally
> > > skip max_ptes_* limitations always?
> > >
> >
> > Thanks for your time and valuable feedback here, Peter. I had a response typed
> > up, but after a few iterations became increasingly unsatisfied with my
> > own response.
> >
> > I think this feature should be able to stand on its own without
> > consideration of a userspace khugepaged, as we have existing concrete
> > examples where it would be useful. In these cases, and I assume almost
> > all other use-cases outside userspace khugepaged, max_ptes_* should be
> > ignored as the fundamental assumption of MADV_COLLAPSE is that the
> > user knows better, and IMHO, khugepaged heuristics shouldn't tell
> > users they are wrong.
>
> Valid point.  And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all.
>
> It's just that it's very easy to think like that after reading the cover
> letter since that's exactly what it is comparing to. :)
>

Yes, this is my fault :) After others have had a chance to review and
align, I'll include the immediate use cases in the v3 cover letter as
well, rather than burying them deep in individual patch messages.

> There's definitely a different view on the user/kernel level of things, so
> it sounds reasonable to me that if we add a new interface, it by default has
> stronger semantics; otherwise we may not bother, given MADV_HUGEPAGE's
> existence.
>

Yes, good point.

> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said.  And that's a real pure question I asked above, and I feel
> like your answer is actually "yes", we should always ignore the max_ptes_*
> fields until there's proof that it'll be helpful.
>
> >
> > But this, as you mention, unsatisfactorily blurs the semantics of
> > MADV_COLLAPSE: "act like khugepaged here, but not here".
> >
> > As such, WDYT about the reverse-side of the coin of what you proposed:
> > to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> > all? I.e. Not tie the allocation semantics to
> > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> > flags as necessary when/if a reimplementation of khugepaged in
> > userspace proves fruitful.
>
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.
>

Great! Sounds good to me. Thank you again for your time, questions,
and feedback!

Best,

Zach

> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-16 19:26         ` Peter Xu
  2022-04-17 17:23           ` Zach O'Keefe
@ 2022-04-19 14:33           ` David Hildenbrand
  2022-04-19 15:55             ` Zach O'Keefe
  1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-04-19 14:33 UTC (permalink / raw)
  To: Peter Xu, Zach O'Keefe
  Cc: Alex Shi, David Rientjes, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On 16.04.22 21:26, Peter Xu wrote:
> Hi, Zach,
> 
> On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
>> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
>>>> Hey Peter,
>>>>
>>>> Thanks for taking the time to review!
>>>>
>>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> Hi, Zach,
>>>>>
>>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
>>>>>> process_madvise(2)
>>>>>>
>>>>>>       Performs a synchronous collapse of the native pages
>>>>>>       mapped by the list of iovecs into transparent hugepages.
>>>>>>
>>>>>>       Allocation semantics are the same as khugepaged, and depend on
>>>>>>       (1) the active sysfs settings
>>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
>>>>>>       the VMA flags of the memory range being collapsed.
>>>>>>
>>>>>>       Collapse eligibility criteria differs from khugepaged in that
>>>>>>       the sysfs files
>>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
>>>>>>       are ignored.
>>>>>
>>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
>>>>> curious how the line is drawn on the different behaviors here by explicitly
>>>>> ignoring the max_ptes_* entries.
>>>>>
>>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
>>>>> the userspace, then IMHO it makes more sense to start with all the policies
>>>>> that apply to khugepaged already, including max_ptes_*.
>>>>>
>>>>> I can understand the willingness to provide even stronger semantics here
>>>>> than khugepaged since the userspace could have very clear knowledge of how
>>>>> to provision the memories (better than a kernel scanner).  It's just that
>>>>> IMHO it could be slightly confusing if the new interface only partially
>>>>> applies the khugepaged rules.
>>>>>
>>>>> No strong opinion here.  It could already have been a trade-off after the
>>>>> discussion from the RFC with Michal, which I read.  Just curious about how
>>>>> you made that design decision so feel free to read it as a pure question.
>>>>>
>>>>
>>>> Understand your point here. The allocation and max_ptes_* semantics are
>>>> split between khugepaged-like and fault-like, respectively - which
>>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
>>>> flag to control the former's behavior, but agreed to keep things
>>>> simple to start, and expand the interface if/when necessary. I opted
>>>> to ignore max_ptes_* as the default since I envisioned that early
>>>> adopters would "just want it to work". One such example would be
>>>> backing executable text by hugepages on program load when many pages
>>>> haven't been demand-paged in yet.
>>>>
>>>> What do you think?
>>>
>>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
>>> blurred.
>>>
>>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
>>> khugepaged on this range, and with current thread context".  IMHO any
>>> feature bits can then supplement special needs, and I'll take the thp
>>> backing executable example to be one of the (good?) reasons we'd need an
>>> extra flag for ignoring the max_ptes_* knobs.
>>>
>>> So personally, if I were you, maybe I'd start with that simple scheme
>>> (even if it won't immediately serve a use case) but then add either the
>>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
>>> descriptions about why we need each of the feature flags.  IMHO numbers
>>> would be even more helpful when there are specific use cases to show.
>>>
>>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
>>> skip max_ptes_* limitations always?
>>>
>>
>> Thanks for your time and valuable feedback here, Peter. I had a response typed
>> up, but after a few iterations became increasingly unsatisfied with my
>> own response.
>>
>> I think this feature should be able to stand on its own without
>> consideration of a userspace khugepaged, as we have existing concrete
>> examples where it would be useful. In these cases, and I assume almost
>> all other use-cases outside userspace khugepaged, max_ptes_* should be
>> ignored as the fundamental assumption of MADV_COLLAPSE is that the
>> user knows better, and IMHO, khugepaged heuristics shouldn't tell
>> users they are wrong.
> 
> Valid point.  And actually right after I replied I thought similarly on
> whether we need to connect the two interfaces at all.
> 
> It's just that it's very easy to think like that after reading the cover
> letter since that's exactly what it is comparing to. :)
> 
> There's definitely a different view on the user/kernel level of things, so
> it sounds reasonable to me that if we add a new interface, it by default has
> stronger semantics; otherwise we may not bother, given MADV_HUGEPAGE's
> existence.
> 
> So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> as you said.  And that's a real pure question I asked above, and I feel
> like your answer is actually "yes", we should always ignore the max_ptes_*
> fields until there's proof that it'll be helpful.
> 
>>
>> But this, as you mention, unsatisfactorily blurs the semantics of
>> MADV_COLLAPSE: "act like khugepaged here, but not here".
>>
>> As such, WDYT about the reverse-side of the coin of what you proposed:
>> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
>> all? I.e. Not tie the allocation semantics to
>> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
>> flags as necessary when/if a reimplementation of khugepaged in
>> userspace proves fruitful.
> 
> Let's see whether others have thoughts, but what you proposed here makes
> sense to me.


Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as a
tunable because we don't know what user space is up to.

E.g., with a very sparse memory layout, we don't want to waste
memory by allocating memory where we actually have no page populated yet
-- could be user space won't reuse that memory in the foreseeable
future. With too many swap entries, we don't want to trigger an
eventually unnecessary overhead of swapping in entries if user space
won't access them in the foreseeable future. Something similar applies
to max_ptes_shared, where one might just end up wasting a lot of memory
eventually in some applications.

So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
try figuring out what user space might be doing. We know exactly what
user space asks for -- and that can be documented properly.
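
To make those knobs concrete, they're just three sysfs files; a quick
sketch that dumps what khugepaged consults and MADV_COLLAPSE would
bypass (the defaults in the comment assume 2MiB PMDs, i.e. 512 PTEs):

	#include <stdio.h>

	static long read_knob(const char *name)
	{
		char path[256];
		long val = -1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/khugepaged/%s",
			 name);
		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%ld", &val) != 1)
				val = -1;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		/* Typical defaults: 511, 64 and 256 respectively. */
		printf("max_ptes_none:   %ld\n", read_knob("max_ptes_none"));
		printf("max_ptes_swap:   %ld\n", read_knob("max_ptes_swap"));
		printf("max_ptes_shared: %ld\n", read_knob("max_ptes_shared"));
		return 0;
	}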

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-19 14:33           ` David Hildenbrand
@ 2022-04-19 15:55             ` Zach O'Keefe
  2022-04-19 20:02               ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-19 15:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Alex Shi, David Rientjes, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, Apr 19, 2022 at 7:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.04.22 21:26, Peter Xu wrote:
> > Hi, Zach,
> >
> > On Fri, Apr 15, 2022 at 01:04:04PM -0700, Zach O'Keefe wrote:
> >> On Fri, Apr 15, 2022 at 6:39 AM Peter Xu <peterx@redhat.com> wrote:
> >>>
> >>> On Thu, Apr 14, 2022 at 05:52:43PM -0700, Zach O'Keefe wrote:
> >>>> Hey Peter,
> >>>>
> >>>> Thanks for taking the time to review!
> >>>>
> >>>> On Thu, Apr 14, 2022 at 5:04 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>>
> >>>>> Hi, Zach,
> >>>>>
> >>>>> On Thu, Apr 14, 2022 at 11:06:00AM -0700, Zach O'Keefe wrote:
> >>>>>> process_madvise(2)
> >>>>>>
> >>>>>>       Performs a synchronous collapse of the native pages
> >>>>>>       mapped by the list of iovecs into transparent hugepages.
> >>>>>>
> >>>>>>       Allocation semantics are the same as khugepaged, and depend on
> >>>>>>       (1) the active sysfs settings
> >>>>>>       /sys/kernel/mm/transparent_hugepage/enabled and
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2)
> >>>>>>       the VMA flags of the memory range being collapsed.
> >>>>>>
> >>>>>>       Collapse eligibility criteria differs from khugepaged in that
> >>>>>>       the sysfs files
> >>>>>>       /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
> >>>>>>       are ignored.
> >>>>>
> >>>>> The userspace khugepaged idea definitely makes sense to me, though I'm
> >>>>> curious how the line is drawn on the different behaviors here by explicitly
> >>>>> ignoring the max_ptes_* entries.
> >>>>>
> >>>>> Let's assume the initiative is to duplicate a more data-aware khugepaged in
> >>>>> the userspace, then IMHO it makes more sense to start with all the policies
> >>>>> that apply to khugepaged already, including max_ptes_*.
> >>>>>
> >>>>> I can understand the willingness to provide even stronger semantics here
> >>>>> than khugepaged since the userspace could have very clear knowledge of how
> >>>>> to provision the memories (better than a kernel scanner).  It's just that
> >>>>> IMHO it could be slightly confusing if the new interface only partially
> >>>>> applies the khugepaged rules.
> >>>>>
> >>>>> No strong opinion here.  It could already have been a trade-off after the
> >>>>> discussion from the RFC with Michal, which I read.  Just curious about how
> >>>>> you made that design decision so feel free to read it as a pure question.
> >>>>>
> >>>>
> >>>> Understand your point here. The allocation and max_ptes_* semantics are
> >>>> split between khugepaged-like and fault-like, respectively - which
> >>>> could be confusing. Originally, I proposed a MADV_F_COLLAPSE_LIMITS
> >>>> flag to control the former's behavior, but agreed to keep things
> >>>> simple to start, and expand the interface if/when necessary. I opted
> >>>> to ignore max_ptes_* as the default since I envisioned that early
> >>>> adopters would "just want it to work". One such example would be
> >>>> backing executable text by hugepages on program load when many pages
> >>>> haven't been demand-paged in yet.
> >>>>
> >>>> What do you think?
> >>>
> >>> I'm just slightly worried that'll make the default MADV_COLLAPSE semantics
> >>> blurred.
> >>>
> >>> To me, a clean default definition for MADV_COLLAPSE would be nice, as "do
> >>> khugepaged on this range, and with current thread context".  IMHO any
> >>> feature bits can then supplement special needs, and I'll take the thp
> >>> backing executable example to be one of the (good?) reasons we'd need an
> >>> extra flag for ignoring the max_ptes_* knobs.
> >>>
> >>> So personally, if I were you, maybe I'd start with that simple scheme
> >>> (even if it won't immediately serve a use case) but then add either the
> >>> defrag or ignore_max_ptes_* as feature bits later on, with clear use case
> >>> descriptions about why we need each of the feature flags.  IMHO numbers
> >>> would be even more helpful when there are specific use cases to show.
> >>>
> >>> Or, perhaps you think all potential MADV_COLLAPSE users should literally
> >>> skip max_ptes_* limitations always?
> >>>
> >>
> >> Thanks for your time and valuable feedback here, Peter. I had a response typed
> >> up, but after a few iterations became increasingly unsatisfied with my
> >> own response.
> >>
> >> I think this feature should be able to stand on its own without
> >> consideration of a userspace khugepaged, as we have existing concrete
> >> examples where it would be useful. In these cases, and I assume almost
> >> all other use-cases outside userspace khugepaged, max_ptes_* should be
> >> ignored as the fundamental assumption of MADV_COLLAPSE is that the
> >> user knows better, and IMHO, khugepaged heuristics shouldn't tell
> >> users they are wrong.
> >
> > Valid point.  And actually right after I replied I thought similarly on
> > whether we need to connect the two interfaces at all.
> >
> > It's just that it's very easy to think like that after reading the cover
> > letter since that's exactly what it is comparing to. :)
> >
> > There's definitely a different view on the user/kernel level of things, so
> > it sounds reasonable to me that if we add a new interface, it by default has
> > stronger semantics; otherwise we may not bother, given MADV_HUGEPAGE's
> > existence.
> >
> > So maybe max_ptes_* won't even make sense for MADV_COLLAPSE in most cases
> > as you said.  And that's a real pure question I asked above, and I feel
> > like your answer is actually "yes", we should always ignore the max_ptes_*
> > fields until there's proof that it'll be helpful.
> >
> >>
> >> But this, as you mention, unsatisfactorily blurs the semantics of
> >> MADV_COLLAPSE: "act like khugepaged here, but not here".
> >>
> >> As such, WDYT about the reverse-side of the coin of what you proposed:
> >> to not couple the default behavior of MADV_COLLAPSE with khugepaged at
> >> all? I.e. Not tie the allocation semantics to
> >> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag. We can add
> >> flags as necessary when/if a reimplementation of khugepaged in
> >> userspace proves fruitful.
> >
> > Let's see whether others have thoughts, but what you proposed here makes
> > sense to me.
>
>
Hey David - thanks again for taking the time to read and comment.

> Makes sense to me. IIUC, the whole handling of max_ptes_* is in place as
>  tunable because we don't know what user space is up to.
>

Agreed.

> E.g., have with a very sparse memory layout, we don't want to waste
> memory by allocating memory where we actually have no page populated yet
> -- could be user space won't reuse that memory in the foreseeable
> future. With too many swap entries, we don't want to trigger an
> eventually unnecessary overhead of swapping in entries if user space
> won't access them in the foreseeable future. Something similar applies
> to max_ptes_shared, where one might just end up wasting a lot of memory
> eventually in some applications.
>
> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> try figuring out what user space might be doing. We know exactly what
> user space asks for -- and that can be documented properly.
>

Sounds good to me. Would you also be in favor of decoupling allocation
semantics from khugepaged? I.e. we'll pick some default gfp flags and
not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
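
For concreteness, khugepaged's side of that coupling in mm/khugepaged.c
is roughly the first helper below; a decoupled MADV_COLLAPSE could just
pick a fixed mask, sketched (hypothetically) as the second:

	/* Roughly what mm/khugepaged.c does today: the defrag knob picks
	 * between allowing direct reclaim/compaction or not. */
	static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
	{
		return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
	}

	/* Hypothetical decoupled default: the caller asked for a synchronous
	 * collapse, so always allow reclaim/compaction. */
	static inline gfp_t alloc_hugepage_collapse_gfpmask(void)
	{
		return GFP_TRANSHUGE;
	}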

Thanks again,

Zach

> --
> Thanks,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-19 15:55             ` Zach O'Keefe
@ 2022-04-19 20:02               ` David Hildenbrand
  2022-04-19 22:42                 ` Zach O'Keefe
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-04-19 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Peter Xu, Alex Shi, David Rientjes, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

>> E.g., with a very sparse memory layout, we don't want to waste
>> memory by allocating memory where we actually have no page populated yet
>> -- could be user space won't reuse that memory in the foreseeable
>> future. With too many swap entries, we don't want to trigger an
>> eventually unnecessary overhead of swapping in entries if user space
>> won't access them in the foreseeable future. Something similar applies
>> to max_ptes_shared, where one might just end up wasting a lot of memory
>> eventually in some applications.
>>
>> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
>> try figuring out what user space might be doing. We know exactly what
>> user space asks for -- and that can be documented properly.
>>

Just a thought, if we ever want to implement khugepaged in user space,
it could theoretically obtain similar information using e.g., the
pagemap. It wouldn't be race-free, but the question is if it would matter.
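
Roughly, such a scanner could do something like the sketch below before
deciding whether a PMD-sized range is worth collapsing (pagemap bit 63
flags a present page):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	#define PM_PRESENT	(1ULL << 63)

	/* Count populated native pages in [start, start + pages * psz). */
	static int count_present(uintptr_t start, size_t pages, size_t psz)
	{
		int fd = open("/proc/self/pagemap", O_RDONLY);
		int present = 0;

		if (fd < 0)
			return -1;
		for (size_t i = 0; i < pages; i++) {
			uint64_t ent;
			off_t off = (start / psz + i) * sizeof(ent);

			if (pread(fd, &ent, sizeof(ent), off) != sizeof(ent))
				break;
			if (ent & PM_PRESENT)
				present++;
		}
		close(fd);
		return present;
	}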

I consider the primary use case of giving an application more precise
control over actual THP placement.

> 
> Sounds good to me. Would you also be in favor of decoupling allocation
> semantics from khugepaged? I.e. we'll pick some default gfp flags and
> not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?

Good question. It's not really a heuristic like that other stuff.

Easy answer: we're not dealing with khugepaged, so anything in
/sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?

Sure, we could have separate toggles for MADV_COLLAPSE.

Maybe we simply want a dedicated syscall where we can specify additional
options ... but maybe that simply over-complicates the problem.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-19 20:02               ` David Hildenbrand
@ 2022-04-19 22:42                 ` Zach O'Keefe
  2022-04-21  0:56                   ` Yang Shi
  0 siblings, 1 reply; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-19 22:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Alex Shi, David Rientjes, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> E.g., with a very sparse memory layout, we don't want to waste
> >> memory by allocating memory where we actually have no page populated yet
> >> -- could be user space won't reuse that memory in the foreseeable
> >> future. With too many swap entries, we don't want to trigger an
> >> eventually unnecessary overhead of swapping in entries if user space
> >> won't access them in the foreseeable future. Something similar applies
> >> to max_ptes_shared, where one might just end up wasting a lot of memory
> >> eventually in some applications.
> >>
> >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> >> try figuring out what user space might be doing. We know exactly what
> >> user space asks for -- and that can be documented properly.
> >>
>
> Just a thought, if we ever want to implement khugepaged in user space,
> it could theoretically obtain similar information using e.g., the
> pagemap. It wouldn't be race-free, but the question is if it would matter.
>
> I consider the primary use case of giving an application more precise
> control over actual THP placement.
>

Good point about the pagemap and agree about the primary use case -
I'll make that clear in v3 cover letter.

> >
> > Sounds good to me. Would you also be in favor of decoupling allocation
> > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
>
> Good question. It's not really a heuristic like that other stuff.
>
> Easy answer: we're not dealing with khugepaged, so anything in
> /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
>

That's what I'm thinking now too. If there's no objections, I'll
proceed in that direction for v3.

> Sure, we could have separate toggles for MADV_COLLAPSE.
>
> Maybe we simply want a dedicated syscall where we can specify additional
> options ... but maybe that simply over-complicates the problem.
>

Thankfully process_madvise(2) has flags, and madvise(2) users can
always migrate to using process_madvise(2) on self. Piggy-backing off
madvise infrastructure for these "non-advice actions" (e.g.
MADV_PAGEOUT) seems to be the norm.
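
Concretely, "process_madvise(2) on self" is just the below (raw
syscalls since glibc has no wrappers yet; the MADV_COLLAPSE value is
assumed from this series):

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25
	#endif

	static ssize_t collapse_self(void *addr, size_t len)
	{
		struct iovec iov = { .iov_base = addr, .iov_len = len };
		int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
		ssize_t ret;

		if (pidfd < 0)
			return -1;
		/* flags must be 0; no extra caps needed when acting on self. */
		ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
			      MADV_COLLAPSE, 0);
		close(pidfd);
		return ret;
	}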

Thanks as always for your time and thoughts!

Zach

> --
> Thanks,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-19 22:42                 ` Zach O'Keefe
@ 2022-04-21  0:56                   ` Yang Shi
  2022-04-21  1:02                     ` Zach O'Keefe
  0 siblings, 1 reply; 25+ messages in thread
From: Yang Shi @ 2022-04-21  0:56 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: David Hildenbrand, Peter Xu, Alex Shi, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, SeongJae Park,
	Song Liu, Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, Apr 19, 2022 at 3:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > >> E.g., with a very sparse memory layout, we don't want to waste
> > >> memory by allocating memory where we actually have no page populated yet
> > >> -- could be user space won't reuse that memory in the foreseeable
> > >> future. With too many swap entries, we don't want to trigger an
> > >> eventually unnecessary overhead of swapping in entries if user space
> > >> won't access them in the foreseeable future. Something similar applies
> > >> to max_ptes_shared, where one might just end up wasting a lot of memory
> > >> eventually in some applications.
> > >>
> > >> So IMHO, with MADV_COLLAPSE we should ignore/disable any heuristics that
> > >> try figuring out what user space might be doing. We know exactly what
> > >> user space asks for -- and that can be documented properly.
> > >>
> >
> > Just a thought, if we ever want to implement khugepaged in user space,
> > it could theoretically obtain similar information using e.g., the
> > pagemap. It wouldn't be race-free, but the question is if it would matter.
> >
> > I consider the primary use case of giving an application more precise
> > control over actual THP placement.
> >
>
> Good point about the pagemap and agree about the primary use case -
> I'll make that clear in v3 cover letter.
>
> > >
> > > Sounds good to me. Would you also be in favor of decoupling allocation
> > > semantics from khugepaged? I.e. we'll pick some default gfp flags and
> > > not depend on /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
> >
> > Good question. It's not really a heuristic like that other stuff.
> >
> > Easy answer: we're not dealing with khugepaged, so anything in
> > /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
> >
>
> That's what I'm thinking now too. If there's no objections, I'll
> proceed in that direction for v3.

I agree, we should not treat MADV_COLLAPSE as "userspace khugepaged"
IMHO. It is still best effort, but it is requested by the users
explicitly, so the kernel should trust the users' judgement and ignore
those max_ptes_* since we should assume the users know what they are
doing and the cost.
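
And since it's best effort, callers can verify the result themselves.
A rough sketch in the spirit of the selftest's check_huge(), though it
sums AnonHugePages process-wide rather than matching the exact range:

	#include <stdio.h>

	/* Total kB of anon memory currently backed by THPs, per smaps. */
	static long anon_huge_kb(void)
	{
		char line[256];
		long kb = 0, v;
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "AnonHugePages: %ld kB", &v) == 1)
				kb += v;
		fclose(f);
		return kb;
	}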

>
> > > Sure, we could have separate toggles for MADV_COLLAPSE.
> >
> > Maybe we simply want a dedicated syscall where we can specify additional
> > options ... but maybe that simply over-complicates the problem.
> >
>
> Thankfully process_madvise(2) has flags, and madvise(2) users can
> always migrate to using process_madvise(2) on self. Piggy-backing off
> madvise infrastructure for these "non-advice actions" (e.g.
> MADV_PAGEOUT) seems to be the norm.
>
> Thanks as always for your time and thoughts!
>
> Zach
>
> > --
> > Thanks,
> >
> > David / dhildenb
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/12] mm: userspace hugepage collapse
  2022-04-21  0:56                   ` Yang Shi
@ 2022-04-21  1:02                     ` Zach O'Keefe
  0 siblings, 0 replies; 25+ messages in thread
From: Zach O'Keefe @ 2022-04-21  1:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Peter Xu, Alex Shi, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, SeongJae Park,
	Song Liu, Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, Apr 20, 2022 at 5:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 3:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > >> E.g., with a very sparse memory layout, we don't want to waste
> > > >> memory by allocating hugepages where we actually have no pages
> > > >> populated yet -- it could be that user space won't reuse that
> > > >> memory in the foreseeable future. With too many swap entries, we
> > > >> don't want to trigger the potentially unnecessary overhead of
> > > >> swapping in entries that user space won't access in the
> > > >> foreseeable future. Something similar applies to max_ptes_shared,
> > > >> where one might just end up eventually wasting a lot of memory in
> > > >> some applications.
> > > >>
> > > >> So IMHO, with MADV_COLLAPSE we should ignore/disable any
> > > >> heuristics that try to figure out what user space might be doing.
> > > >> We know exactly what user space asks for -- and that can be
> > > >> documented properly.
> > > >>
> > >
> > > Just a thought: if we ever want to implement khugepaged in user
> > > space, it could theoretically obtain similar information using,
> > > e.g., the pagemap. It wouldn't be race-free, but the question is
> > > whether that would matter.
> > >
> > > I consider the primary use case to be giving an application more
> > > precise control over actual THP placement.
> > >
> >
> > Good point about the pagemap, and agreed on the primary use case -
> > I'll make that clear in the v3 cover letter.
> >
> > > >
> > > > Sounds good to me. Would you also be in favor of decoupling
> > > > allocation semantics from khugepaged? I.e., we'd pick some default
> > > > gfp flags and not depend on
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag?
> > >
> > > Good question. It's not really a heuristic like that other stuff.
> > >
> > > Easy answer: we're not dealing with khugepaged, so anything in
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/ shouldn't apply?
> > >
> >
> > That's what I'm thinking now too. If there are no objections, I'll
> > proceed in that direction for v3.
>
> I agree, we should not treat MADV_COLLAPSE as "userspace khugepaged",
> IMHO. It is still best effort, but it is requested explicitly by the
> user, so the kernel should trust the user's judgement and ignore
> those max_ptes_* settings; we should assume users know what they are
> doing and understand the cost.
>

Thanks for reading and giving your thoughts, Yang. Glad to hear we are
aligned here!

I'll send out a v3 early next week. The only real change is the gfp
flags, but I want to avoid spamming folks so soon after v2.

Thanks,
Zach

> >
> > > Sure, we could have separate toggles for MADV_COLLAPSE.
> > >
> > > Maybe we simply want a dedicated syscall where we can specify additional
> > > options ... but maybe that simply over-complicates the problem.
> > >
> >
> > Thankfully process_madvise(2) has flags, and madvise(2) users can
> > always migrate to using process_madvise(2) on self. Piggybacking on
> > the madvise infrastructure for these "non-advice actions" (e.g.
> > MADV_PAGEOUT) seems to be the norm.
> >
> > Thanks as always for your time and thoughts!
> >
> > Zach
> >
> > > --
> > > Thanks,
> > >
> > > David / dhildenb
> > >


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-04-21  1:02 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-14 18:06 [PATCH v2 00/12] mm: userspace hugepage collapse Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-04-14 18:06 ` [PATCH v2 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
2022-04-15  0:04 ` [PATCH v2 00/12] mm: userspace hugepage collapse Peter Xu
2022-04-15  0:52   ` Zach O'Keefe
2022-04-15 13:39     ` Peter Xu
2022-04-15 20:04       ` Zach O'Keefe
2022-04-16 19:26         ` Peter Xu
2022-04-17 17:23           ` Zach O'Keefe
2022-04-19 14:33           ` David Hildenbrand
2022-04-19 15:55             ` Zach O'Keefe
2022-04-19 20:02               ` David Hildenbrand
2022-04-19 22:42                 ` Zach O'Keefe
2022-04-21  0:56                   ` Yang Shi
2022-04-21  1:02                     ` Zach O'Keefe
