linux-mm.kvack.org archive mirror
* [mm-unstable v7 00/18] mm: userspace hugepage collapse
@ 2022-07-06 23:59 Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
                   ` (18 more replies)
  0 siblings, 19 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

v7 Forward
--------------------------------

The major changes to v7 over v6[1] are:

1.  The mm_find_pmd() refactoring has been extended, and it now returns the
raw pmd_t* without additional checks (which was its original behavior).  For
MADV_COLLAPSE, we've tightened up our use of it and now check whether we've
raced with khugepaged when collapsing (Yang Shi).

2.  errno return values have been changed, and now deviate from madvise
convention in some places.  Most notably, this is to allow ENOMEM to mean
"memory allocation failed" to the user - the most important case being THP
allocation failure.

3.  We no longer do lru_add_drain() + lru_add_drain_all() if collapse fails
because pages aren't found on the LRU.  This has been simplified, and we
just do a single lru_add_drain_all() upfront (Yang Shi).

4.  struct collapse_control has been further simplified, and all flags
controlling collapse behavior are now squashed into a single .is_khugepaged
flag.  We also now kmalloc() this structure in the MADV_COLLAPSE context.

5.  Rebased on top of Yang Shi's "Cleanup transhuge_xxx helpers" series
[2] as well as Miaohe Lin's "A few cleanup patches for khugepaged" series
[3] which caused some refactoring and allowed for some nice
simplifications - most notably the VMA (re)validation checks.

6.  A new /proc/<pid>/smaps field, PMDMappable, has been added to inform
userspace which VMAs are eligible for MADV_COLLAPSE.

7.  A tracepoint was added to assist with MADV_COLLAPSE debugging.

8.  Selftest coverage has been tightened up and now covers collapsing
multiple hugepage-sized regions.

See the Changelog for more details.

v6 Forward
--------------------------------

v6 improves on v5[4] in 3 major ways:

1.  Changed MADV_COLLAPSE eligibility semantics.  In v5, MADV_COLLAPSE
ignored khugepaged max_ptes_* sysfs settings, as well as all sysfs defrag
settings.  v6 takes this further by also decoupling MADV_COLLAPSE from the
sysfs "enabled" setting.  MADV_COLLAPSE can now initiate a collapse of
memory into THPs in "madvise" and "never" mode, and doesn't ever require
VM_HUGEPAGE.  MADV_COLLAPSE retains its adherence to not operating on
VM_NOHUGEPAGE-marked VMAs.

2.  Thanks to a patch by Yang Shi to remove UMA hugepage preallocation,
hugepage allocation in khugepaged is independent of CONFIG_NUMA.  This
allows us to reuse all the allocation codepaths between collapse contexts,
greatly simplifying struct collapse_control.  Redundant khugepaged
heuristic flags have also been merged into a new enforce_page_heuristics
flag.

3.  Using MADV_COLLAPSE's new eligibility semantics, the hacks in the
selftests to disable khugepaged are no longer necessary, since we can test
MADV_COLLAPSE in "never" THP mode to prevent khugepaged interaction.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[5].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	This operation is independent of the system THP sysfs settings,
	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.
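
	As a rough illustration (not part of this series), a caller might
	look like the following sketch, assuming MADV_COLLAPSE and
	SYS_process_madvise are available from the installed headers;
	error handling is trimmed:

		#include <stddef.h>
		#include <sys/mman.h>      /* MADV_COLLAPSE */
		#include <sys/syscall.h>
		#include <sys/uio.h>
		#include <unistd.h>

		/* Collapse one hugepage-aligned region in the process
		 * referred to by pidfd. */
		static long collapse_remote(int pidfd, void *addr, size_t len)
		{
			struct iovec iov = { .iov_base = addr, .iov_len = len };

			/* The flags argument must currently be 0. */
			return syscall(SYS_process_madvise, pidfd, &iov, 1,
				       MADV_COLLAPSE, 0);
		}

	On success, the call returns the number of bytes advised, per the
	existing process_madvise(2) convention.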

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
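
	For the common case of a process collapsing its own memory, a
	correspondingly small sketch (again assuming MADV_COLLAPSE is
	exposed by the installed <sys/mman.h>):

		#include <stddef.h>
		#include <stdio.h>
		#include <sys/mman.h>

		/* Best-effort: ask the kernel to collapse our own mapping.
		 * addr/len should cover hugepage-aligned/sized regions. */
		static void collapse_self(void *addr, size_t len)
		{
			if (madvise(addr, len, MADV_COLLAPSE))
				perror("madvise(MADV_COLLAPSE)");
		}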

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which might prevent services from serving at their full
	rated load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, both of which increase steady-state
	memory footprint.  With MADV_COLLAPSE, we get the best of both
	worlds: peak upfront performance and lower RAM footprints.  Note
	that subsequent support for file-backed memory is required here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later,
	when the memory is hot, the implementation could
	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
	hugepage coverage and dTLB performance (a sketch of this flow
	follows the list below).  TCMalloc is such an implementation that
	could benefit from this[6].  A prior study of Google internal
	workloads during evaluation of Temeraire, a hugepage-aware
	enhancement to TCMalloc, showed that nearly 20% of all cpu cycles
	were spent in dTLB stalls, and that increasing hugepage coverage
	by even a small amount can help with that[7].

(3)	userfaultfd-based live migration of virtual machines satisfies UFFD
	faults by fetching native-sized pages over the network (to avoid
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.  Note that
	subsequent support for file/shmem-backed memory is required here.

(4)	HugeTLB high-granularity mapping allows a HugeTLB page to be
	mapped at different levels in the page tables[8].  As it's not
	"transparent" like THP, HugeTLB high-granularity mappings require
	an explicit user API.  It is intended that MADV_COLLAPSE be
	co-opted for this use case[9].  Note that subsequent support for
	HugeTLB memory is required here.
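
As a purely illustrative sketch of the allocator flow in use-case (2) above
(hypothetical helper names, not TCMalloc code; assumes MADV_COLLAPSE is
available from the installed headers):

	#include <stddef.h>
	#include <sys/mman.h>

	/* Subrelease a cold, native-page-sized chunk back to the system;
	 * this zaps the covering pmd if it was THP-mapped. */
	static void subrelease(void *chunk, size_t len)
	{
		madvise(chunk, len, MADV_DONTNEED);
	}

	/* Later, when the surrounding hugepage-sized span is hot again,
	 * ask the kernel to re-back it with a THP.  Best-effort: on
	 * failure the span simply stays backed by native pages. */
	static void reback_span(void *span, size_t len)
	{
		(void)madvise(span, len, MADV_COLLAPSE);
	}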

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].

Sequence of Patches
--------------------------------
* Patch 1 is a cleanup patch.

* Patch 2 (Yang Shi) removes UMA hugepage preallocation and makes
  khugepaged hugepage allocation independent of CONFIG_NUMA.

* Patches 3-8 perform refactoring of collapse logic within khugepaged.c
  and introduce the notion of a collapse context.

* Patch 9 introduces MADV_COLLAPSE and is the main patch in this series.

* Patches 10-13 add additional support: tracepoints, clean-ups,
  process_madvise(2), and /proc/<pid>/smaps output.

* Patches 14-18 add selftests.

Applies against mm-unstable.

Changelog
--------------------------------
v6 -> v7:
* Added 'mm/khugepaged: remove redundant transhuge_vma_suitable() check'
* 'mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA'
  -> Open-coded khugepaged_alloc_sleep() logic (Peter Xu)
* 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Refactored __collapse_huge_page_swapin() to return enum scan_result
  -> A few small cleanups (Yang Shi)
* 'mm/khugepaged: add flag to predicate khugepaged-only behavior'
  -> Renamed from 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> The flag is now ".is_khugepaged" (Peter Xu)
* 'mm/khugepaged: add flag to ignore THP sysfs enabled'
  -> Refactored to pass flag to hugepage_vma_check(), and to reuse
     .is_khugepaged flag (Peter Xu)
* 'mm/khugepaged: make allocation semantics context-specific'
  -> !CONFIG_SHMEM bugfix and minor changes (Yang Shi)
  -> Squashed into 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
  -> Removed .gfp member of struct collapse_control.  Instead, use the
     .is_khugepaged member to decide what gfp flags to use.
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Replaced multiple mm_find_pmd() callsites with
     find_pmd_or_thp_or_none() to make sure khugepaged doesn't collapse
     out from under us (Yang Shi)
  -> Added check_pmd_still_valid() helper
  -> Return SCAN_PMD_NULL if pmd_bad() (Yang Shi)
  -> Renamed mm_find_pmd() -> mm_find_pte_pmd()
  -> Renamed mm_find_pmd_raw() -> mm_find_pmd()
  -> Add mm_find_pmd() to split_huge_pmd_address()
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Replace SCAN_PAGE_LRU + lru_add_drain_all() retry logic with single
     lru_add_drain_all() upfront.
  -> errno mapping changes.  Most notably, use ENOMEM when memory
     allocation (most notably, THP allocation) fails.
  -> When !THP, madvise_collapse() and hugepage_madvise() return -EINVAL
     instead of BUG(). (Yang Shi)
* 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
  -> Squashed into 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse' (Yang Shi)
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Revert change to huge_memory:mm_khugepaged_scan_pmd tracepoint to
     retain ABI. (Yang Shi)
* Added 'mm/madvise: add huge_memory:mm_madvise_collapse tracepoint'
* Added 'proc/smaps: add PMDMappable field to smaps'
* Added 'selftests/vm: dedup hugepage allocation logic'
* Added 'selftests/vm: add selftest to verify multi THP collapse'
* Collected review tags
* Rebased on ??

v5 -> v6:
* Added 'mm: khugepaged: don't carry huge page to the next loop for
  !CONFIG_NUMA'
  (Yang Shi)
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Add a pmd_bad() check for nonhuge pmds (Peter Xu)
* 'mm/khugepaged: dedup and simplify hugepage alloc and charging'
  -> Remove dependency on 'mm/khugepaged: sched to numa node when collapse
     huge page'
  -> No more !NUMA casing
* 'mm/khugepaged: make allocation semantics context-specific'
  -> Renamed from 'mm/khugepaged: make hugepage allocation
     context-specific'
  -> Removed function pointer hooks. (David Rientjes)
  -> Added gfp_t member to control allocation semantics.
* 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> Squashed from
     'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*' and
     'mm/khugepaged: add flag to ignore page young/referenced requirement'.
     (David Rientjes)
* Added 'mm/khugepaged: add flag to ignore THP sysfs enabled'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Use hugepage_vma_check() instead of transparent_hugepage_active()
     to determine vma eligibility.
  -> Only retry collapse once per hugepage if pages aren't found on LRU
  -> Save last failed result for more accurate errno
  -> Refactored loop structure
  -> Renamed labels
* 'selftests/vm: modularize collapse selftests'
  -> Refactored into straightline code and removed loop over contexts.
* 'selftests/vm: add MADV_COLLAPSE collapse context to selftests'
  -> Removed ->init() and ->cleanup() hooks from struct collapse_context()
     (David Rientjes)
  -> MADV_COLLAPSE operates in "never" THP mode to prevent khugepaged
     interaction. Removed all the previous khugepaged hacks.
* Added 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
* Rebased on next-20220603

v4 -> v5:
* Fix kernel test robot <lkp@intel.com> errors
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fix khugepaged_alloc_page() UMA definition
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Add "fallthrough" pseudo keyword to fix -Wimplicit-fallthrough

v3 -> v4:
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Dropped pmd_none() check from find_pmd_or_thp_or_none()
  -> Moved SCAN_PMD_MAPPED after SCAN_PMD_NULL
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: add struct collapse_control'
  -> Updated commit description and some code comments
  -> Removed extra brackets added in khugepaged_find_target_node()
* Added 'mm/khugepaged: dedup hugepage allocation and charging code'
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Has been majorly reworked to replace ->gfp() and ->alloc_hpage()
     struct collapse_control hooks with a ->alloc_charge_hpage() hook
     which makes node-allocation, gfp flags, node scheduling, hpage
     allocation, and accounting/charging context-specific.
  -> Dropped <lkp@intel.com> from sign-offs
* Added 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Replaces 'mm/khugepaged: add struct collapse_result'
* Dropped 'mm/khugepaged: add struct collapse_result'
* 'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/khugepaged: add flag to ignore page young/referenced requirement'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Moved struct collapse_control* argument to end of alloc_hpage()
  -> Some refactoring to rebase on top changes to struct
     collapse_control hook changes and other previous commits.
  -> Reworded commit description
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Renamed from 'mm/khugepaged: remove khugepaged prefix from shared
     collapse functions'
  -> Instead of dropping "khugepaged_" prefix, replace with
     "hpage_collapse_"
  -> Dropped <lkp@intel.com> from sign-offs
* Rebased onto next-20220502

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     revalidation.
  -> Reword patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse
  functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are
  entirely new or reworked.
* Collapse eligibility criteria has changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now match those of khugepaged for the same VMA, instead of
  the gfp flags used at-fault by the calling process for that VMA.
* Collapse semantics have changed: The collapse semantics for multiple
  VMAs spanned by a single MADV_COLLAPSE call are now independent,
  whereas before the idea was to allow direct reclaim/compaction if any
  spanned VMA permitted it.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of
  hugepages in a batched fashion with the aim of doing multiple page table
  updates inside a single mmap_lock write.  This has been changed, and the
  implementation now collapses each hugepage-aligned/sized region
  iteratively.  This was motivated by an experiment which showed that, when
  multiple threads were concurrently faulting during a MADV_COLLAPSE
  operation, mean and tail latency to acquire mmap_lock in read for threads
  in the fault path were improved by using a batch size of 1 (batch sizes
  of 1, 8, 16, 32 were tested)[11].
* Added: If a collapse operation fails because a page isn't found on the
  LRU, do a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/linux-mm/20220604004004.954674-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/YrJJoP5vrZflvwd0@google.com/
[3] https://lore.kernel.org/linux-mm/20220625092816.4856-1-linmiaohe@huawei.com/
[4] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[5] https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[6] https://github.com/google/tcmalloc/tree/master/tcmalloc
[7] https://research.google/pubs/pub50370/
[8] https://lore.kernel.org/linux-mm/CAHS8izPnJd5EQjUi9cOk=03u3X1rk0PexTQZi+bEE4VMtFfksQ@mail.gmail.com/
[9] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@google.com/
[10] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[11] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/


Zach O'Keefe (18):
  mm/khugepaged: remove redundant transhuge_vma_suitable() check
  mm: khugepaged: don't carry huge page to the next loop for
    !CONFIG_NUMA
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: dedup and simplify hugepage alloc and charging
  mm/khugepaged: pipe enum scan_result codes back to callers
  mm/khugepaged: add flag to predicate khugepaged-only behavior
  mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: rename prefix of shared collapse functions
  mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  proc/smaps: add PMDMappable field to smaps
  selftests/vm: modularize collapse selftests
  selftests/vm: dedup hugepage allocation logic
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add selftest to verify recollapse of THPs
  selftests/vm: add selftest to verify multi THP collapse

 Documentation/filesystems/proc.rst           |  10 +-
 arch/alpha/include/uapi/asm/mman.h           |   2 +
 arch/mips/include/uapi/asm/mman.h            |   2 +
 arch/parisc/include/uapi/asm/mman.h          |   2 +
 arch/xtensa/include/uapi/asm/mman.h          |   2 +
 fs/proc/task_mmu.c                           |   4 +-
 include/linux/huge_mm.h                      |  23 +-
 include/trace/events/huge_memory.h           |  23 +
 include/uapi/asm-generic/mman-common.h       |   2 +
 mm/huge_memory.c                             |  32 +-
 mm/internal.h                                |   2 +-
 mm/khugepaged.c                              | 745 +++++++++++--------
 mm/ksm.c                                     |  10 +
 mm/madvise.c                                 |  11 +-
 mm/memory.c                                  |   4 +-
 mm/rmap.c                                    |  15 +-
 tools/include/uapi/asm-generic/mman-common.h |   2 +
 tools/testing/selftests/vm/khugepaged.c      | 563 ++++++++------
 18 files changed, 845 insertions(+), 609 deletions(-)

-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 20:38   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 02/18] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

transhuge_vma_suitable() is called twice in the hugepage_vma_revalidate()
path.  Remove the first check, and rely on the second check inside
hugepage_vma_check().

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cfe231c5958f..5269d15e20f6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -918,8 +918,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!transhuge_vma_suitable(vma, address))
-		return SCAN_ADDRESS_RANGE;
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
 		return SCAN_VMA_CHECK;
 	/*
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 02/18] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control Zach O'Keefe
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

From: Yang Shi <shy828301@gmail.com>

For !CONFIG_NUMA, khugepaged has an optimization to reduce huge page
allocation calls by carrying an allocated-but-failed-to-collapse huge page
over to the next loop.  CONFIG_NUMA doesn't do so, since the next loop may
try to collapse a huge page from a different node, so it doesn't make much
sense to carry it.

But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, which means a huge page may be allocated
even though there is no suitable range to collapse.  The page is then just
freed if khugepaged already made enough progress.  This could make a NUMA=n
run have 5 times as many thp_collapse_alloc events as a NUMA=y run.  The
many pointless THP allocations actually make things worse and render the
optimization pointless.

This could be fixed by carrying the huge page across scans, but that would
complicate the code further and the huge page may be carried indefinitely.
If we take one step back, the optimization itself seems not worth keeping
nowadays, since:
  * Not too many users build a NUMA=n kernel nowadays, even when the kernel
    is actually running on a non-NUMA machine.  Some small devices may run
    a NUMA=n kernel, but I don't think they actually use THP.
  * Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
    stored on the per-cpu lists"), THPs can be cached by pcp.  This
    actually does much of the job done by the optimization.

Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Co-developed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/khugepaged.c | 120 +++++++++++-------------------------------------
 1 file changed, 26 insertions(+), 94 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5269d15e20f6..196eaadbf415 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -796,29 +796,16 @@ static int khugepaged_find_target_node(void)
 	last_khugepaged_target_node = target_node;
 	return target_node;
 }
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+#else
+static int khugepaged_find_target_node(void)
 {
-	if (IS_ERR(*hpage)) {
-		if (!*wait)
-			return false;
-
-		*wait = false;
-		*hpage = NULL;
-		khugepaged_alloc_sleep();
-	} else if (*hpage) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	return true;
+	return 0;
 }
+#endif
 
 static struct page *
 khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
-
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -830,74 +817,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
-#else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
-	struct page *page;
-
-	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
-			   HPAGE_PMD_ORDER);
-	if (page)
-		prep_transhuge_page(page);
-	return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
-	struct page *hpage;
-
-	do {
-		hpage = alloc_khugepaged_hugepage();
-		if (!hpage) {
-			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-			if (!*wait)
-				return NULL;
-
-			*wait = false;
-			khugepaged_alloc_sleep();
-		} else
-			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(hugepage_flags_enabled()));
-
-	return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
-	/*
-	 * If the hpage allocated earlier was briefly exposed in page cache
-	 * before collapse_file() failed, it is possible that racing lookups
-	 * have not yet completed, and would then be unpleasantly surprised by
-	 * finding the hpage reused for the same mapping at a different offset.
-	 * Just release the previous allocation if there is any danger of that.
-	 */
-	if (*hpage && page_count(*hpage) > 1) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	if (!*hpage)
-		*hpage = khugepaged_alloc_hugepage(wait);
-
-	if (unlikely(!*hpage))
-		return false;
-
-	return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
-{
-	VM_BUG_ON(!*hpage);
-
-	return  *hpage;
-}
-#endif
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
@@ -1148,8 +1067,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
@@ -1951,8 +1872,10 @@ static void collapse_file(struct mm_struct *mm,
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	/* TODO: tracepoints */
 }
 
@@ -2197,10 +2120,7 @@ static void khugepaged_do_scan(void)
 
 	lru_add_drain_all();
 
-	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
-			break;
-
+	while (true) {
 		cond_resched();
 
 		if (unlikely(kthread_should_stop() || try_to_freeze()))
@@ -2216,10 +2136,22 @@ static void khugepaged_do_scan(void)
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
-	}
 
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
+		if (progress >= pages)
+			break;
+
+		if (IS_ERR(hpage)) {
+			/*
+			 * If fail to allocate the first time, try to sleep for
+			 * a while.  When hit again, cancel the scan.
+			 */
+			if (!wait)
+				break;
+			wait = false;
+			hpage = NULL;
+			khugepaged_alloc_sleep();
+		}
+	}
 }
 
 static bool khugepaged_should_wakeup(void)
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 02/18] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-08 21:01   ` Andrew Morton
  2022-07-06 23:59 ` [mm-unstable v7 04/18] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize hugepage collapse by introducing struct collapse_control.
This structure serves to describe the properties of the requested
collapse, as well as to serve as a local scratch pad to use during the
collapse itself.

Start by moving global per-node khugepaged statistics into this
new structure.  Note that this structure is still statically allocated
since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 196eaadbf415..f1ef02d9fe07 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() */
+	int last_target_node;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -735,9 +743,12 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
 
-static bool khugepaged_scan_abort(int nid)
+struct collapse_control khugepaged_collapse_control = {
+	.last_target_node = NUMA_NO_NODE,
+};
+
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -749,11 +760,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -772,32 +783,31 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++)
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1075,10 +1085,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct page **hpage)
+static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1098,7 +1107,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1164,16 +1173,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1224,7 +1233,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
 				referenced, unmapped);
@@ -1879,8 +1888,9 @@ static void collapse_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -1891,7 +1901,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -1916,11 +1926,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -1953,7 +1963,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -1961,8 +1971,9 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -1973,7 +1984,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2050,12 +2062,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage,
+						     cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2111,7 +2124,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2132,7 +2145,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2188,7 +2201,7 @@ static int khugepaged(void *none)
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&khugepaged_collapse_control);
 		khugepaged_wait_work();
 	}
 
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 04/18] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 05/18] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following code is duplicated in collapse_huge_page() and
collapse_file():

        gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

	new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                goto out;
        }

        if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                result = SCAN_CGROUP_CHARGE_FAIL;
                goto out;
        }
        count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

Also, "node" is passed as an argument to both collapse_huge_page() and
collapse_file() and obtained the same way, via
khugepaged_find_target_node().

Move all this into a new helper, alloc_charge_hpage(), and remove the
duplicate code from collapse_huge_page() and collapse_file().  Also,
simplify khugepaged_alloc_page() by returning a bool indicating
allocation success instead of a copy of the allocated struct page *.

Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 78 ++++++++++++++++++++++---------------------------
 1 file changed, 35 insertions(+), 43 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f1ef02d9fe07..8068adf24620 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -813,19 +813,18 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 }
 #endif
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
-		return NULL;
+		return false;
 	}
 
 	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	return true;
 }
 
 /*
@@ -921,10 +920,24 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
+			      struct collapse_control *cc)
+{
+	/* Only allocate from the target node */
+	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	int node = khugepaged_find_target_node(cc);
+
+	if (!khugepaged_alloc_page(hpage, gfp, node))
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
+		return SCAN_CGROUP_CHARGE_FAIL;
+	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
+	return SCAN_SUCCEED;
+}
+
+static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			       struct page **hpage, int referenced,
+			       int unmapped, struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -935,13 +948,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated = 0, result = 0;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
-	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
@@ -949,17 +958,12 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
-		goto out_nolock;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out_nolock;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+
+	new_page = *hpage;
 
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
@@ -1233,10 +1237,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		collapse_huge_page(mm, address, hpage, referenced, unmapped,
+				   cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1504,7 +1507,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @file: file that collapse on
  * @start: collapse start address
  * @hpage: new allocated huge page for collapse
- * @node: appointed node the new huge page allocate from
+ * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1521,12 +1524,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start,
-		struct page **hpage, int node)
+static void collapse_file(struct mm_struct *mm, struct file *file,
+			  pgoff_t start, struct page **hpage,
+			  struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
-	gfp_t gfp;
 	struct page *new_page;
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
@@ -1538,20 +1540,11 @@ static void collapse_file(struct mm_struct *mm,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+	new_page = *hpage;
 
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
@@ -1963,8 +1956,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
-			collapse_file(mm, file, start, hpage, node);
+			collapse_file(mm, file, start, hpage, cc);
 		}
 	}
 
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 05/18] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 04/18] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior Zach O'Keefe
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Pipe enum scan_result codes back through return values of functions
downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
inform callers if the operation was successful, and if not, why.

Since khugepaged_scan_pmd()'s return value already has a specific
meaning (whether mmap_lock was unlocked or not), add a bool* argument
to khugepaged_scan_pmd() to retrieve this information.

Change khugepaged to take action based on the return values of
khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
deep within the collapsing functions themselves.

hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be
more consistent with enum scan_result propagation.

Remove dependency on error pointers to communicate to khugepaged that
allocation failed and it should sleep; instead just use the result of
the scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 233 ++++++++++++++++++++++++------------------------
 1 file changed, 117 insertions(+), 116 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8068adf24620..147f5828f052 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -558,7 +558,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	pte_t *_pte;
-	int none_or_zero = 0, shared = 0, result = 0, referenced = 0;
+	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
 
 	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
@@ -672,13 +672,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 						    referenced, writable, result);
-		return 1;
+		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 					    referenced, writable, result);
-	return 0;
+	return result;
 }
 
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
@@ -818,7 +818,6 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		*hpage = ERR_PTR(-ENOMEM);
 		return false;
 	}
 
@@ -830,8 +829,7 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 /*
  * If mmap_lock temporarily dropped, revalidate vma
  * before taking mmap_lock.
- * Return 0 if succeeds, otherwise return none-zero
- * value (scan code).
+ * Returns enum scan_result value.
  */
 
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
@@ -857,7 +855,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	 */
 	if (!vma->anon_vma || !vma_is_anonymous(vma))
 		return SCAN_VMA_CHECK;
-	return 0;
+	return SCAN_SUCCEED;
 }
 
 /*
@@ -868,10 +866,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
  * Note that if false is returned, mmap_lock will be released.
  */
 
-static bool __collapse_huge_page_swapin(struct mm_struct *mm,
-					struct vm_area_struct *vma,
-					unsigned long haddr, pmd_t *pmd,
-					int referenced)
+static int __collapse_huge_page_swapin(struct mm_struct *mm,
+				       struct vm_area_struct *vma,
+				       unsigned long haddr, pmd_t *pmd,
+				       int referenced)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
@@ -902,12 +900,13 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		 */
 		if (ret & VM_FAULT_RETRY) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-			return false;
+			/* Likely, but not guaranteed, that page lock failed */
+			return SCAN_PAGE_LOCK;
 		}
 		if (ret & VM_FAULT_ERROR) {
 			mmap_read_unlock(mm);
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-			return false;
+			return SCAN_FAIL;
 		}
 		swapped_in++;
 	}
@@ -917,7 +916,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		lru_add_drain();
 
 	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
-	return true;
+	return SCAN_SUCCEED;
 }
 
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -935,17 +934,17 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }
 
-static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			       struct page **hpage, int referenced,
-			       int unmapped, struct collapse_control *cc)
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			      int referenced, int unmapped,
+			      struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
 	pgtable_t pgtable;
-	struct page *new_page;
+	struct page *hpage;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
+	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 
@@ -959,15 +958,13 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = alloc_charge_hpage(&hpage, mm, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
-	new_page = *hpage;
-
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result) {
+	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -979,14 +976,16 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 	}
 
-	/*
-	 * __collapse_huge_page_swapin will return with mmap_lock released
-	 * when it fails. So we jump out_nolock directly in that case.
-	 * Continuing to collapse causes inconsistency.
-	 */
-	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
-						     pmd, referenced)) {
-		goto out_nolock;
+	if (unmapped) {
+		/*
+		 * __collapse_huge_page_swapin will return with mmap_lock
+		 * released when it fails. So we jump out_nolock directly in
+		 * that case.  Continuing to collapse causes inconsistency.
+		 */
+		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
+						     referenced);
+		if (result != SCAN_SUCCEED)
+			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
@@ -997,7 +996,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_write_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result)
+	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	if (mm_find_pmd(mm, address) != pmd)
@@ -1024,11 +1023,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+	result =  __collapse_huge_page_isolate(vma, address, pte,
+					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
-	if (unlikely(!isolated)) {
+	if (unlikely(result != SCAN_SUCCEED)) {
 		pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
@@ -1040,7 +1039,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
-		result = SCAN_FAIL;
 		goto out_up_write;
 	}
 
@@ -1050,8 +1048,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	anon_vma_unlock_write(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
-			&compound_pagelist);
+	__collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
+				  &compound_pagelist);
 	pte_unmap(pte);
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), but
@@ -1059,43 +1057,42 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * avoid the copy_huge_page writes to become visible after
 	 * the set_pmd_at() write.
 	 */
-	__SetPageUptodate(new_page);
+	__SetPageUptodate(hpage);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+	_pmd = mk_huge_pmd(hpage, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
-	lru_cache_add_inactive_or_unevictable(new_page, vma);
+	page_add_new_anon_rmap(hpage, vma, address);
+	lru_cache_add_inactive_or_unevictable(hpage, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 
-	*hpage = NULL;
+	hpage = NULL;
 
-	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage)) {
-		mem_cgroup_uncharge(page_folio(*hpage));
-		put_page(*hpage);
+	if (hpage) {
+		mem_cgroup_uncharge(page_folio(hpage));
+		put_page(hpage);
 	}
-	trace_mm_collapse_huge_page(mm, isolated, result);
-	return;
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	return result;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, struct page **hpage,
+			       unsigned long address, bool *mmap_locked,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
+	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
@@ -1232,19 +1229,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
-		ret = 1;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+	if (result == SCAN_SUCCEED) {
+		result = collapse_huge_page(mm, address, referenced,
+					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, referenced, unmapped,
-				   cc);
+		*mmap_locked = false;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
 				     none_or_zero, result, unmapped);
-	return ret;
+	return result;
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1506,7 +1503,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @mm: process address space where collapse happens
  * @file: file that collapse on
  * @start: collapse start address
- * @hpage: new allocated huge page for collapse
  * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
@@ -1524,12 +1520,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm, struct file *file,
-			  pgoff_t start, struct page **hpage,
-			  struct collapse_control *cc)
+static int collapse_file(struct mm_struct *mm, struct file *file,
+			 pgoff_t start, struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
-	struct page *new_page;
+	struct page *hpage;
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
@@ -1540,12 +1535,10 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = alloc_charge_hpage(&hpage, mm, cc);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	new_page = *hpage;
-
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
 	 * almost certainly a no-op because most of the pages must be present
@@ -1562,14 +1555,14 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		}
 	} while (1);
 
-	__SetPageLocked(new_page);
+	__SetPageLocked(hpage);
 	if (is_shmem)
-		__SetPageSwapBacked(new_page);
-	new_page->index = start;
-	new_page->mapping = mapping;
+		__SetPageSwapBacked(hpage);
+	hpage->index = start;
+	hpage->mapping = mapping;
 
 	/*
-	 * At this point the new_page is locked and not up-to-date.
+	 * At this point the hpage is locked and not up-to-date.
 	 * It's safe to insert it into the page cache, because nobody would
 	 * be able to map it or use it in another way until we unlock it.
 	 */
@@ -1597,7 +1590,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, new_page);
+				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -1739,19 +1732,19 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		list_add_tail(&page->lru, &pagelist);
 
 		/* Finally, replace with the new page. */
-		xas_store(&xas, new_page);
+		xas_store(&xas, hpage);
 		continue;
 out_unlock:
 		unlock_page(page);
 		put_page(page);
 		goto xa_unlocked;
 	}
-	nr = thp_nr_pages(new_page);
+	nr = thp_nr_pages(hpage);
 
 	if (is_shmem)
-		__mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
+		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
 	else {
-		__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
+		__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
 		filemap_nr_thps_inc(mapping);
 		/*
 		 * Paired with smp_mb() in do_dentry_open() to ensure
@@ -1762,21 +1755,21 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		smp_mb();
 		if (inode_is_open_for_write(mapping->host)) {
 			result = SCAN_FAIL;
-			__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
+			__mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
 			filemap_nr_thps_dec(mapping);
 			goto xa_locked;
 		}
 	}
 
 	if (nr_none) {
-		__mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
+		__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
 		/* nr_none is always 0 for non-shmem. */
-		__mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
+		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
 	}
 
 	/* Join all the small entries into a single multi-index entry */
 	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-	xas_store(&xas, new_page);
+	xas_store(&xas, hpage);
 xa_locked:
 	xas_unlock_irq(&xas);
 xa_unlocked:
@@ -1798,11 +1791,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		index = start;
 		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
 			while (index < page->index) {
-				clear_highpage(new_page + (index % HPAGE_PMD_NR));
+				clear_highpage(hpage + (index % HPAGE_PMD_NR));
 				index++;
 			}
-			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
-					page);
+			copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
+				      page);
 			list_del(&page->lru);
 			page->mapping = NULL;
 			page_ref_unfreeze(page, 1);
@@ -1813,23 +1806,22 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 			index++;
 		}
 		while (index < end) {
-			clear_highpage(new_page + (index % HPAGE_PMD_NR));
+			clear_highpage(hpage + (index % HPAGE_PMD_NR));
 			index++;
 		}
 
-		SetPageUptodate(new_page);
-		page_ref_add(new_page, HPAGE_PMD_NR - 1);
+		SetPageUptodate(hpage);
+		page_ref_add(hpage, HPAGE_PMD_NR - 1);
 		if (is_shmem)
-			set_page_dirty(new_page);
-		lru_cache_add(new_page);
+			set_page_dirty(hpage);
+		lru_cache_add(hpage);
 
 		/*
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
 		retract_page_tables(mapping, start);
-		*hpage = NULL;
-
-		khugepaged_pages_collapsed++;
+		unlock_page(hpage);
+		hpage = NULL;
 	} else {
 		struct page *page;
 
@@ -1868,22 +1860,23 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		VM_BUG_ON(nr_none);
 		xas_unlock_irq(&xas);
 
-		new_page->mapping = NULL;
+		hpage->mapping = NULL;
 	}
 
-	unlock_page(new_page);
+	if (hpage)
+		unlock_page(hpage);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage)) {
-		mem_cgroup_uncharge(page_folio(*hpage));
-		put_page(*hpage);
+	if (hpage) {
+		mem_cgroup_uncharge(page_folio(hpage));
+		put_page(hpage);
 	}
 	/* TODO: tracepoints */
+	return result;
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				pgoff_t start, struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -1956,16 +1949,16 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, hpage, cc);
+			result = collapse_file(mm, file, start, cc);
 		}
 	}
 
 	/* TODO: tracepoints */
+	return result;
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				pgoff_t start, struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -1975,8 +1968,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 }
 #endif
 
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage,
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -1990,6 +1982,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 	VM_BUG_ON(!pages);
 	lockdep_assert_held(&khugepaged_mm_lock);
+	*result = SCAN_FAIL;
 
 	if (khugepaged_scan.mm_slot)
 		mm_slot = khugepaged_scan.mm_slot;
@@ -2039,7 +2032,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
 
 		while (khugepaged_scan.address < hend) {
-			int ret;
+			bool mmap_locked = true;
+
 			cond_resched();
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
@@ -2053,20 +2047,28 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage,
-						     cc);
+				*result = khugepaged_scan_file(mm, file, pgoff,
+							       cc);
+				mmap_locked = false;
 				fput(file);
 			} else {
-				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address,
-						hpage, cc);
+				*result = khugepaged_scan_pmd(mm, vma,
+							      khugepaged_scan.address,
+							      &mmap_locked, cc);
 			}
+			if (*result == SCAN_SUCCEED)
+				++khugepaged_pages_collapsed;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-			if (ret)
-				/* we released mmap_lock so break loop */
+			if (!mmap_locked)
+				/*
+				 * We released mmap_lock so break loop.  Note
+				 * that we drop mmap_lock before all hugepage
+				 * allocations, so if allocation fails, we are
+				 * guaranteed to break here and report the
+				 * correct result back to caller.
+				 */
 				goto breakouterloop_mmap_lock;
 			if (progress >= pages)
 				goto breakouterloop;
@@ -2118,10 +2120,10 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 	bool wait = true;
+	int result = SCAN_SUCCEED;
 
 	lru_add_drain_all();
 
@@ -2137,7 +2139,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage, cc);
+							    &result, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2145,7 +2147,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (progress >= pages)
 			break;
 
-		if (IS_ERR(hpage)) {
+		if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
 			/*
 			 * If fail to allocate the first time, try to sleep for
 			 * a while.  When hit again, cancel the scan.
@@ -2153,7 +2155,6 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 			if (!wait)
 				break;
 			wait = false;
-			hpage = NULL;
 			khugepaged_alloc_sleep();
 		}
 	}
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 05/18] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 20:43   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() Zach O'Keefe
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add .is_khugepaged flag to struct collapse_control so
khugepaged-specific behavior can be elided by MADV_COLLAPSE context.

Start by protecting khugepaged-specific heuristics with this flag.  With
MADV_COLLAPSE, the user presumably has reason to believe the collapse
will be beneficial and khugepaged heuristics shouldn't prevent the user
from doing so:

1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]

2) requirement that some pages in region being collapsed be young or
   referenced
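
A condensed, illustrative sketch of the resulting pattern (not a standalone
function; lifted and trimmed from the diff below):

	if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
		/* max_ptes_none only limits khugepaged-initiated collapse */
		if (!userfaultfd_armed(vma) &&
		    (++none_or_zero <= khugepaged_max_ptes_none ||
		     !cc->is_khugepaged))
			continue;
		result = SCAN_EXCEED_NONE_PTE;
		goto out;
	}
	...
	/* the "referenced/young pte" requirement is likewise khugepaged-only */
	if (unlikely(!writable))
		result = SCAN_PAGE_RO;
	else if (unlikely(cc->is_khugepaged && !referenced))
		result = SCAN_LACK_REFERENCED_PAGE;
	else
		result = SCAN_SUCCEED;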

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---

v6 -> v7: There is no functional change here from v6, just a renaming of
	  flags to explicitly be predicated on khugepaged.
---
 mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 147f5828f052..d89056d8cbad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -73,6 +73,8 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  * default collapse hugepages if there is at least one pte mapped like
  * it would have happened if the vma was large enough during page
  * fault.
+ *
+ * Note that these are only respected if collapse was initiated by khugepaged.
  */
 static unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
@@ -86,6 +88,8 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	bool is_khugepaged;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -554,6 +558,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -567,7 +572,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->is_khugepaged)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -587,8 +593,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->is_khugepaged && page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -654,10 +660,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (PageCompound(page))
 			list_add_tail(&page->lru, compound_pagelist);
 next:
-		/* There should be enough young pte to collapse the page */
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		/*
+		 * If collapse was initiated by khugepaged, check that there is
+		 * enough young pte to justify collapsing the page
+		 */
+		if (cc->is_khugepaged &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 
 		if (pte_write(pteval))
@@ -666,7 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	if (unlikely(!writable)) {
 		result = SCAN_PAGE_RO;
-	} else if (unlikely(!referenced)) {
+	} else if (unlikely(cc->is_khugepaged && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -745,6 +755,7 @@ static void khugepaged_alloc_sleep(void)
 
 
 struct collapse_control khugepaged_collapse_control = {
+	.is_khugepaged = true,
 	.last_target_node = NUMA_NO_NODE,
 };
 
@@ -1023,7 +1034,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	result =  __collapse_huge_page_isolate(vma, address, pte,
+	result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
@@ -1114,7 +1125,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->is_khugepaged) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1133,7 +1145,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->is_khugepaged)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1163,8 +1176,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_unmap;
 		}
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->is_khugepaged &&
+		    page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -1218,14 +1232,22 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+
+		/*
+		 * If collapse was initiated by khugepaged, check that there is
+		 * enough young pte to justify collapsing the page
+		 */
+		if (cc->is_khugepaged &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 	}
 	if (!writable) {
 		result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+	} else if (cc->is_khugepaged &&
+		   (!referenced ||
+		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1894,7 +1916,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->is_khugepaged &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -1945,7 +1968,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->is_khugepaged) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 20:57   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage Zach O'Keefe
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].

hugepage_vma_check() is the authority on determining if a VMA is eligible
for THP allocation/collapse, and currently enforces the sysfs THP settings.
Add a flag to disable these checks.  For now, only apply this arg to anon
and file, which use /sys/kernel/transparent_hugepage/enabled.  We can
expand this to shmem, which uses
/sys/kernel/transparent_hugepage/shmem_enabled, later.

Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
THP flag in hugepage_vma_check()", this path also did not respect "never" THP
mode.  As such, this restores the previous behavior of
collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
comment in code for justification why this is OK.
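
For reference, a condensed view of how the new argument ends up wired at the
call sites touched by the diff below (illustrative only):

	/* /proc/<pid>/smaps THPeligible: keep enforcing sysfs settings */
	hugepage_vma_check(vma, vma->vm_flags, true, false, true);

	/* khugepaged scan: keep enforcing sysfs settings */
	hugepage_vma_check(vma, vma->vm_flags, false, false, true);

	/* collapse_pte_mapped_thp(): sysfs settings deliberately elided */
	hugepage_vma_check(vma, vma->vm_flags, false, false, false);

	/* hugepage_vma_revalidate(): enforce only for khugepaged collapse */
	hugepage_vma_check(vma, vma->vm_flags, false, false, cc->is_khugepaged);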

[1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 fs/proc/task_mmu.c      |  2 +-
 include/linux/huge_mm.h |  9 ++++-----
 mm/huge_memory.c        | 14 ++++++--------
 mm/khugepaged.c         | 25 ++++++++++++++-----------
 mm/memory.c             |  4 ++--
 5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 34d292cec79a..f8cd58846a28 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -866,7 +866,7 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %d\n",
-		   hugepage_vma_check(vma, vma->vm_flags, true, false));
+		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 37f2f11a6d7e..00312fc251c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -168,9 +168,8 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 
-bool hugepage_vma_check(struct vm_area_struct *vma,
-			unsigned long vm_flags,
-			bool smaps, bool in_pf);
+bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
+			bool smaps, bool in_pf, bool enforce_sysfs);
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
@@ -321,8 +320,8 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 }
 
 static inline bool hugepage_vma_check(struct vm_area_struct *vma,
-				       unsigned long vm_flags,
-				       bool smaps, bool in_pf)
+				      unsigned long vm_flags, bool smaps,
+				      bool in_pf, bool enforce_sysfs)
 {
 	return false;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index da300ce9dedb..4fbe43dc1568 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -69,9 +69,8 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
-bool hugepage_vma_check(struct vm_area_struct *vma,
-			unsigned long vm_flags,
-			bool smaps, bool in_pf)
+bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
+			bool smaps, bool in_pf, bool enforce_sysfs)
 {
 	if (!vma->vm_mm)		/* vdso */
 		return false;
@@ -120,11 +119,10 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
 	if (!in_pf && shmem_file(vma->vm_file))
 		return shmem_huge_enabled(vma);
 
-	if (!hugepage_flags_enabled())
-		return false;
-
-	/* THP settings require madvise. */
-	if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always())
+	/* Enforce sysfs THP requirements as necessary */
+	if (enforce_sysfs &&
+	    (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
+					   !hugepage_flags_always())))
 		return false;
 
 	/* Only regular file is valid */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d89056d8cbad..b0e20db3f805 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -478,7 +478,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_flags_enabled()) {
-		if (hugepage_vma_check(vma, vm_flags, false, false))
+		if (hugepage_vma_check(vma, vm_flags, false, false, true))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -844,7 +844,8 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
  */
 
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		struct vm_area_struct **vmap)
+				   struct vm_area_struct **vmap,
+				   struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
 
@@ -855,7 +856,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
+				cc->is_khugepaged))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -974,7 +976,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, &vma, cc);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1006,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, &vma, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1350,12 +1352,13 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		return;
 
 	/*
-	 * This vm_flags may not have VM_HUGEPAGE if the page was not
-	 * collapsed by this mm. But we can still collapse if the page is
-	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
-	 * will not fail the vma for missing VM_HUGEPAGE
+	 * If we are here, we've succeeded in replacing all the native pages
+	 * in the page cache with a single hugepage. If a mm were to fault-in
+	 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
+	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
+	 * analogously elide sysfs THP settings here.
 	 */
-	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE, false, false))
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
 		return;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2042,7 +2045,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!hugepage_vma_check(vma, vma->vm_flags, false, false)) {
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
 skip:
 			progress++;
 			continue;
diff --git a/mm/memory.c b/mm/memory.c
index 8917bea2f0bc..96cd776e84f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5001,7 +5001,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    hugepage_vma_check(vma, vm_flags, false, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -5035,7 +5035,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    hugepage_vma_check(vma, vm_flags, false, true)) {
+	    hugepage_vma_check(vma, vm_flags, false, true, true)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 21:03   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE, which needs to know which
hugepage-aligned/sized regions are already pmd-mapped.

In order to determine if a pmd already maps a hugepage, refactor
mm_find_pmd():

Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
fix buggy race with THP fault") behavior.  ksm was the only caller that
explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
there (pmd_present() and pmd_trans_huge() checks).

Undo the change from commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race
with THP fault") that open-coded the pmd lookup in split_huge_pmd_address(),
and use mm_find_pmd() instead.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/huge_memory.c                   | 18 +--------
 mm/internal.h                      |  2 +-
 mm/khugepaged.c                    | 60 ++++++++++++++++++++++++------
 mm/ksm.c                           | 10 +++++
 mm/rmap.c                          | 15 +++-----
 6 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..55392bf30a03 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -11,6 +11,7 @@
 	EM( SCAN_FAIL,			"failed")			\
 	EM( SCAN_SUCCEED,		"succeeded")			\
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
+	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4fbe43dc1568..fb76db6c703e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2363,25 +2363,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze, struct folio *folio)
 {
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
 
-	pgd = pgd_offset(vma->vm_mm, address);
-	if (!pgd_present(*pgd))
+	if (!pmd)
 		return;
 
-	p4d = p4d_offset(pgd, address);
-	if (!p4d_present(*p4d))
-		return;
-
-	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
-		return;
-
-	pmd = pmd_offset(pud, address);
-
 	__split_huge_pmd(vma, pmd, address, freeze, folio);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 6e14749ad1e5..ef8c23fb678f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -188,7 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
-extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
  * in mm/page_alloc.c
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0e20db3f805..c7a09cc9a0e8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -28,6 +28,7 @@ enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
 	SCAN_PMD_NULL,
+	SCAN_PMD_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -871,6 +872,45 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	if (pmd_bad(pmde))
+		return SCAN_PMD_NULL;
+	return SCAN_SUCCEED;
+}
+
+static int check_pmd_still_valid(struct mm_struct *mm,
+				 unsigned long address,
+				 pmd_t *pmd)
+{
+	pmd_t *new_pmd;
+	int result = find_pmd_or_thp_or_none(mm, address, &new_pmd);
+
+	if (result != SCAN_SUCCEED)
+		return result;
+	if (new_pmd != pmd)
+		return SCAN_FAIL;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -982,9 +1022,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 	}
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -1012,7 +1051,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
-	if (mm_find_pmd(mm, address) != pmd)
+	result = check_pmd_still_valid(mm, address, pmd);
+	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
@@ -1115,11 +1155,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -1373,8 +1411,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!PageHead(hpage))
 		goto drop_hpage;
 
-	pmd = mm_find_pmd(mm, haddr);
-	if (!pmd)
+	if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
 		goto drop_hpage;
 
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
@@ -1492,8 +1529,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
 			continue;
 		mm = vma->vm_mm;
-		pmd = mm_find_pmd(mm, addr);
-		if (!pmd)
+		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 		/*
 		 * We need exclusive mmap_lock to retract page table.
diff --git a/mm/ksm.c b/mm/ksm.c
index 075123602bd0..3e0a0a42fa1f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1136,6 +1136,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
+	pmd_t pmde;
 	pte_t *ptep;
 	pte_t newpte;
 	spinlock_t *ptl;
@@ -1150,6 +1151,15 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pmd = mm_find_pmd(mm, addr);
 	if (!pmd)
 		goto out;
+	/*
+	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
+	 * without holding anon_vma lock for write.  So when looking for a
+	 * genuine pmde (in which to find pte), test present and !THP together.
+	 */
+	pmde = *pmd;
+	barrier();
+	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+		goto out;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
diff --git a/mm/rmap.c b/mm/rmap.c
index edc06c52bc82..af775855e58f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -767,13 +767,17 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
+/*
+ * Returns the actual pmd_t* where we expect 'address' to be mapped from, or
+ * NULL if it doesn't exist.  No guarantees / checks on what the pmd_t*
+ * represents.
+ */
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -788,15 +792,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
-	/*
-	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
-	 * without holding anon_vma lock for write.  So when looking for a
-	 * genuine pmde (in which to find pte), test present and !THP together.
-	 */
-	pmde = *pmd;
-	barrier();
-	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-		pmd = NULL;
 out:
 	return pmd;
 }
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 21:22   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 10/18] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are
independent of the others.  This implies a hugepage cannot cross a VMA
boundary.  If collapse of a given hugepage-aligned/sized region fails,
the operation may continue to attempt collapsing the remainder of memory
specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last
hugepage-aligned address covered by said range.  The memory ranges must
span at least one hugepage-sized region.
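
As a worked example (assuming a 2MiB PMD-sized hugepage, as on x86_64): a
range of [0x201000, 0x800000) would be clamped to [0x400000, 0x800000),
i.e. two hugepage-aligned/sized regions.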

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be
backed by memory (a PMD covering the address range must already exist).

Allocation for the new hugepage may enter direct reclaim and/or
compaction, regardless of VMA flags.  When the system has multiple NUMA
nodes, the hugepage will be allocated from the node providing the most
native pages.  This operation acts on the current state of the
specified process and makes no persistent changes or guarantees on how
pages will be mapped, constructed, or faulted in the future.

Return Value

If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful.  On success, process_madvise(2)
returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
is returned and errno is set to indicate the error for the most-recently
attempted hugepage collapse.  Note that many failures might have
occurred, since the operation may continue to collapse in the event a
single hugepage-sized/aligned region fails.

	ENOMEM	Memory allocation failed or VMA not found
	EBUSY	Memcg charging failed
	EAGAIN	Required resource temporarily unavailable.  Try again
		might succeed.
	EINVAL	Other error: No PMD found, subpage doesn't have Present
		bit set, "Special" page no backed by struct page, VMA
		incorrectly sized, address not page-aligned, ...

Most notable here are ENOMEM and EBUSY (new to madvise), which are
intended to provide the caller with actionable feedback so they may take
an appropriate fallback measure.
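
A minimal userspace sketch of the intended calling pattern (illustrative
only; MADV_COLLAPSE is defined locally in case installed headers predate
this series, using the asm-generic value from the uapi change below):

	#include <errno.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* asm-generic value from this series; parisc differs */
	#endif

	/* Try to (re-)back [addr, addr + len) with THPs; 0 on success, 1 = retry later, -1 = give up */
	static int try_collapse(void *addr, size_t len)
	{
		if (!madvise(addr, len, MADV_COLLAPSE))
			return 0;
		if (errno == ENOMEM || errno == EBUSY || errno == EAGAIN)
			return 1;	/* resource/transient failure: back off and retry */
		perror("madvise(MADV_COLLAPSE)");
		return -1;		/* e.g. EINVAL: not worth retrying */
	}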

Use Cases

An immediate user of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is such an
implementation that could benefit from this[2].

Only privately-mapped anon memory is supported for now, but additional
support for file, shmem, and HugeTLB high-granularity mappings[2] is
expected.  File and tmpfs/shmem support would permit:

* Backing executable text by THPs.  Current support provided by
  CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which
  can keep services from serving at their full rated load after
  (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
  immediately realize iTLB performance prevent page sharing and demand
  paging, both of which increase steady-state memory footprint.  With
  MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
  and lower RAM footprints.
* Backing guest memory by hugepages after the memory contents have been
  migrated in native-page-sized chunks to a new host, in a
  userfaultfd-based live-migration stack.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 arch/alpha/include/uapi/asm/mman.h           |   2 +
 arch/mips/include/uapi/asm/mman.h            |   2 +
 arch/parisc/include/uapi/asm/mman.h          |   2 +
 arch/xtensa/include/uapi/asm/mman.h          |   2 +
 include/linux/huge_mm.h                      |  14 ++-
 include/uapi/asm-generic/mman-common.h       |   2 +
 mm/khugepaged.c                              | 118 ++++++++++++++++++-
 mm/madvise.c                                 |   5 +
 tools/include/uapi/asm-generic/mman-common.h |   2 +
 9 files changed, 146 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 00312fc251c1..39193623442e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -218,6 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -361,9 +364,16 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
 {
-	BUG();
-	return 0;
+	return -EINVAL;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c7a09cc9a0e8..2b2d832e44f2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -976,7 +976,8 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
 	/* Only allocate from the target node */
-	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
+		     GFP_TRANSHUGE) | __GFP_THISNODE;
 	int node = khugepaged_find_target_node(cc);
 
 	if (!khugepaged_alloc_page(hpage, gfp, node))
@@ -2356,3 +2357,118 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	/*
+	 * MADV_COLLAPSE breaks from existing madvise(2) conventions to provide
+	 * actionable feedback to caller, so they may take an appropriate
+	 * fallback measure depending on the nature of the failure.
+	 */
+	switch (r) {
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+		return -ENOMEM;
+	case SCAN_CGROUP_CHARGE_FAIL:
+		return -EBUSY;
+	/* Resource temporary unavailable - trying again might succeed */
+	case SCAN_PAGE_LOCK:
+	case SCAN_PAGE_LRU:
+		return -EAGAIN;
+	/*
+	 * Other: Trying again likely not to succeed / error intrinsic to
+	 * specified memory range. khugepaged likely won't be able to collapse
+	 * either.
+	 */
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control *cc;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, last_fail = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
+	if (!cc)
+		return -ENOMEM;
+	cc->is_khugepaged = false;
+	cc->last_target_node = NUMA_NO_NODE;
+
+	*prev = vma;
+
+	/* TODO: Support file/shmem */
+	if (!vma->anon_vma || !vma_is_anonymous(vma))
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain_all();
+
+	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
+		int result = SCAN_FAIL;
+
+		if (!mmap_locked) {
+			cond_resched();
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma, cc);
+			if (result  != SCAN_SUCCEED) {
+				last_fail = result;
+				goto out_nolock;
+			}
+		}
+		mmap_assert_locked(mm);
+		memset(cc->node_load, 0, sizeof(cc->node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* Tell caller we dropped mmap_lock */
+
+		switch (result) {
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+			break;
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+		case SCAN_PAGE_LRU:
+			last_fail = result;
+			break;
+		default:
+			last_fail = result;
+			/* Other error, exit */
+			goto out_maybelock;
+		}
+	}
+
+out_maybelock:
+	/* Caller expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out_nolock:
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+
+	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
+			: madvise_collapse_errno(last_fail);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 851fa4e134bc..9f08e958ea86 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [mm-unstable v7 10/18] mm/khugepaged: rename prefix of shared collapse functions
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint Zach O'Keefe
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following functions are shared between khugepaged and madvise
collapse contexts.  Replace the "khugepaged_" prefix with a generic
"hpage_collapse_" prefix in such cases:

khugepaged_test_exit() -> hpage_collapse_test_exit()
khugepaged_scan_abort() -> hpage_collapse_scan_abort()
khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
khugepaged_find_target_node() -> hpage_collapse_find_target_node()
khugepaged_alloc_page() -> hpage_collapse_alloc_page()

The kernel ABI (e.g. the huge_memory:mm_khugepaged_scan_pmd tracepoint)
is unaltered.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 68 +++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b2d832e44f2..e0d00180512c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,7 +94,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() */
+	/* Last target selected in hpage_collapse_find_target_node() */
 	int last_target_node;
 };
 
@@ -438,7 +438,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -453,7 +453,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 		return;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return;
@@ -505,11 +505,10 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_lock.
+		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we return all
+		 * pagetables will be destroyed) until khugepaged has finished
+		 * working on the pagetables under the mmap_lock.
 		 */
 		mmap_write_lock(mm);
 		mmap_write_unlock(mm);
@@ -754,13 +753,12 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-
 struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 	.last_target_node = NUMA_NO_NODE,
 };
 
-static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
+static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -795,7 +793,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -819,13 +817,13 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
 #endif
 
-static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
@@ -850,7 +848,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 {
 	struct vm_area_struct *vma;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -913,7 +911,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Note that if false is returned, mmap_lock will be released.
@@ -978,9 +976,9 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 	/* Only allocate from the target node */
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE) | __GFP_THISNODE;
-	int node = khugepaged_find_target_node(cc);
+	int node = hpage_collapse_find_target_node(cc);
 
-	if (!khugepaged_alloc_page(hpage, gfp, node))
+	if (!hpage_collapse_alloc_page(hpage, gfp, node))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
@@ -1140,9 +1138,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, bool *mmap_locked,
-			       struct collapse_control *cc)
+static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long address, bool *mmap_locked,
+				   struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1234,7 +1233,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1313,7 +1312,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (hpage_collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1486,7 +1485,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1548,7 +1547,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
+			if (!hpage_collapse_test_exit(mm) &&
+			    !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -1975,7 +1975,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2069,7 +2069,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		goto breakouterloop;
 
 	address = khugepaged_scan.address;
@@ -2078,7 +2078,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(hpage_collapse_test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2099,7 +2099,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(hpage_collapse_test_exit(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2116,9 +2116,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				mmap_locked = false;
 				fput(file);
 			} else {
-				*result = khugepaged_scan_pmd(mm, vma,
-							      khugepaged_scan.address,
-							      &mmap_locked, cc);
+				*result = hpage_collapse_scan_pmd(mm, vma,
+								  khugepaged_scan.address,
+								  &mmap_locked,
+								  cc);
 			}
 			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
@@ -2148,7 +2149,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2432,7 +2433,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		}
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
-		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
+		result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
+						 cc);
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 10/18] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 21:32   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a tracepoint that exposes the mm, address, and enum scan_result of
each hugepage that madvise(MADV_COLLAPSE) attempts to collapse.
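
As a usage note (not part of the patch): the event lands under the huge_memory group, so it can be enabled through tracefs like any other tracepoint. A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (older setups use /sys/kernel/debug/tracing):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/kernel/tracing/events/huge_memory/"
				"mm_madvise_collapse/enable", "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		fputs("1", f);
		fclose(f);
		/* Then read /sys/kernel/tracing/trace_pipe while a workload
		 * calls madvise(MADV_COLLAPSE) to see mm/addr/result lines. */
		return 0;
	}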

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h | 22 ++++++++++++++++++++++
 mm/khugepaged.c                    |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 55392bf30a03..38d339ffdb16 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -167,5 +167,27 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->ret)
 );
 
+TRACE_EVENT(mm_madvise_collapse,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long addr, int result),
+
+	TP_ARGS(mm, addr, result),
+
+	TP_STRUCT__entry(__field(struct mm_struct *, mm)
+			 __field(unsigned long, addr)
+			 __field(int, result)
+	),
+
+	TP_fast_assign(__entry->mm = mm;
+		       __entry->addr = addr;
+		       __entry->result = result;
+	),
+
+	TP_printk("mm=%p addr=%#lx result=%s",
+		  __entry->mm,
+		  __entry->addr,
+		  __print_symbolic(__entry->result, SCAN_STATUS))
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e0d00180512c..0207fc0a5b2a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2438,6 +2438,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
+		trace_mm_madvise_collapse(mm, addr, result);
+
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-08 20:47   ` Andrew Morton
  2022-07-06 23:59 ` [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps Zach O'Keefe
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.

This is useful for the development of userspace agents that seek to
optimize THP utilization system-wide by using userspace signals to
prioritize what memory is most deserving of being THP-backed.
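
For illustration only, such an agent might wrap the call roughly as below; the collapse_remote() helper is hypothetical, and the sketch assumes libc headers that define SYS_process_madvise plus the MADV_COLLAPSE fallback used by the selftests:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value added by this series' uapi patch */
	#endif

	/*
	 * Hypothetical helper: ask the kernel to collapse [addr, addr + len)
	 * in the process referred to by pidfd.  Requires CAP_SYS_ADMIN unless
	 * pidfd refers to the calling process.
	 */
	int collapse_remote(int pidfd, void *addr, size_t len)
	{
		struct iovec iov = { .iov_base = addr, .iov_len = len };

		if (syscall(SYS_process_madvise, pidfd, &iov, 1,
			    MADV_COLLAPSE, 0) < 0) {
			perror("process_madvise(MADV_COLLAPSE)");
			return -1;
		}
		return 0;
	}

A pidfd for the target can be obtained with pidfd_open(2); process_madvise() returns the number of bytes advised on success, so only a negative return is treated as failure here.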

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9f08e958ea86..6fb6b7160bda 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1171,13 +1171,15 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1455,7 +1457,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-11 21:37   ` Yang Shi
  2022-07-06 23:59 ` [mm-unstable v7 14/18] selftests/vm: modularize collapse selftests Zach O'Keefe
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a PMDMappable field to smaps output which informs the user whether
memory in the VMA can be PMD-mapped by MADV_COLLAPSE.

The distinction from THPeligible is needed for two reasons:

1) For THP, MADV_COLLAPSE is not coupled to the THP sysfs controls that
   THPeligible reflects.

2) PMDMappable can also be used for HugeTLB fine-granularity mappings,
   which are independent of THP.  (A minimal userspace check of the new
   field is sketched after this list.)
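
To show how userspace might consume the new field, here is a minimal sketch (not part of the patch) that counts PMDMappable VMAs in a target process:

	#include <stdio.h>

	/*
	 * Illustrative only: count how many VMAs in /proc/<pid>/smaps report
	 * PMDMappable: 1, i.e. ranges MADV_COLLAPSE could operate on.
	 */
	int main(int argc, char **argv)
	{
		char path[64], line[256];
		int val, mappable = 0;
		FILE *fp;

		snprintf(path, sizeof(path), "/proc/%s/smaps",
			 argc > 1 ? argv[1] : "self");
		fp = fopen(path, "r");
		if (!fp) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), fp))
			if (sscanf(line, "PMDMappable: %d", &val) == 1 && val)
				mappable++;
		fclose(fp);
		printf("%d PMD-mappable VMAs\n", mappable);
		return 0;
	}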

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 Documentation/filesystems/proc.rst | 10 ++++++++--
 fs/proc/task_mmu.c                 |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 47e95dbc820d..f207903a57a5 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -466,6 +466,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
     MMUPageSize:           4 kB
     Locked:                0 kB
     THPeligible:           0
+    PMDMappable:           0
     VmFlags: rd ex mr mw me dw
 
 The first of these lines shows the same information as is displayed for the
@@ -518,9 +519,14 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
 
+"PMDMappable" indicates if the memory can be mapped by PMDs - 1 if true, 0
+otherwise.  It just shows the current status. Note that this is memory
+operable on explicitly by MADV_COLLAPSE.
+
 "THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+pages by the kernel, as well as whether the THP is PMD mappable - 1 if true,
+0 otherwise.  It just shows the current status.  Note this is memory the
+kernel can transparently provide as THPs.
 
 "VmFlags" field deserves a separate description. This member represents the
 kernel flags associated with the particular virtual memory area in two letter
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f8cd58846a28..29f2089456ba 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -867,6 +867,8 @@ static int show_smap(struct seq_file *m, void *v)
 
 	seq_printf(m, "THPeligible:    %d\n",
 		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
+	seq_printf(m, "PMDMappable:    %d\n",
+		   hugepage_vma_check(vma, vma->vm_flags, true, false, false));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 14/18] selftests/vm: modularize collapse selftests
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (12 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 15/18] selftests/vm: dedup hugepage allocation logic Zach O'Keefe
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize the collapse action of khugepaged collapse selftests by
introducing a struct collapse_context which specifies how to collapse a
given memory range and the expected semantics of the collapse.  This
can be reused later to test other collapse contexts.

Additionally, all tests have logic that checks whether a collapse
occurred by reading /proc/self/smaps, and reports if this differs from
what was expected.  Move this logic into the per-context ->collapse()
hook instead of repeating it in every test.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 251 +++++++++++-------------
 1 file changed, 110 insertions(+), 141 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 155120b67a16..0f1bee0eff24 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -23,6 +23,11 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct collapse_context {
+	void (*collapse)(const char *msg, char *p, bool expect);
+	bool enforce_pte_scan_limits;
+};
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -501,6 +506,21 @@ static bool wait_for_scan(const char *msg, char *p)
 	return timeout == -1;
 }
 
+static void khugepaged_collapse(const char *msg, char *p, bool expect)
+{
+	if (wait_for_scan(msg, p)) {
+		if (expect)
+			fail("Timeout");
+		else
+			success("OK");
+		return;
+	} else if (check_huge(p) == expect) {
+		success("OK");
+	} else {
+		fail("Fail");
+	}
+}
+
 static void alloc_at_fault(void)
 {
 	struct settings settings = default_settings;
@@ -528,53 +548,39 @@ static void alloc_at_fault(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(void)
+static void collapse_full(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, hpage_pmd_size);
-	if (wait_for_scan("Collapse fully populated PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse fully populated PTE table", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_empty(void)
+static void collapse_empty(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
-	if (wait_for_scan("Do not collapse empty PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Do not collapse empty PTE table", p, false);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(void)
+static void collapse_single_pte_entry(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, page_size);
-	if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table with single PTE entry present", p,
+		    true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(void)
+static void collapse_max_ptes_none(struct collapse_context *c)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = default_settings;
@@ -586,28 +592,22 @@ static void collapse_max_ptes_none(void)
 	p = alloc_mapping();
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	if (wait_for_scan("Do not collapse with max_ptes_none exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Maybe collapse with max_ptes_none exceeded", p,
+		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-	if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+	if (c->enforce_pte_scan_limits) {
+		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		c->collapse("Collapse with max_ptes_none PTEs empty", p, true);
+		validate_memory(p, 0,
+				(hpage_pmd_nr - max_ptes_none) * page_size);
+	}
 
 	munmap(p, hpage_pmd_size);
 	write_settings(&default_settings);
 }
 
-static void collapse_swapin_single_pte(void)
+static void collapse_swapin_single_pte(struct collapse_context *c)
 {
 	void *p;
 	p = alloc_mapping();
@@ -625,18 +625,13 @@ static void collapse_swapin_single_pte(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Collapse with swapping in single PTE entry", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse with swapping in single PTE entry", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(void)
+static void collapse_max_ptes_swap(struct collapse_context *c)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
@@ -656,39 +651,34 @@ static void collapse_max_ptes_swap(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Do not collapse with max_ptes_swap exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Maybe collapse with max_ptes_swap exceeded", p,
+		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
-	printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
-	if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
-		perror("madvise(MADV_PAGEOUT)");
-		exit(EXIT_FAILURE);
-	}
-	if (check_swap(p, max_ptes_swap * page_size)) {
-		success("OK");
-	} else {
-		fail("Fail");
-		goto out;
-	}
+	if (c->enforce_pte_scan_limits) {
+		fill_memory(p, 0, hpage_pmd_size);
+		printf("Swapout %d of %d pages...", max_ptes_swap,
+		       hpage_pmd_nr);
+		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+			perror("madvise(MADV_PAGEOUT)");
+			exit(EXIT_FAILURE);
+		}
+		if (check_swap(p, max_ptes_swap * page_size)) {
+			success("OK");
+		} else {
+			fail("Fail");
+			goto out;
+		}
 
-	if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, hpage_pmd_size);
+		c->collapse("Collapse with max_ptes_swap pages swapped out", p,
+			    true);
+		validate_memory(p, 0, hpage_pmd_size);
+	}
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(void)
+static void collapse_single_pte_entry_compound(struct collapse_context *c)
 {
 	void *p;
 
@@ -710,17 +700,13 @@ static void collapse_single_pte_entry_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table with single PTE mapping compound page", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table with single PTE mapping compound page",
+		    p, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(void)
+static void collapse_full_of_compound(struct collapse_context *c)
 {
 	void *p;
 
@@ -742,17 +728,12 @@ static void collapse_full_of_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table full of compound pages", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(void)
+static void collapse_compound_extreme(struct collapse_context *c)
 {
 	void *p;
 	int i;
@@ -798,18 +779,14 @@ static void collapse_compound_extreme(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of different compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table full of different compound pages", p,
+		    true);
 
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork(void)
+static void collapse_fork(struct collapse_context *c)
 {
 	int wstatus;
 	void *p;
@@ -835,13 +812,8 @@ static void collapse_fork(void)
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
-
-		if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Collapse PTE table with single page shared with parent process",
+			    p, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -860,7 +832,7 @@ static void collapse_fork(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(void)
+static void collapse_fork_compound(struct collapse_context *c)
 {
 	int wstatus;
 	void *p;
@@ -896,14 +868,10 @@ static void collapse_fork_compound(void)
 		fill_memory(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
-		if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Collapse PTE table full of compound pages in child",
+			    p, true);
 		write_num("khugepaged/max_ptes_shared",
-				default_settings.khugepaged.max_ptes_shared);
+			  default_settings.khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -922,7 +890,7 @@ static void collapse_fork_compound(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared()
+static void collapse_max_ptes_shared(struct collapse_context *c)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
@@ -957,28 +925,22 @@ static void collapse_max_ptes_shared()
 		else
 			fail("Fail");
 
-		if (wait_for_scan("Do not collapse with max_ptes_shared exceeded", p))
-			fail("Timeout");
-		else if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
-		if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-
-		if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Maybe collapse with max_ptes_shared exceeded", p,
+			    !c->enforce_pte_scan_limits);
+
+		if (c->enforce_pte_scan_limits) {
+			printf("Trigger CoW on page %d of %d...",
+			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+				    page_size);
+			if (!check_huge(p))
+				success("OK");
+			else
+				fail("Fail");
+
+			c->collapse("Collapse with max_ptes_shared PTEs shared",
+				    p, true);
+		}
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -999,6 +961,8 @@ static void collapse_max_ptes_shared()
 
 int main(void)
 {
+	struct collapse_context c;
+
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
@@ -1014,18 +978,23 @@ int main(void)
 	adjust_settings();
 
 	alloc_at_fault();
-	collapse_full();
-	collapse_empty();
-	collapse_single_pte_entry();
-	collapse_max_ptes_none();
-	collapse_swapin_single_pte();
-	collapse_max_ptes_swap();
-	collapse_single_pte_entry_compound();
-	collapse_full_of_compound();
-	collapse_compound_extreme();
-	collapse_fork();
-	collapse_fork_compound();
-	collapse_max_ptes_shared();
+
+	printf("\n*** Testing context: khugepaged ***\n");
+	c.collapse = &khugepaged_collapse;
+	c.enforce_pte_scan_limits = true;
+
+	collapse_full(&c);
+	collapse_empty(&c);
+	collapse_single_pte_entry(&c);
+	collapse_max_ptes_none(&c);
+	collapse_swapin_single_pte(&c);
+	collapse_max_ptes_swap(&c);
+	collapse_single_pte_entry_compound(&c);
+	collapse_full_of_compound(&c);
+	collapse_compound_extreme(&c);
+	collapse_fork(&c);
+	collapse_fork_compound(&c);
+	collapse_max_ptes_shared(&c);
 
 	restore_settings(0);
 }
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 15/18] selftests/vm: dedup hugepage allocation logic
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (13 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 14/18] selftests/vm: modularize collapse selftests Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 16/18] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The code

	p = alloc_mapping();
	printf("Allocate huge page...");
	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
	fill_memory(p, 0, hpage_pmd_size);
	if (check_huge(p))
		success("OK");
	else
		fail("Fail");

is repeated many times in different tests.  Add a helper, alloc_hpage(),
to handle this.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 62 +++++++++----------------
 1 file changed, 23 insertions(+), 39 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 0f1bee0eff24..eb6f5bbacff1 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -461,6 +461,25 @@ static void fill_memory(int *p, unsigned long start, unsigned long end)
 		p[i * page_size / sizeof(*p)] = i + 0xdead0000;
 }
 
+/*
+ * Returns pmd-mapped hugepage in VMA marked VM_HUGEPAGE, filled with
+ * validate_memory()'able contents.
+ */
+static void *alloc_hpage(void)
+{
+	void *p;
+
+	p = alloc_mapping();
+	printf("Allocate huge page...");
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	fill_memory(p, 0, hpage_pmd_size);
+	if (check_huge(p))
+		success("OK");
+	else
+		fail("Fail");
+	return p;
+}
+
 static void validate_memory(int *p, unsigned long start, unsigned long end)
 {
 	int i;
@@ -682,15 +701,7 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c)
 {
 	void *p;
 
-	p = alloc_mapping();
-
-	printf("Allocate huge page...");
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	p = alloc_hpage();
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
 
 	printf("Split huge page leaving single PTE mapping compound page...");
@@ -710,16 +721,7 @@ static void collapse_full_of_compound(struct collapse_context *c)
 {
 	void *p;
 
-	p = alloc_mapping();
-
-	printf("Allocate huge page...");
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-
+	p = alloc_hpage();
 	printf("Split huge page leaving single PTE page table full of compound pages...");
 	madvise(p, page_size, MADV_NOHUGEPAGE);
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
@@ -837,16 +839,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 	int wstatus;
 	void *p;
 
-	p = alloc_mapping();
-
-	printf("Allocate huge page...");
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-
+	p = alloc_hpage();
 	printf("Share huge page over fork()...");
 	if (!fork()) {
 		/* Do not touch settings on child exit */
@@ -896,16 +889,7 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	int wstatus;
 	void *p;
 
-	p = alloc_mapping();
-
-	printf("Allocate huge page...");
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-
+	p = alloc_hpage();
 	printf("Share huge page over fork()...");
 	if (!fork()) {
 		/* Do not touch settings on child exit */
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 16/18] selftests/vm: add MADV_COLLAPSE collapse context to selftests
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (14 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 15/18] selftests/vm: dedup hugepage allocation logic Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 17/18] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a madvise collapse context to the hugepage collapse selftests.  This
context is tested with /sys/kernel/mm/transparent_hugepage/enabled set
to "never" in order to avoid unwanted interaction with khugepaged during
testing.

Also, refactor updates to sysfs THP settings using a stack so that the
THP settings from nested callers can be restored.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 171 +++++++++++++++++-------
 1 file changed, 125 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index eb6f5bbacff1..780f04440e15 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -14,6 +14,9 @@
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
 
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
@@ -95,18 +98,6 @@ struct settings {
 	struct khugepaged_settings khugepaged;
 };
 
-static struct settings default_settings = {
-	.thp_enabled = THP_MADVISE,
-	.thp_defrag = THP_DEFRAG_ALWAYS,
-	.shmem_enabled = SHMEM_NEVER,
-	.use_zero_page = 0,
-	.khugepaged = {
-		.defrag = 1,
-		.alloc_sleep_millisecs = 10,
-		.scan_sleep_millisecs = 10,
-	},
-};
-
 static struct settings saved_settings;
 static bool skip_settings_restore;
 
@@ -284,6 +275,39 @@ static void write_settings(struct settings *settings)
 	write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
 }
 
+#define MAX_SETTINGS_DEPTH 4
+static struct settings settings_stack[MAX_SETTINGS_DEPTH];
+static int settings_index;
+
+static struct settings *current_settings(void)
+{
+	if (!settings_index) {
+		printf("Fail: No settings set");
+		exit(EXIT_FAILURE);
+	}
+	return settings_stack + settings_index - 1;
+}
+
+static void push_settings(struct settings *settings)
+{
+	if (settings_index >= MAX_SETTINGS_DEPTH) {
+		printf("Fail: Settings stack exceeded");
+		exit(EXIT_FAILURE);
+	}
+	settings_stack[settings_index++] = *settings;
+	write_settings(current_settings());
+}
+
+static void pop_settings(void)
+{
+	if (settings_index <= 0) {
+		printf("Fail: Settings stack empty");
+		exit(EXIT_FAILURE);
+	}
+	--settings_index;
+	write_settings(current_settings());
+}
+
 static void restore_settings(int sig)
 {
 	if (skip_settings_restore)
@@ -327,14 +351,6 @@ static void save_settings(void)
 	signal(SIGQUIT, restore_settings);
 }
 
-static void adjust_settings(void)
-{
-
-	printf("Adjust settings...");
-	write_settings(&default_settings);
-	success("OK");
-}
-
 #define MAX_LINE_LENGTH 500
 
 static bool check_for_pattern(FILE *fp, char *pattern, char *buf)
@@ -493,6 +509,38 @@ static void validate_memory(int *p, unsigned long start, unsigned long end)
 	}
 }
 
+static void madvise_collapse(const char *msg, char *p, bool expect)
+{
+	int ret;
+	struct settings settings = *current_settings();
+
+	printf("%s...", msg);
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/*
+	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
+	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
+	 */
+	settings.thp_enabled = THP_NEVER;
+	push_settings(&settings);
+
+	/* Clear VM_NOHUGEPAGE */
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (((bool)ret) == expect)
+		fail("Fail: Bad return value");
+	else if (check_huge(p) != expect)
+		fail("Fail: check_huge()");
+	else
+		success("OK");
+
+	pop_settings();
+}
+
 #define TICK 500000
 static bool wait_for_scan(const char *msg, char *p)
 {
@@ -542,11 +590,11 @@ static void khugepaged_collapse(const char *msg, char *p, bool expect)
 
 static void alloc_at_fault(void)
 {
-	struct settings settings = default_settings;
+	struct settings settings = *current_settings();
 	char *p;
 
 	settings.thp_enabled = THP_ALWAYS;
-	write_settings(&settings);
+	push_settings(&settings);
 
 	p = alloc_mapping();
 	*p = 1;
@@ -556,7 +604,7 @@ static void alloc_at_fault(void)
 	else
 		fail("Fail");
 
-	write_settings(&default_settings);
+	pop_settings();
 
 	madvise(p, page_size, MADV_DONTNEED);
 	printf("Split huge PMD on MADV_DONTNEED...");
@@ -602,11 +650,11 @@ static void collapse_single_pte_entry(struct collapse_context *c)
 static void collapse_max_ptes_none(struct collapse_context *c)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
-	struct settings settings = default_settings;
+	struct settings settings = *current_settings();
 	void *p;
 
 	settings.khugepaged.max_ptes_none = max_ptes_none;
-	write_settings(&settings);
+	push_settings(&settings);
 
 	p = alloc_mapping();
 
@@ -623,7 +671,7 @@ static void collapse_max_ptes_none(struct collapse_context *c)
 	}
 
 	munmap(p, hpage_pmd_size);
-	write_settings(&default_settings);
+	pop_settings();
 }
 
 static void collapse_swapin_single_pte(struct collapse_context *c)
@@ -703,7 +751,6 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c)
 
 	p = alloc_hpage();
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-
 	printf("Split huge page leaving single PTE mapping compound page...");
 	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
 	if (!check_huge(p))
@@ -864,7 +911,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 		c->collapse("Collapse PTE table full of compound pages in child",
 			    p, true);
 		write_num("khugepaged/max_ptes_shared",
-			  default_settings.khugepaged.max_ptes_shared);
+			  current_settings()->khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -943,9 +990,21 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	munmap(p, hpage_pmd_size);
 }
 
-int main(void)
+int main(int argc, const char **argv)
 {
 	struct collapse_context c;
+	struct settings default_settings = {
+		.thp_enabled = THP_MADVISE,
+		.thp_defrag = THP_DEFRAG_ALWAYS,
+		.shmem_enabled = SHMEM_NEVER,
+		.use_zero_page = 0,
+		.khugepaged = {
+			.defrag = 1,
+			.alloc_sleep_millisecs = 10,
+			.scan_sleep_millisecs = 10,
+		},
+	};
+	const char *tests = argc == 1 ? "all" : argv[1];
 
 	setbuf(stdout, NULL);
 
@@ -959,26 +1018,46 @@ int main(void)
 	default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
 
 	save_settings();
-	adjust_settings();
+	push_settings(&default_settings);
 
 	alloc_at_fault();
 
-	printf("\n*** Testing context: khugepaged ***\n");
-	c.collapse = &khugepaged_collapse;
-	c.enforce_pte_scan_limits = true;
-
-	collapse_full(&c);
-	collapse_empty(&c);
-	collapse_single_pte_entry(&c);
-	collapse_max_ptes_none(&c);
-	collapse_swapin_single_pte(&c);
-	collapse_max_ptes_swap(&c);
-	collapse_single_pte_entry_compound(&c);
-	collapse_full_of_compound(&c);
-	collapse_compound_extreme(&c);
-	collapse_fork(&c);
-	collapse_fork_compound(&c);
-	collapse_max_ptes_shared(&c);
+	if (!strcmp(tests, "khugepaged") || !strcmp(tests, "all")) {
+		printf("\n*** Testing context: khugepaged ***\n");
+		c.collapse = &khugepaged_collapse;
+		c.enforce_pte_scan_limits = true;
+
+		collapse_full(&c);
+		collapse_empty(&c);
+		collapse_single_pte_entry(&c);
+		collapse_max_ptes_none(&c);
+		collapse_swapin_single_pte(&c);
+		collapse_max_ptes_swap(&c);
+		collapse_single_pte_entry_compound(&c);
+		collapse_full_of_compound(&c);
+		collapse_compound_extreme(&c);
+		collapse_fork(&c);
+		collapse_fork_compound(&c);
+		collapse_max_ptes_shared(&c);
+	}
+	if (!strcmp(tests, "madvise") || !strcmp(tests, "all")) {
+		printf("\n*** Testing context: madvise ***\n");
+		c.collapse = &madvise_collapse;
+		c.enforce_pte_scan_limits = false;
+
+		collapse_full(&c);
+		collapse_empty(&c);
+		collapse_single_pte_entry(&c);
+		collapse_max_ptes_none(&c);
+		collapse_swapin_single_pte(&c);
+		collapse_max_ptes_swap(&c);
+		collapse_single_pte_entry_compound(&c);
+		collapse_full_of_compound(&c);
+		collapse_compound_extreme(&c);
+		collapse_fork(&c);
+		collapse_fork_compound(&c);
+		collapse_max_ptes_shared(&c);
+	}
 
 	restore_settings(0);
 }
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 17/18] selftests/vm: add selftest to verify recollapse of THPs
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (15 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 16/18] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-06 23:59 ` [mm-unstable v7 18/18] selftests/vm: add selftest to verify multi THP collapse Zach O'Keefe
  2022-07-14 18:55 ` [RFC] mm: userspace hugepage collapse: file/shmem semantics Zach O'Keefe
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a selftest specific to the madvise collapse context that tests that
MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is
already pmd-mapped.

This test also verifies that MADV_COLLAPSE can collapse memory into THPs
even in "madvise" THP mode when the memory isn't marked VM_HUGEPAGE.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 31 +++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 780f04440e15..87cd0b99477f 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -990,6 +990,36 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse_existing_thps(void)
+{
+	void *p;
+	int err;
+
+	p = alloc_mapping();
+	fill_memory(p, 0, hpage_pmd_size);
+
+	printf("Collapse fully populated PTE table...");
+	/*
+	 * Note that we don't set MADV_HUGEPAGE here, which
+	 * also tests that VM_HUGEPAGE isn't required for
+	 * MADV_COLLAPSE in "madvise" mode.
+	 */
+	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (err == 0 && check_huge(p)) {
+		success("OK");
+		printf("Re-collapse PMD-mapped hugepage");
+		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+		if (err == 0 && check_huge(p))
+			success("OK");
+		else
+			fail("Fail");
+	} else {
+		fail("Fail");
+	}
+	validate_memory(p, 0, hpage_pmd_size);
+	munmap(p, hpage_pmd_size);
+}
+
 int main(int argc, const char **argv)
 {
 	struct collapse_context c;
@@ -1057,6 +1087,7 @@ int main(int argc, const char **argv)
 		collapse_fork(&c);
 		collapse_fork_compound(&c);
 		collapse_max_ptes_shared(&c);
+		madvise_collapse_existing_thps();
 	}
 
 	restore_settings(0);
-- 
2.37.0.rc0.161.g10f37bed90-goog




* [mm-unstable v7 18/18] selftests/vm: add selftest to verify multi THP collapse
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (16 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 17/18] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
@ 2022-07-06 23:59 ` Zach O'Keefe
  2022-07-14 18:55 ` [RFC] mm: userspace hugepage collapse: file/shmem semantics Zach O'Keefe
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-06 23:59 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add support to allocate and verify collapse of multiple hugepage-sized
regions into multiple THPs.

Add "nr" argument to check_huge() that instructs check_huge() to check
for exactly "nr_hpages" THPs.  This has the added benefit of now being
able to check for exactly 0 THPs, and so callsites that previously
checked the negation of exactly 1 THP are now more correct.

The ->collapse hook of struct collapse_context has been expanded with a
"nr_hpages" argument to collapse "nr_hpages" hugepages.  The
collapse_full() test has been repurposed to collapse 4 THPs at once.  It
is expected that more tests will want to exercise multi-THP collapse
(e.g. file/shmem).

This is of particular benefit to the madvise collapse context, given
that it may do many THP collapses during a single syscall.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 140 ++++++++++++------------
 1 file changed, 73 insertions(+), 67 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 87cd0b99477f..b77b1e28cdb3 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -27,7 +27,7 @@ static int hpage_pmd_nr;
 #define PID_SMAPS "/proc/self/smaps"
 
 struct collapse_context {
-	void (*collapse)(const char *msg, char *p, bool expect);
+	void (*collapse)(const char *msg, char *p, int nr_hpages, bool expect);
 	bool enforce_pte_scan_limits;
 };
 
@@ -362,7 +362,7 @@ static bool check_for_pattern(FILE *fp, char *pattern, char *buf)
 	return false;
 }
 
-static bool check_huge(void *addr)
+static bool check_huge(void *addr, int nr_hpages)
 {
 	bool thp = false;
 	int ret;
@@ -387,7 +387,7 @@ static bool check_huge(void *addr)
 		goto err_out;
 
 	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "AnonHugePages:%10ld kB",
-		       hpage_pmd_size >> 10);
+		       nr_hpages * (hpage_pmd_size >> 10));
 	if (ret >= MAX_LINE_LENGTH) {
 		printf("%s: Pattern is too long\n", __func__);
 		exit(EXIT_FAILURE);
@@ -455,12 +455,12 @@ static bool check_swap(void *addr, unsigned long size)
 	return swap;
 }
 
-static void *alloc_mapping(void)
+static void *alloc_mapping(int nr)
 {
 	void *p;
 
-	p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	p = mmap(BASE_ADDR, nr * hpage_pmd_size, PROT_READ | PROT_WRITE,
+		 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
 	if (p != BASE_ADDR) {
 		printf("Failed to allocate VMA at %p\n", BASE_ADDR);
 		exit(EXIT_FAILURE);
@@ -485,11 +485,11 @@ static void *alloc_hpage(void)
 {
 	void *p;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	printf("Allocate huge page...");
 	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
 	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p))
+	if (check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
@@ -509,14 +509,15 @@ static void validate_memory(int *p, unsigned long start, unsigned long end)
 	}
 }
 
-static void madvise_collapse(const char *msg, char *p, bool expect)
+static void madvise_collapse(const char *msg, char *p, int nr_hpages,
+			     bool expect)
 {
 	int ret;
 	struct settings settings = *current_settings();
 
 	printf("%s...", msg);
 	/* Sanity check */
-	if (check_huge(p)) {
+	if (!check_huge(p, 0)) {
 		printf("Unexpected huge page\n");
 		exit(EXIT_FAILURE);
 	}
@@ -529,11 +530,11 @@ static void madvise_collapse(const char *msg, char *p, bool expect)
 	push_settings(&settings);
 
 	/* Clear VM_NOHUGEPAGE */
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, nr_hpages * hpage_pmd_size, MADV_COLLAPSE);
 	if (((bool)ret) == expect)
 		fail("Fail: Bad return value");
-	else if (check_huge(p) != expect)
+	else if (check_huge(p, nr_hpages) != expect)
 		fail("Fail: check_huge()");
 	else
 		success("OK");
@@ -542,25 +543,25 @@ static void madvise_collapse(const char *msg, char *p, bool expect)
 }
 
 #define TICK 500000
-static bool wait_for_scan(const char *msg, char *p)
+static bool wait_for_scan(const char *msg, char *p, int nr_hpages)
 {
 	int full_scans;
 	int timeout = 6; /* 3 seconds */
 
 	/* Sanity check */
-	if (check_huge(p)) {
+	if (!check_huge(p, 0)) {
 		printf("Unexpected huge page\n");
 		exit(EXIT_FAILURE);
 	}
 
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE);
 
 	/* Wait until the second full_scan completed */
 	full_scans = read_num("khugepaged/full_scans") + 2;
 
 	printf("%s...", msg);
 	while (timeout--) {
-		if (check_huge(p))
+		if (check_huge(p, nr_hpages))
 			break;
 		if (read_num("khugepaged/full_scans") >= full_scans)
 			break;
@@ -568,20 +569,21 @@ static bool wait_for_scan(const char *msg, char *p)
 		usleep(TICK);
 	}
 
-	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+	madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE);
 
 	return timeout == -1;
 }
 
-static void khugepaged_collapse(const char *msg, char *p, bool expect)
+static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
+				bool expect)
 {
-	if (wait_for_scan(msg, p)) {
+	if (wait_for_scan(msg, p, nr_hpages)) {
 		if (expect)
 			fail("Timeout");
 		else
 			success("OK");
 		return;
-	} else if (check_huge(p) == expect) {
+	} else if (check_huge(p, nr_hpages) == expect) {
 		success("OK");
 	} else {
 		fail("Fail");
@@ -596,10 +598,10 @@ static void alloc_at_fault(void)
 	settings.thp_enabled = THP_ALWAYS;
 	push_settings(&settings);
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	*p = 1;
 	printf("Allocate huge page on fault...");
-	if (check_huge(p))
+	if (check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
@@ -608,7 +610,7 @@ static void alloc_at_fault(void)
 
 	madvise(p, page_size, MADV_DONTNEED);
 	printf("Split huge PMD on MADV_DONTNEED...");
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
@@ -618,20 +620,23 @@ static void alloc_at_fault(void)
 static void collapse_full(struct collapse_context *c)
 {
 	void *p;
+	int nr_hpages = 4;
+	unsigned long size = nr_hpages * hpage_pmd_size;
 
-	p = alloc_mapping();
-	fill_memory(p, 0, hpage_pmd_size);
-	c->collapse("Collapse fully populated PTE table", p, true);
-	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	p = alloc_mapping(nr_hpages);
+	fill_memory(p, 0, size);
+	c->collapse("Collapse multiple fully populated PTE table", p, nr_hpages,
+		    true);
+	validate_memory(p, 0, size);
+	munmap(p, size);
 }
 
 static void collapse_empty(struct collapse_context *c)
 {
 	void *p;
 
-	p = alloc_mapping();
-	c->collapse("Do not collapse empty PTE table", p, false);
+	p = alloc_mapping(1);
+	c->collapse("Do not collapse empty PTE table", p, 1, false);
 	munmap(p, hpage_pmd_size);
 }
 
@@ -639,10 +644,10 @@ static void collapse_single_pte_entry(struct collapse_context *c)
 {
 	void *p;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	fill_memory(p, 0, page_size);
 	c->collapse("Collapse PTE table with single PTE entry present", p,
-		    true);
+		    1, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
@@ -656,16 +661,17 @@ static void collapse_max_ptes_none(struct collapse_context *c)
 	settings.khugepaged.max_ptes_none = max_ptes_none;
 	push_settings(&settings);
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	c->collapse("Maybe collapse with max_ptes_none exceeded", p,
+	c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1,
 		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
 	if (c->enforce_pte_scan_limits) {
 		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-		c->collapse("Collapse with max_ptes_none PTEs empty", p, true);
+		c->collapse("Collapse with max_ptes_none PTEs empty", p, 1,
+			    true);
 		validate_memory(p, 0,
 				(hpage_pmd_nr - max_ptes_none) * page_size);
 	}
@@ -677,7 +683,7 @@ static void collapse_max_ptes_none(struct collapse_context *c)
 static void collapse_swapin_single_pte(struct collapse_context *c)
 {
 	void *p;
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	fill_memory(p, 0, hpage_pmd_size);
 
 	printf("Swapout one page...");
@@ -692,7 +698,7 @@ static void collapse_swapin_single_pte(struct collapse_context *c)
 		goto out;
 	}
 
-	c->collapse("Collapse with swapping in single PTE entry", p, true);
+	c->collapse("Collapse with swapping in single PTE entry", p, 1, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
@@ -703,7 +709,7 @@ static void collapse_max_ptes_swap(struct collapse_context *c)
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 
 	fill_memory(p, 0, hpage_pmd_size);
 	printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr);
@@ -718,7 +724,7 @@ static void collapse_max_ptes_swap(struct collapse_context *c)
 		goto out;
 	}
 
-	c->collapse("Maybe collapse with max_ptes_swap exceeded", p,
+	c->collapse("Maybe collapse with max_ptes_swap exceeded", p, 1,
 		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
@@ -738,7 +744,7 @@ static void collapse_max_ptes_swap(struct collapse_context *c)
 		}
 
 		c->collapse("Collapse with max_ptes_swap pages swapped out", p,
-			    true);
+			    1, true);
 		validate_memory(p, 0, hpage_pmd_size);
 	}
 out:
@@ -753,13 +759,13 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c)
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
 	printf("Split huge page leaving single PTE mapping compound page...");
 	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 
 	c->collapse("Collapse PTE table with single PTE mapping compound page",
-		    p, true);
+		    p, 1, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
@@ -772,12 +778,12 @@ static void collapse_full_of_compound(struct collapse_context *c)
 	printf("Split huge page leaving single PTE page table full of compound pages...");
 	madvise(p, page_size, MADV_NOHUGEPAGE);
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 
-	c->collapse("Collapse PTE table full of compound pages", p, true);
+	c->collapse("Collapse PTE table full of compound pages", p, 1, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
@@ -787,14 +793,14 @@ static void collapse_compound_extreme(struct collapse_context *c)
 	void *p;
 	int i;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	for (i = 0; i < hpage_pmd_nr; i++) {
 		printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...",
 				i + 1, hpage_pmd_nr);
 
 		madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE);
 		fill_memory(BASE_ADDR, 0, hpage_pmd_size);
-		if (!check_huge(BASE_ADDR)) {
+		if (!check_huge(BASE_ADDR, 1)) {
 			printf("Failed to allocate huge page\n");
 			exit(EXIT_FAILURE);
 		}
@@ -823,12 +829,12 @@ static void collapse_compound_extreme(struct collapse_context *c)
 
 	munmap(BASE_ADDR, hpage_pmd_size);
 	fill_memory(p, 0, hpage_pmd_size);
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 
-	c->collapse("Collapse PTE table full of different compound pages", p,
+	c->collapse("Collapse PTE table full of different compound pages", p, 1,
 		    true);
 
 	validate_memory(p, 0, hpage_pmd_size);
@@ -840,11 +846,11 @@ static void collapse_fork(struct collapse_context *c)
 	int wstatus;
 	void *p;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 
 	printf("Allocate small page...");
 	fill_memory(p, 0, page_size);
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
@@ -855,14 +861,14 @@ static void collapse_fork(struct collapse_context *c)
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (!check_huge(p))
+		if (check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
 		c->collapse("Collapse PTE table with single page shared with parent process",
-			    p, true);
+			    p, 1, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -873,7 +879,7 @@ static void collapse_fork(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has small page...");
-	if (!check_huge(p))
+	if (check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
@@ -893,7 +899,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (check_huge(p))
+		if (check_huge(p, 1))
 			success("OK");
 		else
 			fail("Fail");
@@ -901,7 +907,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 		printf("Split huge page PMD in child process...");
 		madvise(p, page_size, MADV_NOHUGEPAGE);
 		madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-		if (!check_huge(p))
+		if (check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
@@ -909,7 +915,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
 		c->collapse("Collapse PTE table full of compound pages in child",
-			    p, true);
+			    p, 1, true);
 		write_num("khugepaged/max_ptes_shared",
 			  current_settings()->khugepaged.max_ptes_shared);
 
@@ -922,7 +928,7 @@ static void collapse_fork_compound(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has huge page...");
-	if (check_huge(p))
+	if (check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
@@ -943,7 +949,7 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (check_huge(p))
+		if (check_huge(p, 1))
 			success("OK");
 		else
 			fail("Fail");
@@ -951,26 +957,26 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 		printf("Trigger CoW on page %d of %d...",
 				hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
 		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
-		if (!check_huge(p))
+		if (check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
 
 		c->collapse("Maybe collapse with max_ptes_shared exceeded", p,
-			    !c->enforce_pte_scan_limits);
+			    1, !c->enforce_pte_scan_limits);
 
 		if (c->enforce_pte_scan_limits) {
 			printf("Trigger CoW on page %d of %d...",
 			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
 			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
 				    page_size);
-			if (!check_huge(p))
+			if (check_huge(p, 0))
 				success("OK");
 			else
 				fail("Fail");
 
 			c->collapse("Collapse with max_ptes_shared PTEs shared",
-				    p, true);
+				    p, 1,  true);
 		}
 
 		validate_memory(p, 0, hpage_pmd_size);
@@ -982,7 +988,7 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has huge page...");
-	if (check_huge(p))
+	if (check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
@@ -995,7 +1001,7 @@ static void madvise_collapse_existing_thps(void)
 	void *p;
 	int err;
 
-	p = alloc_mapping();
+	p = alloc_mapping(1);
 	fill_memory(p, 0, hpage_pmd_size);
 
 	printf("Collapse fully populated PTE table...");
@@ -1005,11 +1011,11 @@ static void madvise_collapse_existing_thps(void)
 	 * MADV_COLLAPSE in "madvise" mode.
 	 */
 	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
-	if (err == 0 && check_huge(p)) {
+	if (err == 0 && check_huge(p, 1)) {
 		success("OK");
 		printf("Re-collapse PMD-mapped hugepage");
 		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
-		if (err == 0 && check_huge(p))
+		if (err == 0 && check_huge(p, 1))
 			success("OK");
 		else
 			fail("Fail");
-- 
2.37.0.rc0.161.g10f37bed90-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-07-06 23:59 ` [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-07-08 20:47   ` Andrew Morton
  2022-07-13  1:05     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2022-07-08 20:47 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of its own memory.

This is maximally restrictive.  I didn't see any discussion of why this
was chosen either here or in the [0/N].  I expect that people will be
coming after us to relax this.

So please do add (a lot of) words explaining this decision, and
describing what might be done in the future to relax it.
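
For context, the restriction being asked about amounts to roughly the
following (a hypothetical sketch of the policy the changelog describes, not
the patch's actual code; the helper placement and errno are assumptions):

	/* in the process_madvise(2) path, where mm is the target process' mm */
	if (behavior == MADV_COLLAPSE &&
	    mm != current->mm && !capable(CAP_SYS_ADMIN))
		return -EPERM;	/* errno choice illustrative only */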


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-06 23:59 ` [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-07-08 21:01   ` Andrew Morton
  2022-07-11 18:29     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2022-07-08 21:01 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed,  6 Jul 2022 16:59:21 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> Modularize hugepage collapse by introducing struct collapse_control.
> This structure serves to describe the properties of the requested
> collapse, as well as serve as a local scratch pad to use during the
> collapse itself.
> 
> Start by moving global per-node khugepaged statistics into this
> new structure.  Note that this structure is still statically allocated
> since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
> a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
>  1 file changed, 50 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 196eaadbf415..f1ef02d9fe07 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>  
>  #define MAX_PTE_MAPPED_THP 8
>  
> +struct collapse_control {
> +	/* Num pages scanned per node */
> +	int node_load[MAX_NUMNODES];

Does this actually need to be 32-bit?  Looking at the current code I'm
suspecting that khugepaged_node_load[] could be a ushort?

[And unsigned int would be more appropriate, but we always do that :(]




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-08 21:01   ` Andrew Morton
@ 2022-07-11 18:29     ` Zach O'Keefe
  2022-07-11 18:45       ` Andrew Morton
  2022-07-11 21:51       ` Yang Shi
  0 siblings, 2 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-11 18:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jul 8, 2022 at 2:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Jul 2022 16:59:21 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
>
> > Modularize hugepage collapse by introducing struct collapse_control.
> > This structure serves to describe the properties of the requested
> > collapse, as well as serve as a local scratch pad to use during the
> > collapse itself.
> >
> > Start by moving global per-node khugepaged statistics into this
> > new structure.  Note that this structure is still statically allocated
> > since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
> > a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
> >  1 file changed, 50 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 196eaadbf415..f1ef02d9fe07 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> >
> >  #define MAX_PTE_MAPPED_THP 8
> >
> > +struct collapse_control {
> > +     /* Num pages scanned per node */
> > +     int node_load[MAX_NUMNODES];
>
> Does this actually need to be 32-bit?  Looking at the current code I'm
> suspecting that khugepaged_node_load[] could be a ushort?
>
> [And unsigned int would be more appropriate, but we always do that :(]
>

Hey Andrew,

Thanks for taking the time to review, and good catch - I don't think
we need 32 bits.

Minimally, we just need to be able to hold the maximum value of
HPAGE_PMD_NR = 1 << (PMD_SHIFT - PAGE_SHIFT).

I'm not sure what arch/config options (that also use THP) produce the
minimum/maximum value here. I looked through most of the archs that
define PMD_SHIFT, and couldn't find an example where we'd need > 16
bits, with most cases still requiring > 8 bits. All the various
configs do get complicated though.

Is it acceptable to use u16, with an #error if HPAGE_PMD_ORDER >= 16?
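
Roughly, the idea would look something like this (just a sketch of the
question above, not a proposed patch; whether HPAGE_PMD_ORDER is usable in a
preprocessor conditional on all configs is part of what I'm unsure about):

/* u16 holds up to 65535; HPAGE_PMD_NR = 1 << HPAGE_PMD_ORDER must not exceed that */
#if HPAGE_PMD_ORDER >= 16
#error "u16 node_load[] too narrow to count HPAGE_PMD_NR pages"
#endif

struct collapse_control {
	/* Num pages scanned per node */
	u16 node_load[MAX_NUMNODES];
	/* ... other fields unchanged ... */
};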

Thanks,
Zach


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-11 18:29     ` Zach O'Keefe
@ 2022-07-11 18:45       ` Andrew Morton
  2022-07-12 14:17         ` Zach O'Keefe
  2022-07-11 21:51       ` Yang Shi
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2022-07-11 18:45 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, 11 Jul 2022 11:29:13 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> On Fri, Jul 8, 2022 at 2:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed,  6 Jul 2022 16:59:21 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> >
> > > Modularize hugepage collapse by introducing struct collapse_control.
> > > This structure serves to describe the properties of the requested
> > > collapse, as well as serve as a local scratch pad to use during the
> > > collapse itself.
> > >
> > > Start by moving global per-node khugepaged statistics into this
> > > new structure.  Note that this structure is still statically allocated
> > > since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
> > > a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
> > >  1 file changed, 50 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 196eaadbf415..f1ef02d9fe07 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> > >
> > >  #define MAX_PTE_MAPPED_THP 8
> > >
> > > +struct collapse_control {
> > > +     /* Num pages scanned per node */
> > > +     int node_load[MAX_NUMNODES];
> >
> > Does this actually need to be 32-bit?  Looking at the current code I'm
> > suspecting that khugepaged_node_load[] could be a ushort?
> >
> > [And unsigned int would be more appropriate, but we always do that :(]
> >
> 
> Hey Andrew,
> 
> Thanks for taking the time to review, and good catch - I don't think
> we need 32 bits.
> 
> Minimally, we just need to be able to hold the maximum value of
> HPAGE_PMD_NR = 1 << (PMD_SHIFT - PAGE_SHIFT).
> 
> I'm not sure what arch/config options (that also use THP) produce the
> minimum/maximum value here. I looked through most of the archs that
> define PMD_SHIFT, and couldn't find an example where we'd need > 16
> bits, with most cases still requiring > 8 bits. All the various
> configs do get complicated though.
> 
> Is it acceptable to use u16, with an #error if HPAGE_PMD_ORDER >= 16?

It might be ;)

It was just a thought - perhaps something which you or someone else
might choose to look at, but I don't think this work needs to be part
of the current series, unless the current series consumes egregious
amounts of memory.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check
  2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
@ 2022-07-11 20:38   ` Yang Shi
  2022-07-12 17:14     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 20:38 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> transhuge_vma_suitable() is called twice in hugepage_vma_revalidate()
> path.  Remove the first check, and rely on the second check inside
> hugepage_vma_check().
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfe231c5958f..5269d15e20f6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -918,8 +918,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         if (!vma)
>                 return SCAN_VMA_NULL;
>
> -       if (!transhuge_vma_suitable(vma, address))
> -               return SCAN_ADDRESS_RANGE;

It seems this is the only user of SCAN_ADDRESS_RANGE, so
SCAN_ADDRESS_RANGE could be deleted as well.

>         if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
>                 return SCAN_VMA_CHECK;
>         /*
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior
  2022-07-06 23:59 ` [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior Zach O'Keefe
@ 2022-07-11 20:43   ` Yang Shi
  2022-07-12 17:06     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 20:43 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add .is_khugepaged flag to struct collapse_control so
> khugepaged-specific behavior can be elided by MADV_COLLAPSE context.
>
> Start by protecting khugepaged-specific heuristics by this flag. In
> MADV_COLLAPSE, the user presumably has reason to believe the collapse
> will be beneficial and khugepaged heuristics shouldn't prevent the user
> from doing so:
>
> 1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
>
> 2) requirement that some pages in region being collapsed be young or
>    referenced
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>
> v6 -> v7: There is no functional change here from v6, just a renaming of
>           flags to explicitly be predicated on khugepaged.

Reviewed-by: Yang Shi <shy828301@gmail.com>

Just a nit: some conditions check is_khugepaged first, some don't. Why
not make them consistent by checking is_khugepaged first?
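
For illustration, a uniform "flag first" ordering might read like the sketch
below (a hypothetical rewrite of one hunk above, not tested). One wrinkle:
hoisting !cc->is_khugepaged ahead of the ++ counters also short-circuits those
increments in the MADV_COLLAPSE case, so the none_or_zero value later passed
to the tracepoint would change:

	if (!userfaultfd_armed(vma) &&
	    (!cc->is_khugepaged ||
	     ++none_or_zero <= khugepaged_max_ptes_none)) {
		continue;
	} else {
		result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		goto out;
	}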

> ---
>  mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 43 insertions(+), 19 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 147f5828f052..d89056d8cbad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -73,6 +73,8 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
>   * default collapse hugepages if there is at least one pte mapped like
>   * it would have happened if the vma was large enough during page
>   * fault.
> + *
> + * Note that these are only respected if collapse was initiated by khugepaged.
>   */
>  static unsigned int khugepaged_max_ptes_none __read_mostly;
>  static unsigned int khugepaged_max_ptes_swap __read_mostly;
> @@ -86,6 +88,8 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>  #define MAX_PTE_MAPPED_THP 8
>
>  struct collapse_control {
> +       bool is_khugepaged;
> +
>         /* Num pages scanned per node */
>         int node_load[MAX_NUMNODES];
>
> @@ -554,6 +558,7 @@ static bool is_refcount_suitable(struct page *page)
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                                         unsigned long address,
>                                         pte_t *pte,
> +                                       struct collapse_control *cc,
>                                         struct list_head *compound_pagelist)
>  {
>         struct page *page = NULL;
> @@ -567,7 +572,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 if (pte_none(pteval) || (pte_present(pteval) &&
>                                 is_zero_pfn(pte_pfn(pteval)))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !cc->is_khugepaged)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -587,8 +593,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>                 VM_BUG_ON_PAGE(!PageAnon(page), page);
>
> -               if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +               if (cc->is_khugepaged && page_mapcount(page) > 1 &&
> +                   ++shared > khugepaged_max_ptes_shared) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out;
> @@ -654,10 +660,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 if (PageCompound(page))
>                         list_add_tail(&page->lru, compound_pagelist);
>  next:
> -               /* There should be enough young pte to collapse the page */
> -               if (pte_young(pteval) ||
> -                   page_is_young(page) || PageReferenced(page) ||
> -                   mmu_notifier_test_young(vma->vm_mm, address))
> +               /*
> +                * If collapse was initiated by khugepaged, check that there is
> +                * enough young pte to justify collapsing the page
> +                */
> +               if (cc->is_khugepaged &&
> +                   (pte_young(pteval) || page_is_young(page) ||
> +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> +                                                                    address)))
>                         referenced++;
>
>                 if (pte_write(pteval))
> @@ -666,7 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>         if (unlikely(!writable)) {
>                 result = SCAN_PAGE_RO;
> -       } else if (unlikely(!referenced)) {
> +       } else if (unlikely(cc->is_khugepaged && !referenced)) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> @@ -745,6 +755,7 @@ static void khugepaged_alloc_sleep(void)
>
>
>  struct collapse_control khugepaged_collapse_control = {
> +       .is_khugepaged = true,
>         .last_target_node = NUMA_NO_NODE,
>  };
>
> @@ -1023,7 +1034,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         mmu_notifier_invalidate_range_end(&range);
>
>         spin_lock(pte_ptl);
> -       result =  __collapse_huge_page_isolate(vma, address, pte,
> +       result =  __collapse_huge_page_isolate(vma, address, pte, cc,
>                                                &compound_pagelist);
>         spin_unlock(pte_ptl);
>
> @@ -1114,7 +1125,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>              _pte++, _address += PAGE_SIZE) {
>                 pte_t pteval = *_pte;
>                 if (is_swap_pte(pteval)) {
> -                       if (++unmapped <= khugepaged_max_ptes_swap) {
> +                       if (++unmapped <= khugepaged_max_ptes_swap ||
> +                           !cc->is_khugepaged) {
>                                 /*
>                                  * Always be strict with uffd-wp
>                                  * enabled swap entries.  Please see
> @@ -1133,7 +1145,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                 }
>                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !cc->is_khugepaged)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -1163,8 +1176,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                         goto out_unmap;
>                 }
>
> -               if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +               if (cc->is_khugepaged &&
> +                   page_mapcount(page) > 1 &&
> +                   ++shared > khugepaged_max_ptes_shared) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out_unmap;
> @@ -1218,14 +1232,22 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                         result = SCAN_PAGE_COUNT;
>                         goto out_unmap;
>                 }
> -               if (pte_young(pteval) ||
> -                   page_is_young(page) || PageReferenced(page) ||
> -                   mmu_notifier_test_young(vma->vm_mm, address))
> +
> +               /*
> +                * If collapse was initiated by khugepaged, check that there is
> +                * enough young pte to justify collapsing the page
> +                */
> +               if (cc->is_khugepaged &&
> +                   (pte_young(pteval) || page_is_young(page) ||
> +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> +                                                                    address)))
>                         referenced++;
>         }
>         if (!writable) {
>                 result = SCAN_PAGE_RO;
> -       } else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
> +       } else if (cc->is_khugepaged &&
> +                  (!referenced ||
> +                   (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> @@ -1894,7 +1916,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                         continue;
>
>                 if (xa_is_value(page)) {
> -                       if (++swap > khugepaged_max_ptes_swap) {
> +                       if (cc->is_khugepaged &&
> +                           ++swap > khugepaged_max_ptes_swap) {
>                                 result = SCAN_EXCEED_SWAP_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>                                 break;
> @@ -1945,7 +1968,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>         rcu_read_unlock();
>
>         if (result == SCAN_SUCCEED) {
> -               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
> +                   cc->is_khugepaged) {
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
  2022-07-06 23:59 ` [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() Zach O'Keefe
@ 2022-07-11 20:57   ` Yang Shi
  2022-07-12 16:58     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 20:57 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
>
> hugepage_vma_check() is the authority on determining if a VMA is eligible
> for THP allocation/collapse, and currently enforces the sysfs THP settings.
> Add a flag to disable these checks.  For now, only apply this arg to anon
> and file, which use /sys/kernel/transparent_hugepage/enabled.  We can
> expand this to shmem, which uses
> /sys/kernel/transparent_hugepage/shmem_enabled, later.
>
> Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
> passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
> VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check
> THP flag in hugepage_vma_check()", this check also didn't check "never" THP
> mode.  As such, this restores the previous behavior of
> collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
> comment in code for justification why this is OK.
>
> [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  fs/proc/task_mmu.c      |  2 +-
>  include/linux/huge_mm.h |  9 ++++-----
>  mm/huge_memory.c        | 14 ++++++--------
>  mm/khugepaged.c         | 25 ++++++++++++++-----------
>  mm/memory.c             |  4 ++--
>  5 files changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 34d292cec79a..f8cd58846a28 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -866,7 +866,7 @@ static int show_smap(struct seq_file *m, void *v)
>         __show_smap(m, &mss, false);
>
>         seq_printf(m, "THPeligible:    %d\n",
> -                  hugepage_vma_check(vma, vma->vm_flags, true, false));
> +                  hugepage_vma_check(vma, vma->vm_flags, true, false, true));
>
>         if (arch_pkeys_enabled())
>                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 37f2f11a6d7e..00312fc251c1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -168,9 +168,8 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>                !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>  }
>
> -bool hugepage_vma_check(struct vm_area_struct *vma,
> -                       unsigned long vm_flags,
> -                       bool smaps, bool in_pf);
> +bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> +                       bool smaps, bool in_pf, bool enforce_sysfs);
>
>  #define transparent_hugepage_use_zero_page()                           \
>         (transparent_hugepage_flags &                                   \
> @@ -321,8 +320,8 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>  }
>
>  static inline bool hugepage_vma_check(struct vm_area_struct *vma,
> -                                      unsigned long vm_flags,
> -                                      bool smaps, bool in_pf)
> +                                     unsigned long vm_flags, bool smaps,
> +                                     bool in_pf, bool enforce_sysfs)
>  {
>         return false;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index da300ce9dedb..4fbe43dc1568 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -69,9 +69,8 @@ static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>
> -bool hugepage_vma_check(struct vm_area_struct *vma,
> -                       unsigned long vm_flags,
> -                       bool smaps, bool in_pf)
> +bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> +                       bool smaps, bool in_pf, bool enforce_sysfs)
>  {
>         if (!vma->vm_mm)                /* vdso */
>                 return false;
> @@ -120,11 +119,10 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>         if (!in_pf && shmem_file(vma->vm_file))
>                 return shmem_huge_enabled(vma);
>
> -       if (!hugepage_flags_enabled())
> -               return false;
> -
> -       /* THP settings require madvise. */
> -       if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always())
> +       /* Enforce sysfs THP requirements as necessary */
> +       if (enforce_sysfs &&
> +           (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
> +                                          !hugepage_flags_always())))
>                 return false;
>
>         /* Only regular file is valid */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d89056d8cbad..b0e20db3f805 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -478,7 +478,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  {
>         if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>             hugepage_flags_enabled()) {
> -               if (hugepage_vma_check(vma, vm_flags, false, false))
> +               if (hugepage_vma_check(vma, vm_flags, false, false, true))
>                         __khugepaged_enter(vma->vm_mm);
>         }
>  }
> @@ -844,7 +844,8 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>   */
>
>  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> -               struct vm_area_struct **vmap)
> +                                  struct vm_area_struct **vmap,
> +                                  struct collapse_control *cc)
>  {
>         struct vm_area_struct *vma;
>
> @@ -855,7 +856,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         if (!vma)
>                 return SCAN_VMA_NULL;
>
> -       if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
> +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> +                               cc->is_khugepaged))
>                 return SCAN_VMA_CHECK;
>         /*
>          * Anon VMA expected, the address may be unmapped then
> @@ -974,7 +976,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                 goto out_nolock;
>
>         mmap_read_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
>         if (result != SCAN_SUCCEED) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
> @@ -1006,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          * handled by the anon_vma lock + PG_lock.
>          */
>         mmap_write_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>         /* check if the pmd is still valid */
> @@ -1350,12 +1352,13 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>                 return;
>
>         /*
> -        * This vm_flags may not have VM_HUGEPAGE if the page was not
> -        * collapsed by this mm. But we can still collapse if the page is
> -        * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
> -        * will not fail the vma for missing VM_HUGEPAGE
> +        * If we are here, we've succeeded in replacing all the native pages
> +        * in the page cache with a single hugepage. If a mm were to fault-in
> +        * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
> +        * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> +        * analogously elide sysfs THP settings here.
>          */
> -       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE, false, false))
> +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
>                 return;
>
>         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> @@ -2042,7 +2045,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                         progress++;
>                         break;
>                 }
> -               if (!hugepage_vma_check(vma, vma->vm_flags, false, false)) {
> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
>  skip:
>                         progress++;
>                         continue;
> diff --git a/mm/memory.c b/mm/memory.c
> index 8917bea2f0bc..96cd776e84f1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5001,7 +5001,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 return VM_FAULT_OOM;
>  retry_pud:
>         if (pud_none(*vmf.pud) &&
> -           hugepage_vma_check(vma, vm_flags, false, true)) {
> +           hugepage_vma_check(vma, vm_flags, false, true, true)) {
>                 ret = create_huge_pud(&vmf);
>                 if (!(ret & VM_FAULT_FALLBACK))
>                         return ret;
> @@ -5035,7 +5035,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 goto retry_pud;
>
>         if (pmd_none(*vmf.pmd) &&
> -           hugepage_vma_check(vma, vm_flags, false, true)) {
> +           hugepage_vma_check(vma, vm_flags, false, true, true)) {
>                 ret = create_huge_pmd(&vmf);
>                 if (!(ret & VM_FAULT_FALLBACK))
>                         return ret;
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
  2022-07-06 23:59 ` [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage Zach O'Keefe
@ 2022-07-11 21:03   ` Yang Shi
  2022-07-12 16:50     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 21:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> When scanning an anon pmd to see if it's eligible for collapse, return
> SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
> SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> file-collapse path, since the latter might identify pte-mapped compound
> pages.  This is required by MADV_COLLAPSE which necessarily needs to
> know what hugepage-aligned/sized regions are already pmd-mapped.
>
> In order to determine if a pmd already maps a hugepage, refactor
> mm_find_pmd():
>
> Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
> fix buggy race with THP fault") behavior.  ksm was the only caller that
> explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
> there (pmd_present() and pmd_trans_huge() checks).
>
> Undo the change from commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race
> with THP fault") that open-coded the split_huge_pmd_address() pmd lookup, and
> use mm_find_pmd() there instead.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/huge_memory.c                   | 18 +--------
>  mm/internal.h                      |  2 +-
>  mm/khugepaged.c                    | 60 ++++++++++++++++++++++++------
>  mm/ksm.c                           | 10 +++++
>  mm/rmap.c                          | 15 +++-----
>  6 files changed, 67 insertions(+), 39 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d651f3437367..55392bf30a03 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -11,6 +11,7 @@
>         EM( SCAN_FAIL,                  "failed")                       \
>         EM( SCAN_SUCCEED,               "succeeded")                    \
>         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
>         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
>         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
>         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4fbe43dc1568..fb76db6c703e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2363,25 +2363,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>                 bool freeze, struct folio *folio)
>  {
> -       pgd_t *pgd;
> -       p4d_t *p4d;
> -       pud_t *pud;
> -       pmd_t *pmd;
> +       pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
>
> -       pgd = pgd_offset(vma->vm_mm, address);
> -       if (!pgd_present(*pgd))
> +       if (!pmd)
>                 return;
>
> -       p4d = p4d_offset(pgd, address);
> -       if (!p4d_present(*p4d))
> -               return;
> -
> -       pud = pud_offset(p4d, address);
> -       if (!pud_present(*pud))
> -               return;
> -
> -       pmd = pmd_offset(pud, address);
> -
>         __split_huge_pmd(vma, pmd, address, freeze, folio);
>  }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6e14749ad1e5..ef8c23fb678f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -188,7 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
>  /*
>   * in mm/rmap.c:
>   */
> -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b0e20db3f805..c7a09cc9a0e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -28,6 +28,7 @@ enum scan_result {
>         SCAN_FAIL,
>         SCAN_SUCCEED,
>         SCAN_PMD_NULL,
> +       SCAN_PMD_MAPPED,
>         SCAN_EXCEED_NONE_PTE,
>         SCAN_EXCEED_SWAP_PTE,
>         SCAN_EXCEED_SHARED_PTE,
> @@ -871,6 +872,45 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         return SCAN_SUCCEED;
>  }
>
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +                                  unsigned long address,
> +                                  pmd_t **pmd)
> +{
> +       pmd_t pmde;
> +
> +       *pmd = mm_find_pmd(mm, address);
> +       if (!*pmd)
> +               return SCAN_PMD_NULL;
> +
> +       pmde = pmd_read_atomic(*pmd);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +       barrier();
> +#endif
> +       if (!pmd_present(pmde))
> +               return SCAN_PMD_NULL;
> +       if (pmd_trans_huge(pmde))
> +               return SCAN_PMD_MAPPED;
> +       if (pmd_bad(pmde))
> +               return SCAN_PMD_NULL;
> +       return SCAN_SUCCEED;
> +}
> +
> +static int check_pmd_still_valid(struct mm_struct *mm,
> +                                unsigned long address,
> +                                pmd_t *pmd)
> +{
> +       pmd_t *new_pmd;
> +       int result = find_pmd_or_thp_or_none(mm, address, &new_pmd);
> +
> +       if (result != SCAN_SUCCEED)
> +               return result;
> +       if (new_pmd != pmd)
> +               return SCAN_FAIL;
> +       return SCAN_SUCCEED;
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> @@ -982,9 +1022,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                 goto out_nolock;
>         }
>
> -       pmd = mm_find_pmd(mm, address);
> -       if (!pmd) {
> -               result = SCAN_PMD_NULL;
> +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +       if (result != SCAN_SUCCEED) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
>         }
> @@ -1012,7 +1051,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>         /* check if the pmd is still valid */
> -       if (mm_find_pmd(mm, address) != pmd)
> +       result = check_pmd_still_valid(mm, address, pmd);
> +       if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>
>         anon_vma_lock_write(vma->anon_vma);
> @@ -1115,11 +1155,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -       pmd = mm_find_pmd(mm, address);
> -       if (!pmd) {
> -               result = SCAN_PMD_NULL;
> +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +       if (result != SCAN_SUCCEED)
>                 goto out;
> -       }
>
>         memset(cc->node_load, 0, sizeof(cc->node_load));
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1373,8 +1411,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>         if (!PageHead(hpage))
>                 goto drop_hpage;
>
> -       pmd = mm_find_pmd(mm, haddr);
> -       if (!pmd)
> +       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
>                 goto drop_hpage;
>
>         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> @@ -1492,8 +1529,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                 if (vma->vm_end < addr + HPAGE_PMD_SIZE)
>                         continue;
>                 mm = vma->vm_mm;
> -               pmd = mm_find_pmd(mm, addr);
> -               if (!pmd)
> +               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
>                         continue;
>                 /*
>                  * We need exclusive mmap_lock to retract page table.
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 075123602bd0..3e0a0a42fa1f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1136,6 +1136,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         pmd_t *pmd;
> +       pmd_t pmde;
>         pte_t *ptep;
>         pte_t newpte;
>         spinlock_t *ptl;
> @@ -1150,6 +1151,15 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>         pmd = mm_find_pmd(mm, addr);
>         if (!pmd)
>                 goto out;
> +       /*
> +        * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> +        * without holding anon_vma lock for write.  So when looking for a
> +        * genuine pmde (in which to find pte), test present and !THP together.
> +        */
> +       pmde = *pmd;
> +       barrier();
> +       if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> +               goto out;
>
>         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
>                                 addr + PAGE_SIZE);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index edc06c52bc82..af775855e58f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -767,13 +767,17 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>         return vma_address(page, vma);
>  }
>
> +/*
> + * Returns the actual pmd_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist.  No guarantees / checks on what the pmd_t*
> + * represents.
> + */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd = NULL;
> -       pmd_t pmde;
>
>         pgd = pgd_offset(mm, address);
>         if (!pgd_present(*pgd))
> @@ -788,15 +792,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>                 goto out;
>
>         pmd = pmd_offset(pud, address);
> -       /*
> -        * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> -        * without holding anon_vma lock for write.  So when looking for a
> -        * genuine pmde (in which to find pte), test present and !THP together.
> -        */
> -       pmde = *pmd;
> -       barrier();
> -       if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> -               pmd = NULL;
>  out:
>         return pmd;
>  }
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-07-06 23:59 ` [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-07-11 21:22   ` Yang Shi
  2022-07-12 16:54     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 21:22 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	linux-api

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> This idea was introduced by David Rientjes[1].
>
> Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> synchronous collapse of memory at their own expense.
>
> The benefits of this approach are:
>
> * CPU is charged to the process that wants to spend the cycles for the
>   THP
> * Avoid unpredictable timing of khugepaged collapse
>
> Semantics
>
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> multiple VMAs, the semantics of the collapse over each VMA is
> independent from the others.  This implies a hugepage cannot cross a VMA
> boundary.  If collapse of a given hugepage-aligned/sized region fails,
> the operation may continue to attempt collapsing the remainder of memory
> specified.
>
> The memory ranges provided must be page-aligned, but are not required to
> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> start/end of the range will be clamped to the first/last
> hugepage-aligned address covered by said range.  The memory ranges must
> span at least one hugepage-sized region.
>
> All non-resident pages covered by the range will first be
> swapped/faulted-in, before being internally copied onto a freshly
> allocated hugepage.  Unmapped pages will have their data directly
> initialized to 0 in the new hugepage.  However, for every eligible hugepage
> aligned/sized region to-be collapsed, at least one page must currently be
> backed by memory (a PMD covering the address range must already exist).
>
> Allocation for the new hugepage may enter direct reclaim and/or
> compaction, regardless of VMA flags.  When the system has multiple NUMA
> nodes, the hugepage will be allocated from the node providing the most
> native pages.  This operation operates on the current state of the
> specified process and makes no persistent changes or guarantees on how
> pages will be mapped, constructed, or faulted in the future.
>
> Return Value
>
> If all hugepage-sized/aligned regions covered by the provided range were
> either successfully collapsed, or were already PMD-mapped THPs, this
> operation will be deemed successful.  On success, process_madvise(2)
> returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
> is returned and errno is set to indicate the error for the most-recently
> attempted hugepage collapse.  Note that many failures might have
> occurred, since the operation may continue to collapse in the event a
> single hugepage-sized/aligned region fails.
>
>         ENOMEM  Memory allocation failed or VMA not found
>         EBUSY   Memcg charging failed
>         EAGAIN  Required resource temporarily unavailable.  Trying again
>                 might succeed.
>         EINVAL  Other error: No PMD found, subpage doesn't have Present
>                 bit set, "Special" page not backed by struct page, VMA
>                 incorrectly sized, address not page-aligned, ...
>
> Most notable here are ENOMEM and EBUSY (new to madvise) which are
> intended to provide the caller with actionable feedback so they may take
> an appropriate fallback measure.
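To make the semantics and return values above concrete, here is a minimal
userspace sketch (illustrative only, not part of the patch; it assumes 2MiB
PMD-sized hugepages and defines MADV_COLLAPSE locally in case the installed
uapi headers predate this series):

        #define _GNU_SOURCE
        #include <errno.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_COLLAPSE
        #define MADV_COLLAPSE 25        /* value used by the uapi headers in this patch */
        #endif

        #define HPAGE_SIZE (2UL << 20)  /* assumed PMD-sized hugepage (x86-64) */

        int main(void)
        {
                size_t len = 4 * HPAGE_SIZE;
                /* Large allocation; typically backed by private anonymous memory */
                char *buf = aligned_alloc(HPAGE_SIZE, len);

                if (!buf)
                        return 1;
                /* Ensure at least one page per hugepage-sized region is resident */
                memset(buf, 1, len);

                if (madvise(buf, len, MADV_COLLAPSE)) {
                        switch (errno) {
                        case ENOMEM:    /* hugepage allocation failed */
                        case EBUSY:     /* memcg charge failed */
                        case EAGAIN:    /* transient; retrying may succeed */
                                fprintf(stderr, "collapse failed, may retry: %s\n",
                                        strerror(errno));
                                break;
                        default:        /* EINVAL etc.: retrying is unlikely to help */
                                fprintf(stderr, "collapse failed: %s\n",
                                        strerror(errno));
                        }
                        return 1;
                }
                return 0;
        }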

Don't forget to update man-pages. And cc'ed linux-api.

>
> Use Cases
>
> An immediate user of this new functionality are malloc() implementations
> that manage memory in hugepage-sized chunks, but sometimes subrelease
> memory back to the system in native-sized chunks via MADV_DONTNEED;
> zapping the pmd.  Later, when the memory is hot, the implementation
> could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> hugepage coverage and dTLB performance.  TCMalloc is such an
> implementation that could benefit from this[2].
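A rough sketch of that allocator pattern (illustrative only, not TCMalloc code;
the hugepage size, native page size, and MADV_COLLAPSE value are assumptions as
in the sketch above):

        #include <sys/mman.h>

        #ifndef MADV_COLLAPSE
        #define MADV_COLLAPSE 25
        #endif
        #define HPAGE_SIZE   (2UL << 20)  /* assumed PMD-sized hugepage */
        #define NATIVE_PAGE  4096UL       /* assumed native page size */

        /* Subrelease one cold native-sized page inside a hugepage-backed chunk;
         * as described above, this zaps the covering PMD. */
        int release_cold_page(void *page)
        {
                return madvise(page, NATIVE_PAGE, MADV_DONTNEED);
        }

        /* Later, when the chunk is hot again, ask for PMD backing back. */
        int recollapse_chunk(void *chunk_hugepage_aligned)
        {
                return madvise(chunk_hugepage_aligned, HPAGE_SIZE, MADV_COLLAPSE);
        }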
>
> Only privately-mapped anon memory is supported for now, but additional
> support for file, shmem, and HugeTLB high-granularity mappings[2] is
> expected.  File and tmpfs/shmem support would permit:
>
> * Backing executable text by THPs.  Current support provided by
>   CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which
>   might keep services from serving at their full rated load after
>   (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
>   immediately realize iTLB performance prevent page sharing and demand
>   paging, both of which increase steady state memory footprint.  With
>   MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
>   and lower RAM footprints.
> * Backing guest memory by hugepages after the memory contents have been
>   migrated in native-page-sized chunks to a new host, in a
>   userfaultfd-based live-migration stack.
>
> [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  arch/alpha/include/uapi/asm/mman.h           |   2 +
>  arch/mips/include/uapi/asm/mman.h            |   2 +
>  arch/parisc/include/uapi/asm/mman.h          |   2 +
>  arch/xtensa/include/uapi/asm/mman.h          |   2 +
>  include/linux/huge_mm.h                      |  14 ++-
>  include/uapi/asm-generic/mman-common.h       |   2 +
>  mm/khugepaged.c                              | 118 ++++++++++++++++++-
>  mm/madvise.c                                 |   5 +
>  tools/include/uapi/asm-generic/mman-common.h |   2 +
>  9 files changed, 146 insertions(+), 3 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 4aa996423b0d..763929e814e9 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -76,6 +76,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 1be428663c10..c6e1fc77c996 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -103,6 +103,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index a7ea3204a5fa..22133a6a506e 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -70,6 +70,8 @@
>  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
>
> +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> +
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 7966a58af472..1ff0c858544f 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -111,6 +111,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 00312fc251c1..39193623442e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -218,6 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                      int advice);
> +int madvise_collapse(struct vm_area_struct *vma,
> +                    struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                            unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -361,9 +364,16 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>  static inline int hugepage_madvise(struct vm_area_struct *vma,
>                                    unsigned long *vm_flags, int advice)
>  {
> -       BUG();
> -       return 0;
> +       return -EINVAL;
>  }
> +
> +static inline int madvise_collapse(struct vm_area_struct *vma,
> +                                  struct vm_area_struct **prev,
> +                                  unsigned long start, unsigned long end)
> +{
> +       return -EINVAL;
> +}
> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                          unsigned long start,
>                                          unsigned long end,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c7a09cc9a0e8..2b2d832e44f2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -976,7 +976,8 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>                               struct collapse_control *cc)
>  {
>         /* Only allocate from the target node */
> -       gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> +                    GFP_TRANSHUGE) | __GFP_THISNODE;
>         int node = khugepaged_find_target_node(cc);
>
>         if (!khugepaged_alloc_page(hpage, gfp, node))
> @@ -2356,3 +2357,118 @@ void khugepaged_min_free_kbytes_update(void)
>                 set_recommended_min_free_kbytes();
>         mutex_unlock(&khugepaged_mutex);
>  }
> +
> +static int madvise_collapse_errno(enum scan_result r)
> +{
> +       /*
> +        * MADV_COLLAPSE breaks from existing madvise(2) conventions to provide
> +        * actionable feedback to caller, so they may take an appropriate
> +        * fallback measure depending on the nature of the failure.
> +        */
> +       switch (r) {
> +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +               return -ENOMEM;
> +       case SCAN_CGROUP_CHARGE_FAIL:
> +               return -EBUSY;
> +       /* Resource temporary unavailable - trying again might succeed */
> +       case SCAN_PAGE_LOCK:
> +       case SCAN_PAGE_LRU:
> +               return -EAGAIN;
> +       /*
> +        * Other: Trying again likely not to succeed / error intrinsic to
> +        * specified memory range. khugepaged likely won't be able to collapse
> +        * either.
> +        */
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end)
> +{
> +       struct collapse_control *cc;
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long hstart, hend, addr;
> +       int thps = 0, last_fail = SCAN_FAIL;
> +       bool mmap_locked = true;
> +
> +       BUG_ON(vma->vm_start > start);
> +       BUG_ON(vma->vm_end < end);
> +
> +       cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> +       if (!cc)
> +               return -ENOMEM;
> +       cc->is_khugepaged = false;
> +       cc->last_target_node = NUMA_NO_NODE;
> +
> +       *prev = vma;
> +
> +       /* TODO: Support file/shmem */
> +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> +               return -EINVAL;
> +
> +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> +       hend = end & HPAGE_PMD_MASK;
> +
> +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> +               return -EINVAL;
> +
> +       mmgrab(mm);
> +       lru_add_drain_all();
> +
> +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> +               int result = SCAN_FAIL;
> +
> +               if (!mmap_locked) {
> +                       cond_resched();
> +                       mmap_read_lock(mm);
> +                       mmap_locked = true;
> +                       result = hugepage_vma_revalidate(mm, addr, &vma, cc);
> +                       if (result  != SCAN_SUCCEED) {
> +                               last_fail = result;
> +                               goto out_nolock;
> +                       }
> +               }
> +               mmap_assert_locked(mm);
> +               memset(cc->node_load, 0, sizeof(cc->node_load));
> +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> +               if (!mmap_locked)
> +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> +
> +               switch (result) {
> +               case SCAN_SUCCEED:
> +               case SCAN_PMD_MAPPED:
> +                       ++thps;
> +                       break;
> +               /* Whitelisted set of results where continuing OK */
> +               case SCAN_PMD_NULL:
> +               case SCAN_PTE_NON_PRESENT:
> +               case SCAN_PTE_UFFD_WP:
> +               case SCAN_PAGE_RO:
> +               case SCAN_LACK_REFERENCED_PAGE:
> +               case SCAN_PAGE_NULL:
> +               case SCAN_PAGE_COUNT:
> +               case SCAN_PAGE_LOCK:
> +               case SCAN_PAGE_COMPOUND:
> +               case SCAN_PAGE_LRU:
> +                       last_fail = result;
> +                       break;
> +               default:
> +                       last_fail = result;
> +                       /* Other error, exit */
> +                       goto out_maybelock;
> +               }
> +       }
> +
> +out_maybelock:
> +       /* Caller expects us to hold mmap_lock on return */
> +       if (!mmap_locked)
> +               mmap_read_lock(mm);
> +out_nolock:
> +       mmap_assert_locked(mm);
> +       mmdrop(mm);
> +
> +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> +                       : madvise_collapse_errno(last_fail);
> +}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 851fa4e134bc..9f08e958ea86 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_FREE:
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
> +       case MADV_COLLAPSE:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_COLLAPSE:
> +               return madvise_collapse(vma, prev, start, end);
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
> +       case MADV_COLLAPSE:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  2022-07-06 23:59 ` [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint Zach O'Keefe
@ 2022-07-11 21:32   ` Yang Shi
  2022-07-12 16:21     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 21:32 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add a tracepoint to expose mm, address, and enum scan_result of each
> hugepage attempted to be collapsed by call to madvise(MADV_COLLAPSE).

Is this necessary? Isn't the mm_khugepaged_scan_pmd tracepoint good
enough? It doesn't have "address", but you should be able to calculate
the address from it by combining it with a syscall trace.


>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  include/trace/events/huge_memory.h | 22 ++++++++++++++++++++++
>  mm/khugepaged.c                    |  2 ++
>  2 files changed, 24 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 55392bf30a03..38d339ffdb16 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -167,5 +167,27 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
>                 __entry->ret)
>  );
>
> +TRACE_EVENT(mm_madvise_collapse,
> +
> +       TP_PROTO(struct mm_struct *mm, unsigned long addr, int result),
> +
> +       TP_ARGS(mm, addr, result),
> +
> +       TP_STRUCT__entry(__field(struct mm_struct *, mm)
> +                        __field(unsigned long, addr)
> +                        __field(int, result)
> +       ),
> +
> +       TP_fast_assign(__entry->mm = mm;
> +                      __entry->addr = addr;
> +                      __entry->result = result;
> +       ),
> +
> +       TP_printk("mm=%p addr=%#lx result=%s",
> +                 __entry->mm,
> +                 __entry->addr,
> +                 __print_symbolic(__entry->result, SCAN_STATUS))
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e0d00180512c..0207fc0a5b2a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2438,6 +2438,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                 if (!mmap_locked)
>                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
>
> +               trace_mm_madvise_collapse(mm, addr, result);
> +
>                 switch (result) {
>                 case SCAN_SUCCEED:
>                 case SCAN_PMD_MAPPED:
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-06 23:59 ` [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps Zach O'Keefe
@ 2022-07-11 21:37   ` Yang Shi
  2022-07-12 16:31     ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-11 21:37 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add PMDMappable field to smaps output which informs the user if memory
> in the VMA can be PMD-mapped by MADV_COLLAPSE.
>
> The distinction from THPeligible is needed for two reasons:
>
> 1) For THP, MADV_COLLAPSE is not coupled to THP sysfs controls, which
>    THPeligible reports.
>
> 2) PMDMappable can also be used in HugeTLB fine-granularity mappings,
>    which are independent from THP.
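For concreteness, one way a userspace agent might consume the new field before
issuing MADV_COLLAPSE (hypothetical sketch, not part of this series; it simply
counts the calling process's VMAs that report the field as 1):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                FILE *f = fopen("/proc/self/smaps", "r");
                char line[256];
                int mappable = 0;

                if (!f)
                        return 1;
                /* The value printed for "PMDMappable:" is either 0 or 1 */
                while (fgets(line, sizeof(line), f))
                        if (!strncmp(line, "PMDMappable:", 12) && strchr(line, '1'))
                                mappable++;
                fclose(f);
                printf("VMAs reporting PMDMappable: 1 -> %d\n", mappable);
                return 0;
        }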

Could you please elaborate on the use case? Does the user check this hint
before calling MADV_COLLAPSE? Is it really necessary?

And, TBH it sounds confusing, and we don't have to maintain both
THPeligible and PMDMappable. We could just relax THPeligible to make
it return 1 even when THP is disabled by sysfs but MADV_COLLAPSE could
still collapse the memory, if such a hint is useful.


>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  Documentation/filesystems/proc.rst | 10 ++++++++--
>  fs/proc/task_mmu.c                 |  2 ++
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 47e95dbc820d..f207903a57a5 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -466,6 +466,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
>      MMUPageSize:           4 kB
>      Locked:                0 kB
>      THPeligible:           0
> +    PMDMappable:           0
>      VmFlags: rd ex mr mw me dw
>
>  The first of these lines shows the same information as is displayed for the
> @@ -518,9 +519,14 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
>  does not take into account swapped out page of underlying shmem objects.
>  "Locked" indicates whether the mapping is locked in memory or not.
>
> +"PMDMappable" indicates if the memory can be mapped by PMDs - 1 if true, 0
> +otherwise.  It just shows the current status. Note that this is memory
> +operable on explicitly by MADV_COLLAPSE.
> +
>  "THPeligible" indicates whether the mapping is eligible for allocating THP
> -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
> -It just shows the current status.
> +pages by the kernel, as well as the THP is PMD mappable or not - 1 if true, 0
> +otherwise. It just shows the current status.  Note this is memory the kernel can
> +transparently provide as THPs.
>
>  "VmFlags" field deserves a separate description. This member represents the
>  kernel flags associated with the particular virtual memory area in two letter
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f8cd58846a28..29f2089456ba 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -867,6 +867,8 @@ static int show_smap(struct seq_file *m, void *v)
>
>         seq_printf(m, "THPeligible:    %d\n",
>                    hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> +       seq_printf(m, "PMDMappable:    %d\n",
> +                  hugepage_vma_check(vma, vma->vm_flags, true, false, false));
>
>         if (arch_pkeys_enabled())
>                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-11 18:29     ` Zach O'Keefe
  2022-07-11 18:45       ` Andrew Morton
@ 2022-07-11 21:51       ` Yang Shi
  1 sibling, 0 replies; 47+ messages in thread
From: Yang Shi @ 2022-07-11 21:51 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Andrew Morton, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jul 11, 2022 at 11:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Fri, Jul 8, 2022 at 2:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed,  6 Jul 2022 16:59:21 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> >
> > > Modularize hugepage collapse by introducing struct collapse_control.
> > > This structure serves to describe the properties of the requested
> > > collapse, as well as serve as a local scratch pad to use during the
> > > collapse itself.
> > >
> > > Start by moving global per-node khugepaged statistics into this
> > > new structure.  Note that this structure is still statically allocated
> > > since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
> > > a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
> > >  1 file changed, 50 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 196eaadbf415..f1ef02d9fe07 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> > >
> > >  #define MAX_PTE_MAPPED_THP 8
> > >
> > > +struct collapse_control {
> > > +     /* Num pages scanned per node */
> > > +     int node_load[MAX_NUMNODES];
> >
> > Does this actually need to be 32-bit?  Looking at the current code I'm
> > suspecting that khugepaged_node_load[] could be a ushort?
> >
> > [And unsigned int would be more appropriate, but we always do that :(]
> >
>
> Hey Andrew,
>
> Thanks for taking the time to review, and good catch - I don't think
> we need 32 bits.
>
> Minimally, we just need to be able to hold the maximum value of
> HPAGE_PMD_NR = 1 << (PMD_SHIFT - PAGE_SHIFT).
>
> I'm not sure what arch/config options (that also use THP) produce the
> minimum/maximum value here. I looked through most of the archs that
> define PMD_SHIFT, and couldn't find an example where we'd need > 16
> bits, with most cases still requiring > 8 bits. All the various
> configs do get complicated though.
>
> Is it acceptable to use u16, with an #error if HPAGE_PMD_ORDER >= 16?

Fine to me.

>
> Thanks,
> Zach


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control
  2022-07-11 18:45       ` Andrew Morton
@ 2022-07-12 14:17         ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 14:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 11 11:45, Andrew Morton wrote:
> On Mon, 11 Jul 2022 11:29:13 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> 
> > On Fri, Jul 8, 2022 at 2:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed,  6 Jul 2022 16:59:21 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> > >
> > > > Modularize hugepage collapse by introducing struct collapse_control.
> > > > This structure serves to describe the properties of the requested
> > > > collapse, as well as serve as a local scratch pad to use during the
> > > > collapse itself.
> > > >
> > > > Start by moving global per-node khugepaged statistics into this
> > > > new structure.  Note that this structure is still statically allocated
> > > > since CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating
> > > > a MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
> > > >  1 file changed, 50 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 196eaadbf415..f1ef02d9fe07 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -85,6 +85,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> > > >
> > > >  #define MAX_PTE_MAPPED_THP 8
> > > >
> > > > +struct collapse_control {
> > > > +     /* Num pages scanned per node */
> > > > +     int node_load[MAX_NUMNODES];
> > >
> > > Does this actually need to be 32-bit?  Looking at the current code I'm
> > > suspecting that khugepaged_node_load[] could be a ushort?
> > >
> > > [And unsigned int would be more appropriate, but we always do that :(]
> > >
> > 
> > Hey Andrew,
> > 
> > Thanks for taking the time to review, and good catch - I don't think
> > we need 32 bits.
> > 
> > Minimally, we just need to be able to hold the maximum value of
> > HPAGE_PMD_NR = 1 << (PMD_SHIFT - PAGE_SHIFT).
> > 
> > I'm not sure what arch/config options (that also use THP) produce the
> > minimum/maximum value here. I looked through most of the archs that
> > define PMD_SHIFT, and couldn't find an example where we'd need > 16
> > bits, with most cases still requiring > 8 bits. All the various
> > configs do get complicated though.
> > 
> > Is it acceptable to use u16, with an #error if HPAGE_PMD_ORDER >= 16?
> 
> It might be ;)
> 
> It was just a thought - perhaps something which you or someone else
> might choose to look at, but I don't think this work needs to be part
> of the current series, unless the current series consumes egregious
> amounts of memory.
> 

I think it makes sense. The reason we moved this struct to kmalloc() was that
MAX_NUMNODES can be pretty large - so we might as well save a few bytes for a
pretty small change. Yang seems good with it, anyways :)
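Roughly, the idea discussed above might look like the following (hypothetical
sketch, not part of this series; whether a preprocessor check like this is
usable depends on how HPAGE_PMD_ORDER is defined for a given config, so a
BUILD_BUG_ON() might be needed instead):

        /* Each entry only ever counts up to HPAGE_PMD_NR scanned pages, so a
         * u16 is wide enough as long as HPAGE_PMD_ORDER < 16. */
        struct collapse_control {
                bool is_khugepaged;

                /* Num pages scanned per node */
                u16 node_load[MAX_NUMNODES];

                /* Last target node selected for allocation */
                int last_target_node;
        };

        #if HPAGE_PMD_ORDER >= 16
        #error "collapse_control node_load entries cannot hold HPAGE_PMD_NR"
        #endif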


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  2022-07-11 21:32   ` Yang Shi
@ 2022-07-12 16:21     ` Zach O'Keefe
  2022-07-12 17:05       ` Yang Shi
  0 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 16:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

Hey Yang,

Thanks for taking the time to review this series again.

On Jul 11 14:32, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add a tracepoint to expose mm, address, and enum scan_result of each
> > hugepage attempted to be collapsed by call to madvise(MADV_COLLAPSE).
> 
> Is this necessary? Isn't the mm_khugepaged_scan_pmd tracepoint good
> enough? It doesn't have "address", but you should be able to calculate
> the address from it by combining it with a syscall trace.
> 

I've also found this useful for debugging along the file path. Perhaps the
answer to that is: add tracepoints to the file path - and we should probably do
that - but the other issue is that turning on these tracepoints (for the
purposes of debugging MADV_COLLAPSE) generates a lot of noise from khugepaged
that is hard to separate out. Augmenting existing tracepoints with
.is_khugepaged data would incur the risks associated with altering an existing
kernel API. WDYT?

Thanks again,
Zach


> 
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  include/trace/events/huge_memory.h | 22 ++++++++++++++++++++++
> >  mm/khugepaged.c                    |  2 ++
> >  2 files changed, 24 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 55392bf30a03..38d339ffdb16 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -167,5 +167,27 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> >                 __entry->ret)
> >  );
> >
> > +TRACE_EVENT(mm_madvise_collapse,
> > +
> > +       TP_PROTO(struct mm_struct *mm, unsigned long addr, int result),
> > +
> > +       TP_ARGS(mm, addr, result),
> > +
> > +       TP_STRUCT__entry(__field(struct mm_struct *, mm)
> > +                        __field(unsigned long, addr)
> > +                        __field(int, result)
> > +       ),
> > +
> > +       TP_fast_assign(__entry->mm = mm;
> > +                      __entry->addr = addr;
> > +                      __entry->result = result;
> > +       ),
> > +
> > +       TP_printk("mm=%p addr=%#lx result=%s",
> > +                 __entry->mm,
> > +                 __entry->addr,
> > +                 __print_symbolic(__entry->result, SCAN_STATUS))
> > +);
> > +
> >  #endif /* __HUGE_MEMORY_H */
> >  #include <trace/define_trace.h>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e0d00180512c..0207fc0a5b2a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2438,6 +2438,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                 if (!mmap_locked)
> >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> >
> > +               trace_mm_madvise_collapse(mm, addr, result);
> > +
> >                 switch (result) {
> >                 case SCAN_SUCCEED:
> >                 case SCAN_PMD_MAPPED:
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-11 21:37   ` Yang Shi
@ 2022-07-12 16:31     ` Zach O'Keefe
  2022-07-12 17:27       ` Yang Shi
  0 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 16:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	jthoughton

On Jul 11 14:37, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add PMDMappable field to smaps output which informs the user if memory
> > in the VMA can be PMD-mapped by MADV_COLLAPSE.
> >
> > The distinction from THPeligible is needed for two reasons:
> >
> > 1) For THP, MADV_COLLAPSE is not coupled to THP sysfs controls, which
> >    THPeligible reports.
> >
> > 2) PMDMappable can also be used in HugeTLB fine-granularity mappings,
> >    which are independent from THP.
> 
> Could you please elaborate on the use case? Does the user check this hint
> before calling MADV_COLLAPSE? Is it really necessary?
> 
> And, TBH it sounds confusing, and we don't have to maintain both
> THPeligible and PMDMappable. We could just relax THPeligible to make
> it return 1 even when THP is disabled by sysfs but MADV_COLLAPSE could
> still collapse the memory, if such a hint is useful.
> 

Hey Yang,

Thanks for taking the time to review this series again, and thanks for
challenging this.

TLDR: "Is it really necessary" - at the moment, no, probably not .. but I think
it's "useful".

Rationale:

1. IMO, I thought it was confusing seeing:

	...
	AnonHugePages:      2048 kB
	ShmemPmdMapped:        0 kB
	FilePmdMapped:         0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	Locked:                0 kB
	THPeligible:    0
	...

Maybe this could simply be clarified in the docs though.  I guess we can already
get:

	...
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	FilePmdMapped:      2048 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	Locked:                0 kB
	THPeligible:    0
	...

today[1], so perhaps it's not a big deal.


2. It was useful for debugging - similar to the rationale for including
THPeligible[2], the logic for determining if a VMA is eligible is pretty
complicated. I.e. is this file mapped suitably? Unlike THPeligible, however,
madvise(2) has the ability to set errno on failure to help diagnose why some
memory isn't being backed.

3. For the immediately-envisioned use cases, the user "knows" what memory
they are acting on. However, eventually we'd like to experiment with moving THP
utilization policy to userspace. Here, it would be useful if the userspace agent
doing the managing was made aware of what memory it should be managing. I don't
have a working prototype of what this would look like yet, however.

4. I thought it was neat that this field could be reused for HugeTLB
fine-granularity mappings - but TBH I'm not sure how useful it'd be there.

I figured relaxing existing THPeligible could break existing users / tests, and
it'd be likewise confusing for them to see THPeligible: 1 but then have faults
fail; they'd then have to go check sysfs settings and VMA flags, and we'd be
back to where we were before commit 7635d9cbe832 ("mm, thp, proc: report THP
eligibility for each vma").

Thanks,
Zach

[1] https://lore.kernel.org/linux-mm/YrxbQGiwml24APCx@google.com/


> 
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  Documentation/filesystems/proc.rst | 10 ++++++++--
> >  fs/proc/task_mmu.c                 |  2 ++
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 47e95dbc820d..f207903a57a5 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -466,6 +466,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
> >      MMUPageSize:           4 kB
> >      Locked:                0 kB
> >      THPeligible:           0
> > +    PMDMappable:           0
> >      VmFlags: rd ex mr mw me dw
> >
> >  The first of these lines shows the same information as is displayed for the
> > @@ -518,9 +519,14 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
> >  does not take into account swapped out page of underlying shmem objects.
> >  "Locked" indicates whether the mapping is locked in memory or not.
> >
> > +"PMDMappable" indicates if the memory can be mapped by PMDs - 1 if true, 0
> > +otherwise.  It just shows the current status. Note that this is memory
> > +operable on explicitly by MADV_COLLAPSE.
> > +
> >  "THPeligible" indicates whether the mapping is eligible for allocating THP
> > -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
> > -It just shows the current status.
> > +pages by the kernel, as well as the THP is PMD mappable or not - 1 if true, 0
> > +otherwise. It just shows the current status.  Note this is memory the kernel can
> > +transparently provide as THPs.
> >
> >  "VmFlags" field deserves a separate description. This member represents the
> >  kernel flags associated with the particular virtual memory area in two letter
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f8cd58846a28..29f2089456ba 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -867,6 +867,8 @@ static int show_smap(struct seq_file *m, void *v)
> >
> >         seq_printf(m, "THPeligible:    %d\n",
> >                    hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> > +       seq_printf(m, "PMDMappable:    %d\n",
> > +                  hugepage_vma_check(vma, vma->vm_flags, true, false, false));
> >
> >         if (arch_pkeys_enabled())
> >                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
  2022-07-11 21:03   ` Yang Shi
@ 2022-07-12 16:50     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 16:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 11 14:03, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > When scanning an anon pmd to see if it's eligible for collapse, return
> > SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
> > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > file-collapse path, since the latter might identify pte-mapped compound
> > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > know what hugepage-aligned/sized regions are already pmd-mapped.
> >
> > In order to determine if a pmd already maps a hugepage, refactor
> > mm_find_pmd():
> >
> > Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
> > fix buggy race with THP fault") behavior.  ksm was the only caller that
> > explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
> > there (pmd_present() and pmd_trans_huge() checks).
> >
> > Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy race
> > with THP fault") that open-coded split_huge_pmd_address() pmd lookup and
> > use mm_find_pmd() instead.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> 

Thanks for taking the time to review!

Zach

> > ---
> >  include/trace/events/huge_memory.h |  1 +
> >  mm/huge_memory.c                   | 18 +--------
> >  mm/internal.h                      |  2 +-
> >  mm/khugepaged.c                    | 60 ++++++++++++++++++++++++------
> >  mm/ksm.c                           | 10 +++++
> >  mm/rmap.c                          | 15 +++-----
> >  6 files changed, 67 insertions(+), 39 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index d651f3437367..55392bf30a03 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -11,6 +11,7 @@
> >         EM( SCAN_FAIL,                  "failed")                       \
> >         EM( SCAN_SUCCEED,               "succeeded")                    \
> >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 4fbe43dc1568..fb76db6c703e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2363,25 +2363,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
> >                 bool freeze, struct folio *folio)
> >  {
> > -       pgd_t *pgd;
> > -       p4d_t *p4d;
> > -       pud_t *pud;
> > -       pmd_t *pmd;
> > +       pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
> >
> > -       pgd = pgd_offset(vma->vm_mm, address);
> > -       if (!pgd_present(*pgd))
> > +       if (!pmd)
> >                 return;
> >
> > -       p4d = p4d_offset(pgd, address);
> > -       if (!p4d_present(*p4d))
> > -               return;
> > -
> > -       pud = pud_offset(p4d, address);
> > -       if (!pud_present(*pud))
> > -               return;
> > -
> > -       pmd = pmd_offset(pud, address);
> > -
> >         __split_huge_pmd(vma, pmd, address, freeze, folio);
> >  }
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6e14749ad1e5..ef8c23fb678f 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -188,7 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> >  /*
> >   * in mm/rmap.c:
> >   */
> > -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >
> >  /*
> >   * in mm/page_alloc.c
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b0e20db3f805..c7a09cc9a0e8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -28,6 +28,7 @@ enum scan_result {
> >         SCAN_FAIL,
> >         SCAN_SUCCEED,
> >         SCAN_PMD_NULL,
> > +       SCAN_PMD_MAPPED,
> >         SCAN_EXCEED_NONE_PTE,
> >         SCAN_EXCEED_SWAP_PTE,
> >         SCAN_EXCEED_SHARED_PTE,
> > @@ -871,6 +872,45 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         return SCAN_SUCCEED;
> >  }
> >
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +                                  unsigned long address,
> > +                                  pmd_t **pmd)
> > +{
> > +       pmd_t pmde;
> > +
> > +       *pmd = mm_find_pmd(mm, address);
> > +       if (!*pmd)
> > +               return SCAN_PMD_NULL;
> > +
> > +       pmde = pmd_read_atomic(*pmd);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +       barrier();
> > +#endif
> > +       if (!pmd_present(pmde))
> > +               return SCAN_PMD_NULL;
> > +       if (pmd_trans_huge(pmde))
> > +               return SCAN_PMD_MAPPED;
> > +       if (pmd_bad(pmde))
> > +               return SCAN_PMD_NULL;
> > +       return SCAN_SUCCEED;
> > +}
> > +
> > +static int check_pmd_still_valid(struct mm_struct *mm,
> > +                                unsigned long address,
> > +                                pmd_t *pmd)
> > +{
> > +       pmd_t *new_pmd;
> > +       int result = find_pmd_or_thp_or_none(mm, address, &new_pmd);
> > +
> > +       if (result != SCAN_SUCCEED)
> > +               return result;
> > +       if (new_pmd != pmd)
> > +               return SCAN_FAIL;
> > +       return SCAN_SUCCEED;
> > +}
> > +
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > @@ -982,9 +1022,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                 goto out_nolock;
> >         }
> >
> > -       pmd = mm_find_pmd(mm, address);
> > -       if (!pmd) {
> > -               result = SCAN_PMD_NULL;
> > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +       if (result != SCAN_SUCCEED) {
> >                 mmap_read_unlock(mm);
> >                 goto out_nolock;
> >         }
> > @@ -1012,7 +1051,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         if (result != SCAN_SUCCEED)
> >                 goto out_up_write;
> >         /* check if the pmd is still valid */
> > -       if (mm_find_pmd(mm, address) != pmd)
> > +       result = check_pmd_still_valid(mm, address, pmd);
> > +       if (result != SCAN_SUCCEED)
> >                 goto out_up_write;
> >
> >         anon_vma_lock_write(vma->anon_vma);
> > @@ -1115,11 +1155,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -       pmd = mm_find_pmd(mm, address);
> > -       if (!pmd) {
> > -               result = SCAN_PMD_NULL;
> > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +       if (result != SCAN_SUCCEED)
> >                 goto out;
> > -       }
> >
> >         memset(cc->node_load, 0, sizeof(cc->node_load));
> >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > @@ -1373,8 +1411,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >         if (!PageHead(hpage))
> >                 goto drop_hpage;
> >
> > -       pmd = mm_find_pmd(mm, haddr);
> > -       if (!pmd)
> > +       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> >                 goto drop_hpage;
> >
> >         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > @@ -1492,8 +1529,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >                 if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> >                         continue;
> >                 mm = vma->vm_mm;
> > -               pmd = mm_find_pmd(mm, addr);
> > -               if (!pmd)
> > +               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> >                         continue;
> >                 /*
> >                  * We need exclusive mmap_lock to retract page table.
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 075123602bd0..3e0a0a42fa1f 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1136,6 +1136,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  {
> >         struct mm_struct *mm = vma->vm_mm;
> >         pmd_t *pmd;
> > +       pmd_t pmde;
> >         pte_t *ptep;
> >         pte_t newpte;
> >         spinlock_t *ptl;
> > @@ -1150,6 +1151,15 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >         pmd = mm_find_pmd(mm, addr);
> >         if (!pmd)
> >                 goto out;
> > +       /*
> > +        * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > +        * without holding anon_vma lock for write.  So when looking for a
> > +        * genuine pmde (in which to find pte), test present and !THP together.
> > +        */
> > +       pmde = *pmd;
> > +       barrier();
> > +       if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > +               goto out;
> >
> >         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
> >                                 addr + PAGE_SIZE);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index edc06c52bc82..af775855e58f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -767,13 +767,17 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> >         return vma_address(page, vma);
> >  }
> >
> > +/*
> > + * Returns the actual pmd_t* where we expect 'address' to be mapped from, or
> > + * NULL if it doesn't exist.  No guarantees / checks on what the pmd_t*
> > + * represents.
> > + */
> >  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >  {
> >         pgd_t *pgd;
> >         p4d_t *p4d;
> >         pud_t *pud;
> >         pmd_t *pmd = NULL;
> > -       pmd_t pmde;
> >
> >         pgd = pgd_offset(mm, address);
> >         if (!pgd_present(*pgd))
> > @@ -788,15 +792,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >                 goto out;
> >
> >         pmd = pmd_offset(pud, address);
> > -       /*
> > -        * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > -        * without holding anon_vma lock for write.  So when looking for a
> > -        * genuine pmde (in which to find pte), test present and !THP together.
> > -        */
> > -       pmde = *pmd;
> > -       barrier();
> > -       if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > -               pmd = NULL;
> >  out:
> >         return pmd;
> >  }
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-07-11 21:22   ` Yang Shi
@ 2022-07-12 16:54     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 16:54 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	linux-api

On Jul 11 14:22, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > This idea was introduced by David Rientjes[1].
> >
> > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > synchronous collapse of memory at their own expense.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the
> >   THP
> > * Avoid unpredictable timing of khugepaged collapse
> >
> > Semantics
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> > multiple VMAs, the semantics of the collapse over each VMA is
> > independent from the others.  This implies a hugepage cannot cross a VMA
> > boundary.  If collapse of a given hugepage-aligned/sized region fails,
> > the operation may continue to attempt collapsing the remainder of memory
> > specified.
> >
> > The memory ranges provided must be page-aligned, but are not required to
> > be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> > start/end of the range will be clamped to the first/last
> > hugepage-aligned address covered by said range.  The memory ranges must
> > span at least one hugepage-sized region.
> >
> > All non-resident pages covered by the range will first be
> > swapped/faulted-in, before being internally copied onto a freshly
> > allocated hugepage.  Unmapped pages will have their data directly
> > initialized to 0 in the new hugepage.  However, for every eligible hugepage
> > aligned/sized region to-be collapsed, at least one page must currently be
> > backed by memory (a PMD covering the address range must already exist).
> >
> > Allocation for the new hugepage may enter direct reclaim and/or
> > compaction, regardless of VMA flags.  When the system has multiple NUMA
> > nodes, the hugepage will be allocated from the node providing the most
> > native pages.  This operation operates on the current state of the
> > specified process and makes no persistent changes or guarantees on how
> > pages will be mapped, constructed, or faulted in the future.
> >
> > Return Value
> >
> > If all hugepage-sized/aligned regions covered by the provided range were
> > either successfully collapsed, or were already PMD-mapped THPs, this
> > operation will be deemed successful.  On success, process_madvise(2)
> > returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
> > is returned and errno is set to indicate the error for the most-recently
> > attempted hugepage collapse.  Note that many failures might have
> > occurred, since the operation may continue to collapse in the event a
> > single hugepage-sized/aligned region fails.
> >
> >         ENOMEM  Memory allocation failed or VMA not found
> >         EBUSY   Memcg charging failed
> >         EAGAIN  Required resource temporarily unavailable.  Trying again
> >                 might succeed.
> >         EINVAL  Other error: No PMD found, subpage doesn't have Present
> >                 bit set, "Special" page not backed by struct page, VMA
> >                 incorrectly sized, address not page-aligned, ...
> >
> > Most notable here are ENOMEM and EBUSY (new to madvise) which are
> > intended to provide the caller with actionable feedback so they may take
> > an appropriate fallback measure.
> 
> Don't forget to update man-pages. And cc'ed linux-api.
>

Thanks for the review, Yang. Yes, I have plans to update the man-pages once
things are fully ironed out. Also, thanks for the tip to cc linux-api - I did
not know about that.

Best,
Zach

> >
> > Use Cases
> >
> > An immediate user of this new functionality are malloc() implementations
> > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > zapping the pmd.  Later, when the memory is hot, the implementation
> > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > hugepage coverage and dTLB performance.  TCMalloc is such an
> > implementation that could benefit from this[2].
> >
> > Only privately-mapped anon memory is supported for now, but additional
> > support for file, shmem, and HugeTLB high-granularity mappings[2] is
> > expected.  File and tmpfs/shmem support would permit:
> >
> > * Backing executable text by THPs.  Current support provided by
> >   CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which
> >   might keep services from serving at their full rated load after
> >   (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> >   immediately realize iTLB performance prevent page sharing and demand
> >   paging, both of which increase steady state memory footprint.  With
> >   MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> >   and lower RAM footprints.
> > * Backing guest memory by hugepages after the memory contents have been
> >   migrated in native-page-sized chunks to a new host, in a
> >   userfaultfd-based live-migration stack.
> >
> > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> >
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> 
> > ---
> >  arch/alpha/include/uapi/asm/mman.h           |   2 +
> >  arch/mips/include/uapi/asm/mman.h            |   2 +
> >  arch/parisc/include/uapi/asm/mman.h          |   2 +
> >  arch/xtensa/include/uapi/asm/mman.h          |   2 +
> >  include/linux/huge_mm.h                      |  14 ++-
> >  include/uapi/asm-generic/mman-common.h       |   2 +
> >  mm/khugepaged.c                              | 118 ++++++++++++++++++-
> >  mm/madvise.c                                 |   5 +
> >  tools/include/uapi/asm-generic/mman-common.h |   2 +
> >  9 files changed, 146 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 4aa996423b0d..763929e814e9 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -76,6 +76,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index 1be428663c10..c6e1fc77c996 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -103,6 +103,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index a7ea3204a5fa..22133a6a506e 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -70,6 +70,8 @@
> >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> >
> > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > +
> >  #define MADV_HWPOISON     100          /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 7966a58af472..1ff0c858544f 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -111,6 +111,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 00312fc251c1..39193623442e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -218,6 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >
> >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +                    struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -361,9 +364,16 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> >  static inline int hugepage_madvise(struct vm_area_struct *vma,
> >                                    unsigned long *vm_flags, int advice)
> >  {
> > -       BUG();
> > -       return 0;
> > +       return -EINVAL;
> >  }
> > +
> > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > +                                  struct vm_area_struct **prev,
> > +                                  unsigned long start, unsigned long end)
> > +{
> > +       return -EINVAL;
> > +}
> > +
> >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >                                          unsigned long start,
> >                                          unsigned long end,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c7a09cc9a0e8..2b2d832e44f2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -976,7 +976,8 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> >                               struct collapse_control *cc)
> >  {
> >         /* Only allocate from the target node */
> > -       gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > +       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > +                    GFP_TRANSHUGE) | __GFP_THISNODE;
> >         int node = khugepaged_find_target_node(cc);
> >
> >         if (!khugepaged_alloc_page(hpage, gfp, node))
> > @@ -2356,3 +2357,118 @@ void khugepaged_min_free_kbytes_update(void)
> >                 set_recommended_min_free_kbytes();
> >         mutex_unlock(&khugepaged_mutex);
> >  }
> > +
> > +static int madvise_collapse_errno(enum scan_result r)
> > +{
> > +       /*
> > +        * MADV_COLLAPSE breaks from existing madvise(2) conventions to provide
> > +        * actionable feedback to caller, so they may take an appropriate
> > +        * fallback measure depending on the nature of the failure.
> > +        */
> > +       switch (r) {
> > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +               return -ENOMEM;
> > +       case SCAN_CGROUP_CHARGE_FAIL:
> > +               return -EBUSY;
> > +       /* Resource temporarily unavailable - trying again might succeed */
> > +       case SCAN_PAGE_LOCK:
> > +       case SCAN_PAGE_LRU:
> > +               return -EAGAIN;
> > +       /*
> > +        * Other: Trying again likely not to succeed / error intrinsic to
> > +        * specified memory range. khugepaged likely won't be able to collapse
> > +        * either.
> > +        */
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end)
> > +{
> > +       struct collapse_control *cc;
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long hstart, hend, addr;
> > +       int thps = 0, last_fail = SCAN_FAIL;
> > +       bool mmap_locked = true;
> > +
> > +       BUG_ON(vma->vm_start > start);
> > +       BUG_ON(vma->vm_end < end);
> > +
> > +       cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> > +       if (!cc)
> > +               return -ENOMEM;
> > +       cc->is_khugepaged = false;
> > +       cc->last_target_node = NUMA_NO_NODE;
> > +
> > +       *prev = vma;
> > +
> > +       /* TODO: Support file/shmem */
> > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > +               return -EINVAL;
> > +
> > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > +       hend = end & HPAGE_PMD_MASK;
> > +
> > +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > +               return -EINVAL;
> > +
> > +       mmgrab(mm);
> > +       lru_add_drain_all();
> > +
> > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > +               int result = SCAN_FAIL;
> > +
> > +               if (!mmap_locked) {
> > +                       cond_resched();
> > +                       mmap_read_lock(mm);
> > +                       mmap_locked = true;
> > +                       result = hugepage_vma_revalidate(mm, addr, &vma, cc);
> > +                       if (result  != SCAN_SUCCEED) {
> > +                               last_fail = result;
> > +                               goto out_nolock;
> > +                       }
> > +               }
> > +               mmap_assert_locked(mm);
> > +               memset(cc->node_load, 0, sizeof(cc->node_load));
> > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> > +               if (!mmap_locked)
> > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > +
> > +               switch (result) {
> > +               case SCAN_SUCCEED:
> > +               case SCAN_PMD_MAPPED:
> > +                       ++thps;
> > +                       break;
> > +               /* Whitelisted set of results where continuing OK */
> > +               case SCAN_PMD_NULL:
> > +               case SCAN_PTE_NON_PRESENT:
> > +               case SCAN_PTE_UFFD_WP:
> > +               case SCAN_PAGE_RO:
> > +               case SCAN_LACK_REFERENCED_PAGE:
> > +               case SCAN_PAGE_NULL:
> > +               case SCAN_PAGE_COUNT:
> > +               case SCAN_PAGE_LOCK:
> > +               case SCAN_PAGE_COMPOUND:
> > +               case SCAN_PAGE_LRU:
> > +                       last_fail = result;
> > +                       break;
> > +               default:
> > +                       last_fail = result;
> > +                       /* Other error, exit */
> > +                       goto out_maybelock;
> > +               }
> > +       }
> > +
> > +out_maybelock:
> > +       /* Caller expects us to hold mmap_lock on return */
> > +       if (!mmap_locked)
> > +               mmap_read_lock(mm);
> > +out_nolock:
> > +       mmap_assert_locked(mm);
> > +       mmdrop(mm);
> > +
> > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > +                       : madvise_collapse_errno(last_fail);
> > +}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 851fa4e134bc..9f08e958ea86 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_FREE:
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> > +       case MADV_COLLAPSE:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_COLLAPSE:
> > +               return madvise_collapse(vma, prev, start, end);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> > +       case MADV_COLLAPSE:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
  2022-07-11 20:57   ` Yang Shi
@ 2022-07-12 16:58     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 16:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 11 13:57, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
> >
> > hugepage_vma_check() is the authority on determining if a VMA is eligible
> > for THP allocation/collapse, and currently enforces the sysfs THP settings.
> > Add a flag to disable these checks.  For now, only apply this arg to anon
> > and file, which use /sys/kernel/mm/transparent_hugepage/enabled.  We can
> > expand this to shmem, which uses
> > /sys/kernel/mm/transparent_hugepage/shmem_enabled, later.
> >
> > Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
> > passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
> > VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check
> > THP flag in hugepage_vma_check()", this check also didn't check "never" THP
> > mode.  As such, this restores the previous behavior of
> > collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
> > comment in code for justification why this is OK.
> >
> > [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>

Thanks for the review!

Best,
Zach

> > ---
> >  fs/proc/task_mmu.c      |  2 +-
> >  include/linux/huge_mm.h |  9 ++++-----
> >  mm/huge_memory.c        | 14 ++++++--------
> >  mm/khugepaged.c         | 25 ++++++++++++++-----------
> >  mm/memory.c             |  4 ++--
> >  5 files changed, 27 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 34d292cec79a..f8cd58846a28 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -866,7 +866,7 @@ static int show_smap(struct seq_file *m, void *v)
> >         __show_smap(m, &mss, false);
> >
> >         seq_printf(m, "THPeligible:    %d\n",
> > -                  hugepage_vma_check(vma, vma->vm_flags, true, false));
> > +                  hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> >
> >         if (arch_pkeys_enabled())
> >                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 37f2f11a6d7e..00312fc251c1 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -168,9 +168,8 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >                !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> >  }
> >
> > -bool hugepage_vma_check(struct vm_area_struct *vma,
> > -                       unsigned long vm_flags,
> > -                       bool smaps, bool in_pf);
> > +bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> > +                       bool smaps, bool in_pf, bool enforce_sysfs);
> >
> >  #define transparent_hugepage_use_zero_page()                           \
> >         (transparent_hugepage_flags &                                   \
> > @@ -321,8 +320,8 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >  }
> >
> >  static inline bool hugepage_vma_check(struct vm_area_struct *vma,
> > -                                      unsigned long vm_flags,
> > -                                      bool smaps, bool in_pf)
> > +                                     unsigned long vm_flags, bool smaps,
> > +                                     bool in_pf, bool enforce_sysfs)
> >  {
> >         return false;
> >  }
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index da300ce9dedb..4fbe43dc1568 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -69,9 +69,8 @@ static atomic_t huge_zero_refcount;
> >  struct page *huge_zero_page __read_mostly;
> >  unsigned long huge_zero_pfn __read_mostly = ~0UL;
> >
> > -bool hugepage_vma_check(struct vm_area_struct *vma,
> > -                       unsigned long vm_flags,
> > -                       bool smaps, bool in_pf)
> > +bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> > +                       bool smaps, bool in_pf, bool enforce_sysfs)
> >  {
> >         if (!vma->vm_mm)                /* vdso */
> >                 return false;
> > @@ -120,11 +119,10 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
> >         if (!in_pf && shmem_file(vma->vm_file))
> >                 return shmem_huge_enabled(vma);
> >
> > -       if (!hugepage_flags_enabled())
> > -               return false;
> > -
> > -       /* THP settings require madvise. */
> > -       if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always())
> > +       /* Enforce sysfs THP requirements as necessary */
> > +       if (enforce_sysfs &&
> > +           (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
> > +                                          !hugepage_flags_always())))
> >                 return false;
> >
> >         /* Only regular file is valid */
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d89056d8cbad..b0e20db3f805 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -478,7 +478,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> >  {
> >         if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> >             hugepage_flags_enabled()) {
> > -               if (hugepage_vma_check(vma, vm_flags, false, false))
> > +               if (hugepage_vma_check(vma, vm_flags, false, false, true))
> >                         __khugepaged_enter(vma->vm_mm);
> >         }
> >  }
> > @@ -844,7 +844,8 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> >   */
> >
> >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > -               struct vm_area_struct **vmap)
> > +                                  struct vm_area_struct **vmap,
> > +                                  struct collapse_control *cc)
> >  {
> >         struct vm_area_struct *vma;
> >
> > @@ -855,7 +856,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         if (!vma)
> >                 return SCAN_VMA_NULL;
> >
> > -       if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
> > +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> > +                               cc->is_khugepaged))
> >                 return SCAN_VMA_CHECK;
> >         /*
> >          * Anon VMA expected, the address may be unmapped then
> > @@ -974,7 +976,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                 goto out_nolock;
> >
> >         mmap_read_lock(mm);
> > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> >         if (result != SCAN_SUCCEED) {
> >                 mmap_read_unlock(mm);
> >                 goto out_nolock;
> > @@ -1006,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          * handled by the anon_vma lock + PG_lock.
> >          */
> >         mmap_write_lock(mm);
> > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out_up_write;
> >         /* check if the pmd is still valid */
> > @@ -1350,12 +1352,13 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >                 return;
> >
> >         /*
> > -        * This vm_flags may not have VM_HUGEPAGE if the page was not
> > -        * collapsed by this mm. But we can still collapse if the page is
> > -        * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
> > -        * will not fail the vma for missing VM_HUGEPAGE
> > +        * If we are here, we've succeeded in replacing all the native pages
> > +        * in the page cache with a single hugepage. If a mm were to fault-in
> > +        * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
> > +        * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> > +        * analogously elide sysfs THP settings here.
> >          */
> > -       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE, false, false))
> > +       if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> >                 return;
> >
> >         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > @@ -2042,7 +2045,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                         progress++;
> >                         break;
> >                 }
> > -               if (!hugepage_vma_check(vma, vma->vm_flags, false, false)) {
> > +               if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
> >  skip:
> >                         progress++;
> >                         continue;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8917bea2f0bc..96cd776e84f1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5001,7 +5001,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >                 return VM_FAULT_OOM;
> >  retry_pud:
> >         if (pud_none(*vmf.pud) &&
> > -           hugepage_vma_check(vma, vm_flags, false, true)) {
> > +           hugepage_vma_check(vma, vm_flags, false, true, true)) {
> >                 ret = create_huge_pud(&vmf);
> >                 if (!(ret & VM_FAULT_FALLBACK))
> >                         return ret;
> > @@ -5035,7 +5035,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >                 goto retry_pud;
> >
> >         if (pmd_none(*vmf.pmd) &&
> > -           hugepage_vma_check(vma, vm_flags, false, true)) {
> > +           hugepage_vma_check(vma, vm_flags, false, true, true)) {
> >                 ret = create_huge_pmd(&vmf);
> >                 if (!(ret & VM_FAULT_FALLBACK))
> >                         return ret;
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  2022-07-12 16:21     ` Zach O'Keefe
@ 2022-07-12 17:05       ` Yang Shi
  2022-07-12 17:30         ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-12 17:05 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jul 12, 2022 at 9:21 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Hey Yang,
>
> Thanks for taking the time to review this series again.
>
> On Jul 11 14:32, Yang Shi wrote:
> > On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > Add a tracepoint to expose mm, address, and enum scan_result of each
> > > hugepage attempted to be collapsed by call to madvise(MADV_COLLAPSE).
> >
> > Is this necessary? Isn't the mm_khugepaged_scan_pmd tracepoint good
> > enough? It doesn't have "address", but you should be able to calculate
> > the address from it together with a syscall trace.
> >
>
> I've also found this useful for debugging along the file path. Perhaps the
> answer to that is: add tracepoints to the file path - and we should probably
> do that - but the other issue is that turning on these tracepoints (for the
> purposes of debugging MADV_COLLAPSE) generates a lot of noise from khugepaged
> that is hard to separate out. Augmenting existing tracepoints with
> .is_khugepaged data incurs the risks associated with altering an existing
> kernel API. WDYT?

Doesn't ftrace show process comm and ID? And I think you also could
trace the specific processes, right?

>
> Thanks again,
> Zach
>
>
> >
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  include/trace/events/huge_memory.h | 22 ++++++++++++++++++++++
> > >  mm/khugepaged.c                    |  2 ++
> > >  2 files changed, 24 insertions(+)
> > >
> > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > index 55392bf30a03..38d339ffdb16 100644
> > > --- a/include/trace/events/huge_memory.h
> > > +++ b/include/trace/events/huge_memory.h
> > > @@ -167,5 +167,27 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> > >                 __entry->ret)
> > >  );
> > >
> > > +TRACE_EVENT(mm_madvise_collapse,
> > > +
> > > +       TP_PROTO(struct mm_struct *mm, unsigned long addr, int result),
> > > +
> > > +       TP_ARGS(mm, addr, result),
> > > +
> > > +       TP_STRUCT__entry(__field(struct mm_struct *, mm)
> > > +                        __field(unsigned long, addr)
> > > +                        __field(int, result)
> > > +       ),
> > > +
> > > +       TP_fast_assign(__entry->mm = mm;
> > > +                      __entry->addr = addr;
> > > +                      __entry->result = result;
> > > +       ),
> > > +
> > > +       TP_printk("mm=%p addr=%#lx result=%s",
> > > +                 __entry->mm,
> > > +                 __entry->addr,
> > > +                 __print_symbolic(__entry->result, SCAN_STATUS))
> > > +);
> > > +
> > >  #endif /* __HUGE_MEMORY_H */
> > >  #include <trace/define_trace.h>
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index e0d00180512c..0207fc0a5b2a 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2438,6 +2438,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >                 if (!mmap_locked)
> > >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > >
> > > +               trace_mm_madvise_collapse(mm, addr, result);
> > > +
> > >                 switch (result) {
> > >                 case SCAN_SUCCEED:
> > >                 case SCAN_PMD_MAPPED:
> > > --
> > > 2.37.0.rc0.161.g10f37bed90-goog
> > >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior
  2022-07-11 20:43   ` Yang Shi
@ 2022-07-12 17:06     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 17:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 11 13:43, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add .is_khugepaged flag to struct collapse_control so
> > khugepaged-specific behavior can be elided by MADV_COLLAPSE context.
> >
> > Start by protecting khugepaged-specific heuristics by this flag. In
> > MADV_COLLAPSE, the user presumably has reason to believe the collapse
> > will be beneficial and khugepaged heuristics shouldn't prevent the user
> > from doing so:
> >
> > 1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
> >
> > 2) requirement that some pages in region being collapsed be young or
> >    referenced
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >
> > v6 -> v7: There is no functional change here from v6, just a renaming of
> >           flags to explicitly be predicated on khugepaged.
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> 
> Just a nit: some conditions check is_khugepaged first, some don't. Why
> not make them consistent and check is_khugepaged first?
>

Again, thank you for taking the time to review. Agreed, the inconsistency is
ugly, and I've updated the code to consistently check is_khugepaged first.
Thanks for the suggestion.

Zach

> > ---
> >  mm/khugepaged.c | 62 ++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 43 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 147f5828f052..d89056d8cbad 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -73,6 +73,8 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
> >   * default collapse hugepages if there is at least one pte mapped like
> >   * it would have happened if the vma was large enough during page
> >   * fault.
> > + *
> > + * Note that these are only respected if collapse was initiated by khugepaged.
> >   */
> >  static unsigned int khugepaged_max_ptes_none __read_mostly;
> >  static unsigned int khugepaged_max_ptes_swap __read_mostly;
> > @@ -86,6 +88,8 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> >  #define MAX_PTE_MAPPED_THP 8
> >
> >  struct collapse_control {
> > +       bool is_khugepaged;
> > +
> >         /* Num pages scanned per node */
> >         int node_load[MAX_NUMNODES];
> >
> > @@ -554,6 +558,7 @@ static bool is_refcount_suitable(struct page *page)
> >  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                                         unsigned long address,
> >                                         pte_t *pte,
> > +                                       struct collapse_control *cc,
> >                                         struct list_head *compound_pagelist)
> >  {
> >         struct page *page = NULL;
> > @@ -567,7 +572,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 if (pte_none(pteval) || (pte_present(pteval) &&
> >                                 is_zero_pfn(pte_pfn(pteval)))) {
> >                         if (!userfaultfd_armed(vma) &&
> > -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> > +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> > +                            !cc->is_khugepaged)) {
> >                                 continue;
> >                         } else {
> >                                 result = SCAN_EXCEED_NONE_PTE;
> > @@ -587,8 +593,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >
> >                 VM_BUG_ON_PAGE(!PageAnon(page), page);
> >
> > -               if (page_mapcount(page) > 1 &&
> > -                               ++shared > khugepaged_max_ptes_shared) {
> > +               if (cc->is_khugepaged && page_mapcount(page) > 1 &&
> > +                   ++shared > khugepaged_max_ptes_shared) {
> >                         result = SCAN_EXCEED_SHARED_PTE;
> >                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >                         goto out;
> > @@ -654,10 +660,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 if (PageCompound(page))
> >                         list_add_tail(&page->lru, compound_pagelist);
> >  next:
> > -               /* There should be enough young pte to collapse the page */
> > -               if (pte_young(pteval) ||
> > -                   page_is_young(page) || PageReferenced(page) ||
> > -                   mmu_notifier_test_young(vma->vm_mm, address))
> > +               /*
> > +                * If collapse was initiated by khugepaged, check that there is
> > +                * enough young pte to justify collapsing the page
> > +                */
> > +               if (cc->is_khugepaged &&
> > +                   (pte_young(pteval) || page_is_young(page) ||
> > +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> > +                                                                    address)))
> >                         referenced++;
> >
> >                 if (pte_write(pteval))
> > @@ -666,7 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >
> >         if (unlikely(!writable)) {
> >                 result = SCAN_PAGE_RO;
> > -       } else if (unlikely(!referenced)) {
> > +       } else if (unlikely(cc->is_khugepaged && !referenced)) {
> >                 result = SCAN_LACK_REFERENCED_PAGE;
> >         } else {
> >                 result = SCAN_SUCCEED;
> > @@ -745,6 +755,7 @@ static void khugepaged_alloc_sleep(void)
> >
> >
> >  struct collapse_control khugepaged_collapse_control = {
> > +       .is_khugepaged = true,
> >         .last_target_node = NUMA_NO_NODE,
> >  };
> >
> > @@ -1023,7 +1034,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         mmu_notifier_invalidate_range_end(&range);
> >
> >         spin_lock(pte_ptl);
> > -       result =  __collapse_huge_page_isolate(vma, address, pte,
> > +       result =  __collapse_huge_page_isolate(vma, address, pte, cc,
> >                                                &compound_pagelist);
> >         spin_unlock(pte_ptl);
> >
> > @@ -1114,7 +1125,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >              _pte++, _address += PAGE_SIZE) {
> >                 pte_t pteval = *_pte;
> >                 if (is_swap_pte(pteval)) {
> > -                       if (++unmapped <= khugepaged_max_ptes_swap) {
> > +                       if (++unmapped <= khugepaged_max_ptes_swap ||
> > +                           !cc->is_khugepaged) {
> >                                 /*
> >                                  * Always be strict with uffd-wp
> >                                  * enabled swap entries.  Please see
> > @@ -1133,7 +1145,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                 }
> >                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >                         if (!userfaultfd_armed(vma) &&
> > -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> > +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> > +                            !cc->is_khugepaged)) {
> >                                 continue;
> >                         } else {
> >                                 result = SCAN_EXCEED_NONE_PTE;
> > @@ -1163,8 +1176,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                         goto out_unmap;
> >                 }
> >
> > -               if (page_mapcount(page) > 1 &&
> > -                               ++shared > khugepaged_max_ptes_shared) {
> > +               if (cc->is_khugepaged &&
> > +                   page_mapcount(page) > 1 &&
> > +                   ++shared > khugepaged_max_ptes_shared) {
> >                         result = SCAN_EXCEED_SHARED_PTE;
> >                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >                         goto out_unmap;
> > @@ -1218,14 +1232,22 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                         result = SCAN_PAGE_COUNT;
> >                         goto out_unmap;
> >                 }
> > -               if (pte_young(pteval) ||
> > -                   page_is_young(page) || PageReferenced(page) ||
> > -                   mmu_notifier_test_young(vma->vm_mm, address))
> > +
> > +               /*
> > +                * If collapse was initiated by khugepaged, check that there is
> > +                * enough young pte to justify collapsing the page
> > +                */
> > +               if (cc->is_khugepaged &&
> > +                   (pte_young(pteval) || page_is_young(page) ||
> > +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> > +                                                                    address)))
> >                         referenced++;
> >         }
> >         if (!writable) {
> >                 result = SCAN_PAGE_RO;
> > -       } else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
> > +       } else if (cc->is_khugepaged &&
> > +                  (!referenced ||
> > +                   (unmapped && referenced < HPAGE_PMD_NR / 2))) {
> >                 result = SCAN_LACK_REFERENCED_PAGE;
> >         } else {
> >                 result = SCAN_SUCCEED;
> > @@ -1894,7 +1916,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                         continue;
> >
> >                 if (xa_is_value(page)) {
> > -                       if (++swap > khugepaged_max_ptes_swap) {
> > +                       if (cc->is_khugepaged &&
> > +                           ++swap > khugepaged_max_ptes_swap) {
> >                                 result = SCAN_EXCEED_SWAP_PTE;
> >                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> >                                 break;
> > @@ -1945,7 +1968,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >         rcu_read_unlock();
> >
> >         if (result == SCAN_SUCCEED) {
> > -               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> > +               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
> > +                   cc->is_khugepaged) {
> >                         result = SCAN_EXCEED_NONE_PTE;
> >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                 } else {
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check
  2022-07-11 20:38   ` Yang Shi
@ 2022-07-12 17:14     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 17:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 11 13:38, Yang Shi wrote:
> On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > transhuge_vma_suitable() is called twice in hugepage_vma_revalidate()
> > path.  Remove the first check, and rely on the second check inside
> > hugepage_vma_check().
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index cfe231c5958f..5269d15e20f6 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -918,8 +918,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         if (!vma)
> >                 return SCAN_VMA_NULL;
> >
> > -       if (!transhuge_vma_suitable(vma, address))
> > -               return SCAN_ADDRESS_RANGE;
> 
> It seems this is the only user of SCAN_ADDRESS_RANGE, so
> SCAN_ADDRESS_RANGE could be deleted as well.
>

Good catch! Was able to remove this.

Thanks again,
Zach

> >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false))
> >                 return SCAN_VMA_CHECK;
> >         /*
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-12 16:31     ` Zach O'Keefe
@ 2022-07-12 17:27       ` Yang Shi
  2022-07-12 17:57         ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Shi @ 2022-07-12 17:27 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	James Houghton

On Tue, Jul 12, 2022 at 9:31 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Jul 11 14:37, Yang Shi wrote:
> > On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > Add PMDMappable field to smaps output which informs the user if memory
> > > in the VMA can be PMD-mapped by MADV_COLLAPSE.
> > >
> > > The distinction from THPeligible is needed for two reasons:
> > >
> > > 1) For THP, MADV_COLLAPSE is not coupled to THP sysfs controls, which
> > >    THPeligible reports.
> > >
> > > 2) PMDMappable can also be used in HugeTLB fine-granularity mappings,
> > >    which are independent from THP.
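
A sketch of how userspace might consume the new field - PMDMappable is the
name proposed by this patch, and the parsing below is deliberately simplified:
list the mappings of the current process that MADV_COLLAPSE could act on.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512], range[64] = "?";
	unsigned long start, end;
	int val;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* VMA header lines begin with the "start-end" address range */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			snprintf(range, sizeof(range), "%lx-%lx", start, end);
		else if (sscanf(line, "PMDMappable: %d", &val) == 1 && val)
			printf("%s can be collapsed via MADV_COLLAPSE\n", range);
	}
	fclose(f);
	return 0;
}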
> >
> > Could you please elaborate the usecase? The user checks this hint
> > before calling MADV_COLLAPSE? Is it really necessary?
> >
> > And, TBH it sounds confusing and we don't have to maintain both
> > THPeligible and PMDMappable. We could just relax THPeligible to make
> > it return 1 even though THP is disabled by sysfs but MADV_COLLAPSE
> > could collapse it if such hint is useful.
> >
>
> Hey Yang,
>
> Thanks for taking the time to review this series again, and thanks for
> challenging this.
>
> TLDR: "Is it really necessary" - at the moment, no, probably not .. but I think
> it's "useful".
>
> Rationale:
>
1. IMO, I thought it was confusing seeing:
>
>         ...
>         AnonHugePages:      2048 kB
>         ShmemPmdMapped:        0 kB
>         FilePmdMapped:         0 kB
>         Shared_Hugetlb:        0 kB
>         Private_Hugetlb:       0 kB
>         Swap:                  0 kB
>         SwapPss:               0 kB
>         Locked:                0 kB
>         THPeligible:    0
>         ...
>
> Maybe this could simply be clarified in the docs though.  I guess we can already
> get:
>
>         ...
>         AnonHugePages:         0 kB
>         ShmemPmdMapped:        0 kB
>         FilePmdMapped:      2048 kB
>         Shared_Hugetlb:        0 kB
>         Private_Hugetlb:       0 kB
>         Swap:                  0 kB
>         SwapPss:               0 kB
>         Locked:                0 kB
>         THPeligible:    0
>         ...
>
> today[1], so perhaps it's not a big deal.

Not only that, if you have file PMD mapped then turn the THP sysfs
flag off, you get the same result. It is just a hint and just shows
the status at that moment when reading smaps.

>
>
> > 2. It was useful for debugging - similar to the rationale for including
> > THPeligible[2], the logic for determining if a VMA is eligible is pretty
> > complicated. I.e. is this file mapped suitably? Unlike THPeligible, however,
> > madvise(2) has the ability to set errno on failure to help* diagnose why some
> > memory isn't being backed.

I don't disagree it would help for debugging. But as a user who
doesn't know too much about kernel internals, when I see THPeligible
and PMDmappable, I would get confused TBH. And do we have to maintain
another similar hint? Maybe not.

>
> > 3. For the immediately-envisioned usecases, the user "knows" about what memory
> > they are acting on. However, eventually we'd like to experiment with moving THP
> > utilization policy to userspace. Here, it would be useful if the managing
> > userspace agent were made aware of what memory it should be managing. I don't
> > have a working prototype of what this would look like yet, however.

> It is not a strong justification to add some user-visible stuff for a
> future feature (not even prototyped) since things may change; it
> sounds safer to add such things once the usecase is solid TBH.

>
> 4. I thought it was neat that this field could be reused for HugeTLB
> fine-granularity mappings - but TBH I'm not sure how useful it'd be there.
>
> I figured relaxing existing THPeligible could break existing users / tests, and
> it'd be likewise confusing for them to see THPeligible: 1, but then have faults
> fail and they'd then have to go check sysfs settings and vma flags ; we'd be
> back in pre-commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
> vma").

I'm not sure what applications rely on this hint, but if they are just
some test scripts, I think it should be fine. I don't think we
guarantee the test scripts won't get broken. AFAIK some test scripts
rely on the kernel dmesg text, for example, OOMs. And the meaning of
the fields do change, for example, inactive anon of /proc/meminfo,
which was changed by the patchset which put anon pages on inactive
list first instead of active list. We already noticed the abnormal
value from our monitoring tool when we adopted 5.10+ kernel. And
/proc/vmstat also had some fields renamed, for example,
workingset_refault of /proc/vmstat, it was split to
workseting_refault_anon and workingset_refault_file, so we had to
update our monitoring scripts accordingly. I think /proc/meminfo and
/proc/vmstat are more heavily used than smaps.

>
> Thanks,
> Zach
>
> [1] https://lore.kernel.org/linux-mm/YrxbQGiwml24APCx@google.com/
>
>
> >
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  Documentation/filesystems/proc.rst | 10 ++++++++--
> > >  fs/proc/task_mmu.c                 |  2 ++
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 47e95dbc820d..f207903a57a5 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -466,6 +466,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
> > >      MMUPageSize:           4 kB
> > >      Locked:                0 kB
> > >      THPeligible:           0
> > > +    PMDMappable:           0
> > >      VmFlags: rd ex mr mw me dw
> > >
> > >  The first of these lines shows the same information as is displayed for the
> > > @@ -518,9 +519,14 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
> > >  does not take into account swapped out page of underlying shmem objects.
> > >  "Locked" indicates whether the mapping is locked in memory or not.
> > >
> > > +"PMDMappable" indicates if the memory can be mapped by PMDs - 1 if true, 0
> > > +otherwise.  It just shows the current status. Note that this is memory
> > > +operable on explicitly by MADV_COLLAPSE.
> > > +
> > >  "THPeligible" indicates whether the mapping is eligible for allocating THP
> > > -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
> > > -It just shows the current status.
> > > +pages by the kernel, as well as the THP is PMD mappable or not - 1 if true, 0
> > > +otherwise. It just shows the current status.  Note this is memory the kernel can
> > > +transparently provide as THPs.
> > >
> > >  "VmFlags" field deserves a separate description. This member represents the
> > >  kernel flags associated with the particular virtual memory area in two letter
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index f8cd58846a28..29f2089456ba 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -867,6 +867,8 @@ static int show_smap(struct seq_file *m, void *v)
> > >
> > >         seq_printf(m, "THPeligible:    %d\n",
> > >                    hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> > > +       seq_printf(m, "PMDMappable:    %d\n",
> > > +                  hugepage_vma_check(vma, vma->vm_flags, true, false, false));
> > >
> > >         if (arch_pkeys_enabled())
> > >                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> > > --
> > > 2.37.0.rc0.161.g10f37bed90-goog
> > >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint
  2022-07-12 17:05       ` Yang Shi
@ 2022-07-12 17:30         ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 17:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 12 10:05, Yang Shi wrote:
> On Tue, Jul 12, 2022 at 9:21 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Hey Yang,
> >
> > Thanks for taking the time to review this series again.
> >
> > On Jul 11 14:32, Yang Shi wrote:
> > > On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > Add a tracepoint to expose mm, address, and enum scan_result of each
> > > > hugepage attempted to be collapsed by call to madvise(MADV_COLLAPSE).
> > >
> > > Is this necessary? Isn't the mm_khugepaged_scan_pmd tracepoint good
> > > enough? It doesn't have "address", but you should be able to calculate
> > > the address from it together with a syscall trace.
> > >
> >
> > I've also found this useful for debugging along the file path. Perhaps the
> > answer to that is: add tracepoints to the file path - and we should probably
> > do that - but the other issue is that turning on these tracepoints (for the
> > purposes of debugging MADV_COLLAPSE) generates a lot of noise from khugepaged
> > that is hard to separate out. Augmenting existing tracepoints with
> > .is_khugepaged data incurs the risks associated with altering an existing
> > kernel API. WDYT?
> 
> Doesn't ftrace show process comm and ID? And I think you also could
> trace the specific processes, right?
>

That's true enough - it had been a while since I actually did that; I've been
carrying some printk's for debugging that I eventually converted into a
tracepoint here. Sorry about that.

I'll drop this and add a relevant tracepoint to the file collapse path that
will benefit khugepaged too.

Thanks again,
Zach

> >
> > Thanks again,
> > Zach
> >
> >
> > >
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  include/trace/events/huge_memory.h | 22 ++++++++++++++++++++++
> > > >  mm/khugepaged.c                    |  2 ++
> > > >  2 files changed, 24 insertions(+)
> > > >
> > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > index 55392bf30a03..38d339ffdb16 100644
> > > > --- a/include/trace/events/huge_memory.h
> > > > +++ b/include/trace/events/huge_memory.h
> > > > @@ -167,5 +167,27 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> > > >                 __entry->ret)
> > > >  );
> > > >
> > > > +TRACE_EVENT(mm_madvise_collapse,
> > > > +
> > > > +       TP_PROTO(struct mm_struct *mm, unsigned long addr, int result),
> > > > +
> > > > +       TP_ARGS(mm, addr, result),
> > > > +
> > > > +       TP_STRUCT__entry(__field(struct mm_struct *, mm)
> > > > +                        __field(unsigned long, addr)
> > > > +                        __field(int, result)
> > > > +       ),
> > > > +
> > > > +       TP_fast_assign(__entry->mm = mm;
> > > > +                      __entry->addr = addr;
> > > > +                      __entry->result = result;
> > > > +       ),
> > > > +
> > > > +       TP_printk("mm=%p addr=%#lx result=%s",
> > > > +                 __entry->mm,
> > > > +                 __entry->addr,
> > > > +                 __print_symbolic(__entry->result, SCAN_STATUS))
> > > > +);
> > > > +
> > > >  #endif /* __HUGE_MEMORY_H */
> > > >  #include <trace/define_trace.h>
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index e0d00180512c..0207fc0a5b2a 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -2438,6 +2438,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > >                 if (!mmap_locked)
> > > >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > >
> > > > +               trace_mm_madvise_collapse(mm, addr, result);
> > > > +
> > > >                 switch (result) {
> > > >                 case SCAN_SUCCEED:
> > > >                 case SCAN_PMD_MAPPED:
> > > > --
> > > > 2.37.0.rc0.161.g10f37bed90-goog
> > > >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-12 17:27       ` Yang Shi
@ 2022-07-12 17:57         ` Zach O'Keefe
  2022-07-13 18:02           ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-12 17:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	James Houghton

On Jul 12 10:27, Yang Shi wrote:
> On Tue, Jul 12, 2022 at 9:31 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Jul 11 14:37, Yang Shi wrote:
> > > On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > Add PMDMappable field to smaps output which informs the user if memory
> > > > in the VMA can be PMD-mapped by MADV_COLLAPSE.
> > > >
> > > > The distinction from THPeligible is needed for two reasons:
> > > >
> > > > 1) For THP, MADV_COLLAPSE is not coupled to THP sysfs controls, which
> > > >    THPeligible reports.
> > > >
> > > > 2) PMDMappable can also be used in HugeTLB fine-granularity mappings,
> > > >    which are independent from THP.
> > >
> > > Could you please elaborate the usecase? The user checks this hint
> > > before calling MADV_COLLAPSE? Is it really necessary?
> > >
> > > And, TBH it sounds confusing and we don't have to maintain both
> > > THPeligible and PMDMappable. We could just relax THPeligible to make
> > > it return 1 even though THP is disabled by sysfs but MADV_COLLAPSE
> > > could collapse it if such hint is useful.
> > >
> >
> > Hey Yang,
> >
> > Thanks for taking the time to review this series again, and thanks for
> > challenging this.
> >
> > TLDR: "Is it really necessary" - at the moment, no, probably not .. but I think
> > it's "useful".
> >
> > Rationale:
> >
> > 1. IMO, I thought it was confusing seeing:
> >
> >         ...
> >         AnonHugePages:      2048 kB
> >         ShmemPmdMapped:        0 kB
> >         FilePmdMapped:         0 kB
> >         Shared_Hugetlb:        0 kB
> >         Private_Hugetlb:       0 kB
> >         Swap:                  0 kB
> >         SwapPss:               0 kB
> >         Locked:                0 kB
> >         THPeligible:    0
> >         ...
> >
> > Maybe this could simply be clarified in the docs though.  I guess we can already
> > get:
> >
> >         ...
> >         AnonHugePages:         0 kB
> >         ShmemPmdMapped:        0 kB
> >         FilePmdMapped:      2048 kB
> >         Shared_Hugetlb:        0 kB
> >         Private_Hugetlb:       0 kB
> >         Swap:                  0 kB
> >         SwapPss:               0 kB
> >         Locked:                0 kB
> >         THPeligible:    0
> >         ...
> >
> > today[1], so perhaps it's not a big deal.
> 
> Not only that, if you have file PMD mapped then turn the THP sysfs
> flag off, you get the same result. It is just a hint and just shows
> the status at that moment when reading smaps.
>

Very good point.

> >
> >
> > 2. It was useful for debugging - similar to the rationale for including
> > THPeligible[2], the logic for determining if a VMA is eligible is pretty
> > complicated, i.e. is this file mapped suitably? Unlike THPeligible, however,
> > madvise(2) has the ability to set errno on failure to help diagnose why some
> > memory isn't being backed.
> 
> I don't disagree it would help for debugging. But as a user who
> doesn't know too much about kernel internals, when I see THPeligible
> and PMDmappable, I would get confused TBH. And do we have to maintain
> another similar hint? Maybe not.
>

> >
> > 3. For the immediately-envisioned usecases, the user "knows" about what memory
> > they are acting on. However, eventually we'd like to experiment with moving THP
> > utilization policy to userspace. Here, it would be useful if the userspace agent
> > doing the managing was made aware of what memory it should be managing. I don't
> > have a working prototype of what this would look like yet, however.
> 
> It is not a strong justification to add some user visible stuff for a
> future feature (not even prototyped) since things may change, it
> sounds safer to add such things once the usecase is solid TBH.
>

Ya, this was a weaker point for inclusion *now* TBH.

> >
> > 4. I thought it was neat that this field could be reused for HugeTLB
> > fine-granularity mappings - but TBH I'm not sure how useful it'd be there.
> >
> > I figured relaxing the existing THPeligible could break existing users / tests,
> > and it'd be likewise confusing for them to see THPeligible: 1, but then have
> > faults fail and they'd then have to go check sysfs settings and vma flags; we'd be
> > back in pre-commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
> > vma").
> 
> I'm not sure what applications rely on this hint, but if they are just
> some test scripts, I think it should be fine. I don't think we
> guarantee the test scripts won't get broken. AFAIK some test scripts
> rely on the kernel dmesg text, for example, OOMs. And the meaning of
> the fields do change, for example, inactive anon of /proc/meminfo,
> which was changed by the patchset which put anon pages on inactive
> list first instead of active list. We already noticed the abnormal
> value from our monitoring tool when we adopted 5.10+ kernel. And
> /proc/vmstat also had some fields renamed, for example,
> workingset_refault of /proc/vmstat, it was split to
> workingset_refault_anon and workingset_refault_file, so we had to
> update our monitoring scripts accordingly. I think /proc/meminfo and
> /proc/vmstat are more heavily used than smaps.
>

Thanks for the great context. My guess is, right now, THPeligible is more
useful as-is than if we were to relax it to MADV_COLLAPSE eligibility. As such,
I'm fine dropping this until a stronger and more immediate usecase presents
itself. Thanks for checking my rationale here.

Best,
Zach


> >
> > Thanks,
> > Zach
> >
> > [1] https://lore.kernel.org/linux-mm/YrxbQGiwml24APCx@google.com/
> >
> >
> > >
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  Documentation/filesystems/proc.rst | 10 ++++++++--
> > > >  fs/proc/task_mmu.c                 |  2 ++
> > > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > index 47e95dbc820d..f207903a57a5 100644
> > > > --- a/Documentation/filesystems/proc.rst
> > > > +++ b/Documentation/filesystems/proc.rst
> > > > @@ -466,6 +466,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
> > > >      MMUPageSize:           4 kB
> > > >      Locked:                0 kB
> > > >      THPeligible:           0
> > > > +    PMDMappable:           0
> > > >      VmFlags: rd ex mr mw me dw
> > > >
> > > >  The first of these lines shows the same information as is displayed for the
> > > > @@ -518,9 +519,14 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
> > > >  does not take into account swapped out page of underlying shmem objects.
> > > >  "Locked" indicates whether the mapping is locked in memory or not.
> > > >
> > > > +"PMDMappable" indicates if the memory can be mapped by PMDs - 1 if true, 0
> > > > +otherwise.  It just shows the current status. Note that this is memory
> > > > +operable on explicitly by MADV_COLLAPSE.
> > > > +
> > > >  "THPeligible" indicates whether the mapping is eligible for allocating THP
> > > > -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
> > > > -It just shows the current status.
> > > > +pages by the kernel, as well as the THP is PMD mappable or not - 1 if true, 0
> > > > +otherwise. It just shows the current status.  Note this is memory the kernel can
> > > > +transparently provide as THPs.
> > > >
> > > >  "VmFlags" field deserves a separate description. This member represents the
> > > >  kernel flags associated with the particular virtual memory area in two letter
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index f8cd58846a28..29f2089456ba 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -867,6 +867,8 @@ static int show_smap(struct seq_file *m, void *v)
> > > >
> > > >         seq_printf(m, "THPeligible:    %d\n",
> > > >                    hugepage_vma_check(vma, vma->vm_flags, true, false, true));
> > > > +       seq_printf(m, "PMDMappable:    %d\n",
> > > > +                  hugepage_vma_check(vma, vma->vm_flags, true, false, false));
> > > >
> > > >         if (arch_pkeys_enabled())
> > > >                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> > > > --
> > > > 2.37.0.rc0.161.g10f37bed90-goog
> > > >
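
For completeness, a naive sketch of how a userspace tool could read the
per-VMA THPeligible value being discussed (field name taken from the hunk
above; PMDMappable is left out since it may not land):

/*
 * Minimal, illustrative only: print THPeligible for each VMA in
 * /proc/<pid>/smaps (defaults to the calling process).
 */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256], range[128] = "";
	int val;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* VMA header lines start with a hex address range. */
		if (isxdigit((unsigned char)line[0]))
			sscanf(line, "%127s", range);
		else if (sscanf(line, "THPeligible: %d", &val) == 1)
			printf("%s THPeligible=%d\n", range, val);
	}
	fclose(f);
	return 0;
}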


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-07-08 20:47   ` Andrew Morton
@ 2022-07-13  1:05     ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-13  1:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jul 08 13:47, Andrew Morton wrote:
> On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> 
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of its own memory.
> 
> This is maximally restrictive.  I didn't see any discussion of why this
> was chosen either here or in the [0/N].  I expect that people will be
> coming after us to relax this.
> 
> So please do add (a lot of) words explaining this decision, and
> describing what might be done in the future to relax it.

Hey Andrew,

Thanks for taking the time to look at this series. After taking a look through
capabilities(7) I think you're absolutely right to call this out - thanks for
that.

I think move_pages(2) seems to be the best comparison here. There, we use
CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is able to copy
and move memory of an external process between nodes.  This is also the
current default for process_madvise(2). However, MADV_COLLAPSE additionally is
able to:

1) Influence the RSS of a process / memory charged to a cgroup (by
  collapsing a hugepage-sized/aligned region with nonresident pages). Note that
  for file/shmem, this might cause an increase in file/shmem RSS for non-target
  mm's.
2) Bypass sysfs THP settings

For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS
/ memcg usage, and we don't require any additional capabilities there.

For (2), I don't think there is an easy precedent. I think it makes sense that
the caller has write permission to /sys/kernel/mm/transparent_hugepage/*.
AFAICT, this means an effective user ID of 0 ... which is similarly restrictive
to CAP_SYS_ADMIN. One idea would be to use CAP_SETUID, since these threads
could always assume a real/effective user ID of 0.

That said, I'm not sure CAP_SETUID is needed, and perhaps the existing
process_madvise(2) restrictions are enough, given CAP_SYS_NICE confers the
ability to copy around all the same memory - we'll just be doing some
additional page table manipulations after some of that copying, which should
(mostly) be transparent to the users. I.e. I don't think it expands
CAP_SYS_NICE's "security silo" that much. Could be wrong though.
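
For concreteness, the restriction as posted boils down to something like the
following kernel-side sketch (hypothetical helper name, not the literal
patch - just the shape of the check):

#include <linux/capability.h>
#include <linux/mm_types.h>
#include <linux/sched.h>

/*
 * Hypothetical sketch of the current gate: process_madvise(MADV_COLLAPSE)
 * is allowed only on the caller's own mm, or with CAP_SYS_ADMIN.  Relaxing
 * it would mean leaning on the CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS
 * checks process_madvise(2) already performs.
 */
static bool madvise_collapse_permitted(struct mm_struct *mm)
{
	return mm == current->mm || capable(CAP_SYS_ADMIN);
}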

Again, thanks for your time,
Zach



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-12 17:57         ` Zach O'Keefe
@ 2022-07-13 18:02           ` Andrew Morton
  2022-07-13 18:40             ` Zach O'Keefe
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2022-07-13 18:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Yang Shi, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	James Houghton

On Tue, 12 Jul 2022 10:57:07 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> > I'm not sure what applications rely on this hint, but if they are just
> > some test scripts, I think it should be fine. I don't think we
> > guarantee the test scripts won't get broken. AFAIK some test scripts
> > rely on the kernel dmesg text, for example, OOMs. And the meaning of
> > the fields do change, for example, inactive anon of /proc/meminfo,
> > which was changed by the patchset which put anon pages on inactive
> > list first instead of active list. We already noticed the abnormal
> > value from our monitoring tool when we adopted 5.10+ kernel. And
> > /proc/vmstat also had some fields renamed, for example,
> > workingset_refault of /proc/vmstat, it was split to
> > workingset_refault_anon and workingset_refault_file, so we had to
> > update our monitoring scripts accordingly. I think /proc/meminfo and
> > /proc/vmstat are more heavily used than smaps.
> >
> 
> Thanks for the great context. My guess is, right now, THPeligible is more
> useful as-is than if we were to relax it to MADV_COLLAPSE eligibility. As such,
> I'm fine dropping this until a stronger and more immediate usecase presents
> itself. Thanks for checking my rationale here.

So... should I drop this patch?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps
  2022-07-13 18:02           ` Andrew Morton
@ 2022-07-13 18:40             ` Zach O'Keefe
  0 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-13 18:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yang Shi, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	James Houghton

On Wed, Jul 13, 2022 at 11:02 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 12 Jul 2022 10:57:07 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
>
> > > I'm not sure what applications rely on this hint, but if they are just
> > > some test scripts, I think it should be fine. I don't think we
> > > guarantee the test scripts won't get broken. AFAIK some test scripts
> > > rely on the kernel dmesg text, for example, OOMs. And the meaning of
> > > the fields do change, for example, inactive anon of /proc/meminfo,
> > > which was changed by the patchset which put anon pages on inactive
> > > list first instead of active list. We already noticed the abnormal
> > > value from our monitoring tool when we adopted 5.10+ kernel. And
> > > /proc/vmstat also had some fields renamed, for example,
> > > workingset_refault of /proc/vmstat, it was split to
> > > workingset_refault_anon and workingset_refault_file, so we had to
> > > update our monitoring scripts accordingly. I think /proc/meminfo and
> > > /proc/vmstat are more heavily used than smaps.
> > >
> >
> > Thanks for the great context. My guess is, right now, THPeligible is more
> > useful as-is than if we were to relax it to MADV_COLLAPSE eligibility. As such,
> > I'm fine dropping this until a stronger and more immediate usecase presents
> > itself. Thanks for checking my rationale here.
>
> So... should I drop this patch?

Ya, I don't think I have a solid argument for inclusion right now.

Thanks Andrew,

Zach


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC] mm: userspace hugepage collapse: file/shmem semantics
  2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
                   ` (17 preceding siblings ...)
  2022-07-06 23:59 ` [mm-unstable v7 18/18] selftests/vm: add selftest to verify multi THP collapse Zach O'Keefe
@ 2022-07-14 18:55 ` Zach O'Keefe
  18 siblings, 0 replies; 47+ messages in thread
From: Zach O'Keefe @ 2022-07-14 18:55 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Souptick Joarder

Hey All,

There are still a couple of interface topics (capabilities for process_madvise(2),
errnos) to iron out, but for the most part the behavior and semantics of
MADV_COLLAPSE on anonymous memory seem settled. Thanks for everyone's time and
effort in getting there.

Looking forward, I'd like to align on the semantics of file/shmem to seal
MADV_COLLAPSE behavior. This is what I'd propose for an initial man-page-like
description of MADV_COLLAPSE for madvise(2), to paint a full-picture view:

---8<---
Perform a best-effort synchronous collapse of the native pages mapped by the
memory range into Transparent Hugepages (THPs). MADV_COLLAPSE operates on the
current state of memory for the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or faulted in
the future. However, for file/shmem memory, other mappings of this file extent
may be queued and processed later by khugepaged to attempt to update their
pagetables to map the hugepage by a PMD.

If the ranges provided span multiple VMAs, the semantics of the collapse over
each VMA is independent from the others. This implies a hugepage cannot cross a
VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the
operation may continue to attempt collapsing the remainder of the specified
memory.

All non-resident pages covered by the range will first be swapped/faulted-in,
before being copied onto a freshly allocated hugepage. If the native pages
compose the same PTE-mapped hugepage, and are suitably aligned, the collapse
may happen in-place. Unmapped pages will have their data directly initialized
to 0 in the new hugepage. However, for every eligible hugepage aligned/sized
region to be collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of
determining THP eligibility, and allocation semantics. The VMA must not be
marked VM_NOHUGEPAGE, VM_HUGETLB**, VM_IO, VM_DONTEXPAND, VM_MIXEDMAP, or
VM_PFNMAP, nor can it be stack memory or DAX-backed. The process must not have
PR_SET_THP_DISABLE set. For file-backed memory, the file must either be (1) not
open for write, and the mapping must be executable, or (2) the backing
filesystem must support large pages. Allocation for the new hugepage may enter
direct reclaim and/or compaction, regardless of VMA flags.  When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing the
most native pages.

If all hugepage-sized/aligned regions covered by the provided range were either
successfully collapsed, or were already PMD-mapped THPs, this operation will be
deemed successful. On successful return, all hugepage-aligned/sized memory
regions provided will be mapped by PMDs. Note that this doesn’t guarantee
anything about other possible mappings of the memory. Note that many failures
might have occurred, since the operation may continue to collapse in the event
that collapse of a single hugepage-sized/aligned region fails.

MADV_COLLAPSE is only available if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE; file/shmem support additionally requires
CONFIG_READ_ONLY_THP_FOR_FS and CONFIG_SHMEM.
---8<---

** Might change with HugeTLB high-granularity mappings[1].
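
To make the above concrete, a minimal usage sketch against the proposed
interface (the MADV_COLLAPSE fallback value of 25 and the 2M hugepage size
below are illustrative assumptions, not something guaranteed here):

/* Back a hugepage-aligned anonymous region with 4K pages, then collapse it. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* assumed value from the proposed uapi header */
#endif

#define HPAGE_SIZE (2UL << 20)		/* assumes 2M PMD-sized hugepages */

int main(void)
{
	/* Over-allocate so a hugepage-aligned start can be carved out. */
	char *raw = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	aligned = (char *)(((unsigned long)raw + HPAGE_SIZE - 1) &
			   ~(HPAGE_SIZE - 1));

	/* At least one page per hugepage region must be present; fault them all in. */
	memset(aligned, 1, HPAGE_SIZE);

	/* Best-effort synchronous collapse, independent of THP sysfs settings. */
	if (madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");	/* e.g. ENOMEM if THP allocation fails */
	else
		printf("collapsed 2M at %p\n", (void *)aligned);
	return 0;
}

On success, the whole 2M region should end up PMD-mapped regardless of the THP
sysfs settings, which is the "PMD-mapped on success" point expanded on below.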


There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose we
should always try to actually do the page table updates. For file/shmem, this
means two things: (a) adding support to handle compound pages (both pte-mapped
hugepages and non-HPAGE_PMD_ORDER compound pages), and (b) doing a final PMD
install before returning, and not relying on subsequent fault. This makes the
semantics of file/shmem the same as anonymous. I call out (a), since there was
an existing debate about this, and so I want to ensure we are aligned[1]. Note
that (b), along with presenting a consistent interface to users, also has
real-world usecases too, where relying on fault is difficult (for example,
shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory). Also note that for (b), I'm
proposing to only do the synchronous PMD install for the memory range provided
- the page table collapse of other mappings of the memory can be deferred until
later (by khugepaged).

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work. Currently, the
requirement to be able to collapse file/shmem memory is that the file not be
opened for write anywhere, and that the mapping is executable, but we'd
eventually like to support filesystems that claim
mapping_large_folio_support()[2]. Is it acceptable that future MADV_COLLAPSE
works for either mapping_large_folio_support() or the old conditions?
Alternatively, should MADV_COLLAPSE only support mapping_large_folio_support()
filesystems from the onset? (I believe shmem and xfs are the only current
users)

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled, similar to
how we ignore /sys/kernel/mm/transparent_hugepage/enabled for anon/file? Does
that include "deny"? This choice is (partially) coupled with tmpfs huge= mount
option. I think today, things work if we ignore this. However, I don't want to
back us into a corner if we ever want to allow MADV_COLLAPSE to work on
writeable shmem mappings one day (or any other incompatibility I'm unaware of).
One option, if in (2) we chose to allow the old conditions, then we could
ignore shmem_enabled in the non-writable, executable case - otherwise defer to
"if the filesystem supports it", where we would then respect huge=.

I think those are the important points. Am I missing anything?

Thanks again everyone for taking the time to read and discuss,

Best,
Zach


[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@google.com/
[2] https://lore.kernel.org/linux-mm/YpGbnbi44JqtRg+n@casper.infradead.org/






^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2022-07-14 18:55 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-06 23:59 [mm-unstable v7 00/18] mm: userspace hugepage collapse Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 01/18] mm/khugepaged: remove redundant transhuge_vma_suitable() check Zach O'Keefe
2022-07-11 20:38   ` Yang Shi
2022-07-12 17:14     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 02/18] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 03/18] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-07-08 21:01   ` Andrew Morton
2022-07-11 18:29     ` Zach O'Keefe
2022-07-11 18:45       ` Andrew Morton
2022-07-12 14:17         ` Zach O'Keefe
2022-07-11 21:51       ` Yang Shi
2022-07-06 23:59 ` [mm-unstable v7 04/18] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 05/18] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 06/18] mm/khugepaged: add flag to predicate khugepaged-only behavior Zach O'Keefe
2022-07-11 20:43   ` Yang Shi
2022-07-12 17:06     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 07/18] mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() Zach O'Keefe
2022-07-11 20:57   ` Yang Shi
2022-07-12 16:58     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage Zach O'Keefe
2022-07-11 21:03   ` Yang Shi
2022-07-12 16:50     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 09/18] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-07-11 21:22   ` Yang Shi
2022-07-12 16:54     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 10/18] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 11/18] mm/madvise: add huge_memory:mm_madvise_collapse tracepoint Zach O'Keefe
2022-07-11 21:32   ` Yang Shi
2022-07-12 16:21     ` Zach O'Keefe
2022-07-12 17:05       ` Yang Shi
2022-07-12 17:30         ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-07-08 20:47   ` Andrew Morton
2022-07-13  1:05     ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 13/18] proc/smaps: add PMDMappable field to smaps Zach O'Keefe
2022-07-11 21:37   ` Yang Shi
2022-07-12 16:31     ` Zach O'Keefe
2022-07-12 17:27       ` Yang Shi
2022-07-12 17:57         ` Zach O'Keefe
2022-07-13 18:02           ` Andrew Morton
2022-07-13 18:40             ` Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 14/18] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 15/18] selftests/vm: dedup hugepage allocation logic Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 16/18] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 17/18] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
2022-07-06 23:59 ` [mm-unstable v7 18/18] selftests/vm: add selftest to verify multi THP collapse Zach O'Keefe
2022-07-14 18:55 ` [RFC] mm: userspace hugepage collapse: file/shmem semantics Zach O'Keefe
