* [PATCH v6 00/15] mm: userspace hugepage collapse
@ 2022-06-04  0:39 Zach O'Keefe
  2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
                   ` (14 more replies)
  0 siblings, 15 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

v6 Foreword
--------------------------------

v6 improves on v5[1] in 3 major ways:

1.  Changed MADV_COLLAPSE eligibility semantics.  In v5, MADV_COLLAPSE
ignored khugepaged max_ptes_* sysfs settings, as well as all sysfs defrag
settings.  v6 takes this further by also decoupling MADV_COLLAPSE from the
sysfs "enabled" setting.  MADV_COLLAPSE can now initiate a collapse of
memory into THPs in "madvise" and "never" modes, and never requires
VM_HUGEPAGE.  MADV_COLLAPSE still declines to operate on
VM_NOHUGEPAGE-marked VMAs.

2.  Thanks to a patch by Yang Shi to remove UMA hugepage preallocation,
hugepage allocation in khugepaged is independent of CONFIG_NUMA.  This
allows us to reuse all the allocation codepaths between collapse contexts,
greatly simplifying struct collapse_control.  Redundant khugepaged
heuristic flags have also been merged into a new enforce_page_heuristics
flag.

3.  Using MADV_COLLAPSE's new eligibility semantics, the hacks in the
selftests to disable khugepaged are no longer necessary, since we can test
MADV_COLLAPSE in "never" THP mode to prevent khugepaged interaction.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	This operation is independent of the system THP sysfs settings,
	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
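
For illustration only (not part of this series), here is a minimal sketch
of the madvise(2) usage described above.  The MADV_COLLAPSE fallback value
is an assumption taken from the uapi change in patch 9, used only when the
installed headers don't define it yet:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed; added by this series' uapi patch */
#endif

int main(void)
{
	const size_t len = 2UL << 20;	/* one pmd-sized (2MiB) region */
	void *buf;

	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 1, len);		/* fault in native pages */

	/* Best-effort synchronous collapse of the range into a THP. */
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	free(buf);
	return 0;
}

Per the semantics above, a 0 return means every hugepage-aligned/sized
region covered by the range is now (or was already) backed by a pmd-mapped
THP.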

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which might keep services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, both of which increase steady-state
	memory footprint.  With MADV_COLLAPSE, we get the best of both
	worlds: peak upfront performance and a lower RAM footprint.  Note
	that subsequent support for file-backed memory is required here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later,
	when the memory is hot, the implementation could
	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
	hugepage coverage and dTLB performance.  TCMalloc is one such
	implementation that could benefit from this[3].  A prior study of
	Google internal workloads during evaluation of Temeraire, a
	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
	all cpu cycles were spent in dTLB stalls, and that increasing
	hugepage coverage by even a small amount can help with that[4].
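
To make this pattern concrete, a hypothetical sketch (not how TCMalloc
actually implements it) of subreleasing one cold native page and later
re-collapsing the hugepage-sized chunk; the MADV_COLLAPSE fallback value is
again an assumption:

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed; added by this series' uapi patch */
#endif

#define HPAGE_SIZE	(2UL << 20)
#define PAGE_SIZE_4K	4096UL

int main(void)
{
	/* Over-map so a hugepage-aligned chunk is guaranteed to fit. */
	char *raw = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *chunk;
	size_t i;

	if (raw == MAP_FAILED)
		return 1;
	chunk = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	for (i = 0; i < HPAGE_SIZE; i += PAGE_SIZE_4K)
		chunk[i] = 1;			/* chunk is hot and populated */

	/* Subrelease one cold native page back to the system; zaps the pmd. */
	madvise(chunk + 8 * PAGE_SIZE_4K, PAGE_SIZE_4K, MADV_DONTNEED);

	chunk[8 * PAGE_SIZE_4K] = 1;		/* the memory becomes hot again */

	/* Re-back the whole chunk with a THP to regain dTLB performance. */
	if (madvise(chunk, HPAGE_SIZE, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	munmap(raw, 2 * HPAGE_SIZE);
	return 0;
}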

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[5].
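
A rough sketch of what such an agent's collapse step might look like
(hypothetical: the target pid and hot address would come from the agent's
own policy or heatmap, and raw syscall numbers from the kernel headers are
used since libc wrappers for pidfd_open(2) and process_madvise(2) may be
absent):

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed; added by this series' uapi patch */
#endif

int main(int argc, char **argv)
{
	/* Hypothetical inputs: target pid and a hot, hugepage-aligned address. */
	pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();
	void *addr = argc > 2 ? (void *)strtoul(argv[2], NULL, 0) : NULL;
	struct iovec iov = { .iov_base = addr, .iov_len = 2UL << 20 };
	long pidfd, ret;

	pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* CAP_SYS_ADMIN is required when acting on another process. */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
	if (ret < 0)
		perror("process_madvise(MADV_COLLAPSE)");

	close(pidfd);
	return 0;
}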

Sequence of Patches
--------------------------------

* Patch 1 (Yang Shi) removes UMA hugepage preallocation and makes
  khugepaged hugepage allocation independent of CONFIG_NUMA

* Patches 2-8 perform refactoring of collapse logic within khugepaged.c
  and introduce the notion of a collapse context.

* Patch 9 introduces MADV_COLLAPSE and is the main patch in this series.

* Patch 10 is a tidy-up.

* Patch 11 adds process_madvise(2) support.

* Patches 12-14 add selftests.

* Patch 15 adds support for user tools.

Applies against next-20220603

Changelog
--------------------------------

v5 -> v6:
* Added 'mm: khugepaged: don't carry huge page to the next loop for
  !CONFIG_NUMA'
  (Yang Shi)
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Add a pmd_bad() check for nonhuge pmds (Peter Xu)
* 'mm/khugepaged: dedup and simplify hugepage alloc and charging'
  -> Remove dependency on 'mm/khugepaged: sched to numa node when collapse
     huge page'
  -> No more !NUMA casing
* 'mm/khugepaged: make allocation semantics context-specific'
  -> Renamed from 'mm/khugepaged: make hugepage allocation
     context-specific'
  -> Removed function pointer hooks. (David Rientjes)
  -> Added gfp_t member to control allocation semantics.
* 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> Squashed from
     'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*' and
     'mm/khugepaged: add flag to ignore page young/referenced requirement'.
     (David Rientjes)
* Added 'mm/khugepaged: add flag to ignore THP sysfs enabled'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Use hugepage_vma_check() instead of transparent_hugepage_active()
     to determine vma eligibility.
  -> Only retry collapse once per hugepage if pages aren't found on LRU
  -> Save last failed result for more accurate errno
  -> Refactored loop structure
  -> Renamed labels
* 'selftests/vm: modularize collapse selftests'
  -> Refactored into straightline code and removed loop over contexts.
* 'selftests/vm: add MADV_COLLAPSE collapse context to selftests'
  -> Removed ->init() and ->cleanup() hooks from struct collapse_context()
     (David Rientjes)
  -> MADV_COLLAPSE operates in "never" THP mode to prevent khugepaged
     interaction. Removed all the previous khugepaged hacks.
* Added 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
* Rebased on next-20220603

v4 -> v5:
* Fix kernel test robot <lkp@intel.com> errors
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fix khugepaged_alloc_page() UMA definition
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Add "fallthrough" pseudo keyword to fix -Wimplicit-fallthrough

v3 -> v4:
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Dropped pmd_none() check from find_pmd_or_thp_or_none()
  -> Moved SCAN_PMD_MAPPED after SCAN_PMD_NULL
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: add struct collapse_control'
  -> Updated commit description and some code comments
  -> Removed extra brackets added in khugepaged_find_target_node()
* Added 'mm/khugepaged: dedup hugepage allocation and charging code'
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Has been majorly reworked to replace ->gfp() and ->alloc_hpage()
     struct collapse_control hooks with a ->alloc_charge_hpage() hook
     which makes node-allocation, gfp flags, node scheduling, hpage
     allocation, and accounting/charging context-specific.
  -> Dropped <lkp@intel.com> from sign-offs
* Added 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Replaces 'mm/khugepaged: add struct collapse_result'
* Dropped 'mm/khugepaged: add struct collapse_result'
* 'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/khugepaged: add flag to ignore page young/referenced requirement'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Moved struct collapse_control* argument to end of alloc_hpage()
  -> Some refactoring to rebase on top changes to struct
     collapse_control hook changes and other previous commits.
  -> Reworded commit description
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Renamed from 'mm/khugepaged: remove khugepaged prefix from shared
     collapse functions'
  -> Instead of dropping "khugepaged_" prefix, replace with
     "hpage_collapse_"
  -> Dropped <lkp@intel.com> from sign-offs
* Rebased onto next-20220502

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     revalidation.
  -> Reword patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are
  entirely new or reworked.
* Collapse eligibility criteria has changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now match those of khugepaged for the same VMA, instead of
  the gfp flags used at fault by the calling process for that VMA.
* Collapse semantics have changed: The collapse semantics for multiple VMAs
  spanning a single MADV_COLLAPSE call are now independent, whereas before
  the idea was to allow direct reclaim/compaction if any spanned VMA
  permitted it.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of
  hugepages in a batched-fashion with the aim of doing multiple page table
  updates inside a single mmap_lock write.  This has been changed, and the
  implementation now collapses each hugepage-aligned/sized region
  iteratively.  This was motivated by an experiment which showed that, when
  multiple threads were concurrently faulting during a MADV_COLLAPSE
  operation, mean and tail latency to acquire mmap_lock in read for threads
  in the fault path improved with a batch size of 1 (batch sizes of 1, 8,
  16, and 32 were tested)[6].
* Added: If a collapse operation fails because a page isn't found on the
  LRU, do a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[2] https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[3] https://github.com/google/tcmalloc/tree/master/tcmalloc
[4] https://research.google/pubs/pub50370/
[5] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[6] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/
Zach O'Keefe (15):
  mm: khugepaged: don't carry huge page to the next loop for
    !CONFIG_NUMA
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: dedup and simplify hugepage alloc and charging
  mm/khugepaged: make allocation semantics context-specific
  mm/khugepaged: pipe enum scan_result codes back to callers
  mm/khugepaged: add flag to ignore khugepaged heuristics
  mm/khugepaged: add flag to ignore THP sysfs enabled
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: rename prefix of shared collapse functions
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add selftest to verify recollapse of THPs
  tools headers uapi: add MADV_COLLAPSE madvise mode to tools

 arch/alpha/include/uapi/asm/mman.h           |   2 +
 arch/mips/include/uapi/asm/mman.h            |   2 +
 arch/parisc/include/uapi/asm/mman.h          |   2 +
 arch/xtensa/include/uapi/asm/mman.h          |   2 +
 include/linux/huge_mm.h                      |  12 +
 include/trace/events/huge_memory.h           |   3 +-
 include/uapi/asm-generic/mman-common.h       |   2 +
 mm/internal.h                                |   1 +
 mm/khugepaged.c                              | 673 +++++++++++--------
 mm/madvise.c                                 |  11 +-
 mm/rmap.c                                    |  15 +-
 tools/include/uapi/asm-generic/mman-common.h |   2 +
 tools/testing/selftests/vm/khugepaged.c      | 401 ++++++-----
 13 files changed, 679 insertions(+), 449 deletions(-)

--
2.36.1.255.ge46751e96f-goog




* [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 18:25   ` Yang Shi
  2022-06-29 20:49   ` Peter Xu
  2022-06-04  0:39 ` [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

From: Yang Shi <shy828301@gmail.com>

khugepaged has an optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated but failed-to-collapse huge page over
to the next loop.  CONFIG_NUMA doesn't do so, since the next loop may try to
collapse a huge page from a different node, so it doesn't make much sense to
carry it over.

But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, which means a huge page may be allocated
even though there is no suitable range to collapse.  The page is then just
freed if khugepaged has already made enough progress.  This can make a
NUMA=n run show 5 times as many thp_collapse_alloc events as a NUMA=y run.
The far more numerous pointless THP allocations actually make things worse
and render the optimization itself pointless.

This could be fixed by carrying the huge page across scans, but that would
complicate the code further and the huge page may be carried indefinitely.
Taking a step back, the optimization itself no longer seems worth keeping:
  * Not many users build NUMA=n kernels nowadays, even when the kernel is
    actually running on a non-NUMA machine.  Some small devices may run a
    NUMA=n kernel, but they are unlikely to use THP.
  * Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
    stored on the per-cpu lists"), THPs can be cached on the pcp lists,
    which largely does the job this optimization was doing.

Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 100 ++++++++----------------------------------------
 1 file changed, 17 insertions(+), 83 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 476d79360101..cc3d6fb446d5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -833,29 +833,30 @@ static int khugepaged_find_target_node(void)
 	last_khugepaged_target_node = target_node;
 	return target_node;
 }
+#else
+static int khugepaged_find_target_node(void)
+{
+	return 0;
+}
+#endif
 
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+/* Sleep for the first alloc fail, break the loop for the second fail */
+static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
 {
 	if (IS_ERR(*hpage)) {
 		if (!*wait)
-			return false;
+			return true;
 
 		*wait = false;
 		*hpage = NULL;
 		khugepaged_alloc_sleep();
-	} else if (*hpage) {
-		put_page(*hpage);
-		*hpage = NULL;
 	}
-
-	return true;
+	return false;
 }
 
 static struct page *
 khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
-
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -867,74 +868,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
-#else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
-	struct page *page;
-
-	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
-			   HPAGE_PMD_ORDER);
-	if (page)
-		prep_transhuge_page(page);
-	return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
-	struct page *hpage;
-
-	do {
-		hpage = alloc_khugepaged_hugepage();
-		if (!hpage) {
-			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-			if (!*wait)
-				return NULL;
-
-			*wait = false;
-			khugepaged_alloc_sleep();
-		} else
-			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
-	return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
-	/*
-	 * If the hpage allocated earlier was briefly exposed in page cache
-	 * before collapse_file() failed, it is possible that racing lookups
-	 * have not yet completed, and would then be unpleasantly surprised by
-	 * finding the hpage reused for the same mapping at a different offset.
-	 * Just release the previous allocation if there is any danger of that.
-	 */
-	if (*hpage && page_count(*hpage) > 1) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	if (!*hpage)
-		*hpage = khugepaged_alloc_hugepage(wait);
-
-	if (unlikely(!*hpage))
-		return false;
-
-	return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
-{
-	VM_BUG_ON(!*hpage);
-
-	return  *hpage;
-}
-#endif
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
@@ -1188,8 +1121,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
@@ -1992,8 +1927,10 @@ static void collapse_file(struct mm_struct *mm,
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
+	if (!IS_ERR_OR_NULL(*hpage)) {
 		mem_cgroup_uncharge(page_folio(*hpage));
+		put_page(*hpage);
+	}
 	/* TODO: tracepoints */
 }
 
@@ -2243,7 +2180,7 @@ static void khugepaged_do_scan(void)
 	lru_add_drain_all();
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
+		if (alloc_fail_should_sleep(&hpage, &wait))
 			break;
 
 		cond_resched();
@@ -2262,9 +2199,6 @@ static void khugepaged_do_scan(void)
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
 	}
-
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
 }
 
 static bool khugepaged_should_wakeup(void)
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
  2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 20:45   ` Yang Shi
  2022-06-04  0:39 ` [PATCH v6 03/15] mm/khugepaged: add struct collapse_control Zach O'Keefe
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE, which needs to know which
hugepage-aligned/sized regions are already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/internal.h                      |  1 +
 mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
 mm/rmap.c                          | 15 ++++++++++++--
 4 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..55392bf30a03 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -11,6 +11,7 @@
 	EM( SCAN_FAIL,			"failed")			\
 	EM( SCAN_SUCCEED,		"succeeded")			\
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
+	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
diff --git a/mm/internal.h b/mm/internal.h
index 6e14749ad1e5..f768c7fae668 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cc3d6fb446d5..7a914ca19e96 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -28,6 +28,7 @@ enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
 	SCAN_PMD_NULL,
+	SCAN_PMD_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd_raw(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	if (pmd_bad(pmde))
+		return SCAN_FAIL;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 04fac1af870b..c9979c6ad7a1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
  2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
  2022-06-04  0:39 ` [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06  2:41   ` kernel test robot
  2022-06-04  0:39 ` [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize hugepage collapse by introducing struct collapse_control.
This structure describes the properties of the requested collapse, and
also serves as a local scratch pad to use during the collapse itself.

Start by moving the global per-node khugepaged statistics into this new
structure, and stack-allocating one for the khugepaged collapse context.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7a914ca19e96..907d0b2bd4bd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() */
+	int last_target_node;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -777,9 +785,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -791,11 +797,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -810,32 +816,31 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++)
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1155,10 +1160,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct page **hpage)
+static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1176,7 +1180,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1242,16 +1246,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1302,7 +1306,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
 				referenced, unmapped);
@@ -1958,8 +1962,9 @@ static void collapse_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -1970,7 +1975,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -1995,11 +2000,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -2032,7 +2037,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2040,8 +2045,9 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2052,7 +2058,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2133,12 +2140,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage,
+						     cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2194,7 +2202,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2218,7 +2226,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2254,12 +2262,15 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+	};
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&cc);
 		khugepaged_wait_work();
 	}
 
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 03/15] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 20:50   ` Yang Shi
  2022-06-29 21:58   ` Peter Xu
  2022-06-04  0:39 ` [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific Zach O'Keefe
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following code is duplicated in collapse_huge_page() and
collapse_file():

        gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

	new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                goto out;
        }

        if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                result = SCAN_CGROUP_CHARGE_FAIL;
                goto out;
        }
        count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

Also, "node" is passed as an argument to both collapse_huge_page() and
collapse_file() and obtained the same way, via
khugepaged_find_target_node().

Move all this into a new helper, alloc_charge_hpage(), and remove the
duplicate code from collapse_huge_page() and collapse_file().  Also,
simplify khugepaged_alloc_page() by returning a bool indicating
allocation success instead of a copy of the allocated struct page.

Suggested-by: Peter Xu <peterx@redhat.com>

---

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 77 ++++++++++++++++++++++---------------------------
 1 file changed, 34 insertions(+), 43 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 907d0b2bd4bd..38488d114073 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -860,19 +860,18 @@ static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
 	return false;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
-		return NULL;
+		return false;
 	}
 
 	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	return true;
 }
 
 /*
@@ -995,10 +994,23 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
+			      struct collapse_control *cc)
+{
+	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	int node = khugepaged_find_target_node(cc);
+
+	if (!khugepaged_alloc_page(hpage, gfp, node))
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
+		return SCAN_CGROUP_CHARGE_FAIL;
+	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
+	return SCAN_SUCCEED;
+}
+
+static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			       struct page **hpage, int referenced,
+			       int unmapped, struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1009,13 +1021,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated = 0, result = 0;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
-	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
@@ -1023,17 +1031,12 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
-		goto out_nolock;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out_nolock;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+
+	new_page = *hpage;
 
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
@@ -1306,10 +1309,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		collapse_huge_page(mm, address, hpage, referenced, unmapped,
+				   cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1578,7 +1580,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @file: file that collapse on
  * @start: collapse start address
  * @hpage: new allocated huge page for collapse
- * @node: appointed node the new huge page allocate from
+ * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1595,12 +1597,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start,
-		struct page **hpage, int node)
+static void collapse_file(struct mm_struct *mm, struct file *file,
+			  pgoff_t start, struct page **hpage,
+			  struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
-	gfp_t gfp;
 	struct page *new_page;
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
@@ -1612,20 +1613,11 @@ static void collapse_file(struct mm_struct *mm,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+	new_page = *hpage;
 
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
@@ -2037,8 +2029,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
-			collapse_file(mm, file, start, hpage, node);
+			collapse_file(mm, file, start, hpage, cc);
 		}
 	}
 
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 20:58   ` Yang Shi
  2022-06-04  0:39 ` [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a gfp_t flags member to struct collapse_control that allows contexts
to specify their own allocation semantics.  This decouples the
allocation semantics from
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag.

khugepaged updates this member for every hugepage processed, since the
sysfs setting might change at any time.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 38488d114073..ba722347bebd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,6 +92,9 @@ struct collapse_control {
 
 	/* Last target selected in khugepaged_find_target_node() */
 	int last_target_node;
+
+	/* gfp used for allocation and memcg charging */
+	gfp_t gfp;
 };
 
 /**
@@ -994,15 +997,14 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
+static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
 			      struct collapse_control *cc)
 {
-	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
 	int node = khugepaged_find_target_node(cc);
 
-	if (!khugepaged_alloc_page(hpage, gfp, node))
+	if (!khugepaged_alloc_page(hpage, cc->gfp, node))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
-	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
+	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
 	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
 	return SCAN_SUCCEED;
@@ -1032,7 +1034,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = alloc_charge_hpage(mm, hpage, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1613,7 +1615,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = alloc_charge_hpage(mm, hpage, cc);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
@@ -2037,8 +2039,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+				 pgoff_t start, struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2121,6 +2122,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
 
+			/* reset gfp flags since sysfs settings might change */
+			cc->gfp = alloc_hugepage_khugepaged_gfpmask() |
+					__GFP_THISNODE;
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
@@ -2255,6 +2259,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.last_target_node = NUMA_NO_NODE,
+		/* .gfp set later  */
 	};
 
 	set_freezable();
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 22:39   ` Yang Shi
  2022-06-04  0:39 ` [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics Zach O'Keefe
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Pipe enum scan_result codes back through return values of functions
downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
inform callers if the operation was successful, and if not, why.

Since khugepaged_scan_pmd()'s return value already has a specific
meaning (whether mmap_lock was unlocked or not), add a bool* argument
to khugepaged_scan_pmd() to retrieve this information.

Change khugepaged to take action based on the return values of
khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
deep within the collapsing functions themselves.

Remove dependency on error pointers to communicate to khugepaged that
allocation failed and it should sleep; instead just use the result of
the scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 192 ++++++++++++++++++++++++------------------------
 1 file changed, 96 insertions(+), 96 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ba722347bebd..03e0da0008f1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -722,13 +722,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 						    referenced, writable, result);
-		return 1;
+		return SCAN_SUCCEED;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 					    referenced, writable, result);
-	return 0;
+	return result;
 }
 
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
@@ -850,14 +850,13 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 #endif
 
 /* Sleep for the first alloc fail, break the loop for the second fail */
-static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
+static bool alloc_fail_should_sleep(int result, bool *wait)
 {
-	if (IS_ERR(*hpage)) {
+	if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
 		if (!*wait)
 			return true;
 
 		*wait = false;
-		*hpage = NULL;
 		khugepaged_alloc_sleep();
 	}
 	return false;
@@ -868,7 +867,6 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		*hpage = ERR_PTR(-ENOMEM);
 		return false;
 	}
 
@@ -1010,17 +1008,17 @@ static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
 	return SCAN_SUCCEED;
 }
 
-static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			       struct page **hpage, int referenced,
-			       int unmapped, struct collapse_control *cc)
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			      int referenced, int unmapped,
+			      struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
 	pgtable_t pgtable;
-	struct page *new_page;
+	struct page *hpage;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
+	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 
@@ -1034,12 +1032,10 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_hpage(mm, hpage, cc);
+	result = alloc_charge_hpage(mm, &hpage, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
-	new_page = *hpage;
-
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
 	if (result) {
@@ -1100,11 +1096,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+	result =  __collapse_huge_page_isolate(vma, address, pte,
+					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
-	if (unlikely(!isolated)) {
+	if (unlikely(result != SCAN_SUCCEED)) {
 		pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
@@ -1116,7 +1112,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
-		result = SCAN_FAIL;
 		goto out_up_write;
 	}
 
@@ -1126,8 +1121,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	anon_vma_unlock_write(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
-			&compound_pagelist);
+	__collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
+				  &compound_pagelist);
 	pte_unmap(pte);
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), but
@@ -1135,43 +1130,42 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * avoid the copy_huge_page writes to become visible after
 	 * the set_pmd_at() write.
 	 */
-	__SetPageUptodate(new_page);
+	__SetPageUptodate(hpage);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+	_pmd = mk_huge_pmd(hpage, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
-	lru_cache_add_inactive_or_unevictable(new_page, vma);
+	page_add_new_anon_rmap(hpage, vma, address);
+	lru_cache_add_inactive_or_unevictable(hpage, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 
-	*hpage = NULL;
+	hpage = NULL;
 
-	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage)) {
-		mem_cgroup_uncharge(page_folio(*hpage));
-		put_page(*hpage);
+	if (hpage) {
+		mem_cgroup_uncharge(page_folio(hpage));
+		put_page(hpage);
 	}
-	trace_mm_collapse_huge_page(mm, isolated, result);
-	return;
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	return result;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, struct page **hpage,
+			       unsigned long address, bool *mmap_locked,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
+	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
@@ -1306,19 +1300,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
-		ret = 1;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+	if (result == SCAN_SUCCEED) {
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, referenced, unmapped,
-				   cc);
+		*mmap_locked = false;
+		result = collapse_huge_page(mm, address, referenced,
+					    unmapped, cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
 				     none_or_zero, result, unmapped);
-	return ret;
+	return result;
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1581,7 +1575,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @mm: process address space where collapse happens
  * @file: file that collapse on
  * @start: collapse start address
- * @hpage: new allocated huge page for collapse
  * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
@@ -1599,12 +1592,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm, struct file *file,
-			  pgoff_t start, struct page **hpage,
-			  struct collapse_control *cc)
+static int collapse_file(struct mm_struct *mm, struct file *file,
+			 pgoff_t start, struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
-	struct page *new_page;
+	struct page *hpage;
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
@@ -1615,12 +1607,10 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_hpage(mm, hpage, cc);
+	result = alloc_charge_hpage(mm, &hpage, cc);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	new_page = *hpage;
-
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
 	 * almost certainly a no-op because most of the pages must be present
@@ -1637,14 +1627,14 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		}
 	} while (1);
 
-	__SetPageLocked(new_page);
+	__SetPageLocked(hpage);
 	if (is_shmem)
-		__SetPageSwapBacked(new_page);
-	new_page->index = start;
-	new_page->mapping = mapping;
+		__SetPageSwapBacked(hpage);
+	hpage->index = start;
+	hpage->mapping = mapping;
 
 	/*
-	 * At this point the new_page is locked and not up-to-date.
+	 * At this point the hpage is locked and not up-to-date.
 	 * It's safe to insert it into the page cache, because nobody would
 	 * be able to map it or use it in another way until we unlock it.
 	 */
@@ -1672,7 +1662,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, new_page);
+				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -1814,19 +1804,19 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		list_add_tail(&page->lru, &pagelist);
 
 		/* Finally, replace with the new page. */
-		xas_store(&xas, new_page);
+		xas_store(&xas, hpage);
 		continue;
 out_unlock:
 		unlock_page(page);
 		put_page(page);
 		goto xa_unlocked;
 	}
-	nr = thp_nr_pages(new_page);
+	nr = thp_nr_pages(hpage);
 
 	if (is_shmem)
-		__mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
+		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
 	else {
-		__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
+		__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
 		filemap_nr_thps_inc(mapping);
 		/*
 		 * Paired with smp_mb() in do_dentry_open() to ensure
@@ -1837,21 +1827,21 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		smp_mb();
 		if (inode_is_open_for_write(mapping->host)) {
 			result = SCAN_FAIL;
-			__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
+			__mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
 			filemap_nr_thps_dec(mapping);
 			goto xa_locked;
 		}
 	}
 
 	if (nr_none) {
-		__mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
+		__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
 		if (is_shmem)
-			__mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
+			__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
 	}
 
 	/* Join all the small entries into a single multi-index entry */
 	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-	xas_store(&xas, new_page);
+	xas_store(&xas, hpage);
 xa_locked:
 	xas_unlock_irq(&xas);
 xa_unlocked:
@@ -1873,11 +1863,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		index = start;
 		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
 			while (index < page->index) {
-				clear_highpage(new_page + (index % HPAGE_PMD_NR));
+				clear_highpage(hpage + (index % HPAGE_PMD_NR));
 				index++;
 			}
-			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
-					page);
+			copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
+				      page);
 			list_del(&page->lru);
 			page->mapping = NULL;
 			page_ref_unfreeze(page, 1);
@@ -1888,23 +1878,23 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 			index++;
 		}
 		while (index < end) {
-			clear_highpage(new_page + (index % HPAGE_PMD_NR));
+			clear_highpage(hpage + (index % HPAGE_PMD_NR));
 			index++;
 		}
 
-		SetPageUptodate(new_page);
-		page_ref_add(new_page, HPAGE_PMD_NR - 1);
+		SetPageUptodate(hpage);
+		page_ref_add(hpage, HPAGE_PMD_NR - 1);
 		if (is_shmem)
-			set_page_dirty(new_page);
-		lru_cache_add(new_page);
+			set_page_dirty(hpage);
+		lru_cache_add(hpage);
 
 		/*
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
 		retract_page_tables(mapping, start);
-		*hpage = NULL;
-
-		khugepaged_pages_collapsed++;
+		unlock_page(hpage);
+		hpage = NULL;
+		goto out;
 	} else {
 		struct page *page;
 
@@ -1943,22 +1933,22 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		VM_BUG_ON(nr_none);
 		xas_unlock_irq(&xas);
 
-		new_page->mapping = NULL;
+		hpage->mapping = NULL;
 	}
 
-	unlock_page(new_page);
+	unlock_page(hpage);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage)) {
-		mem_cgroup_uncharge(page_folio(*hpage));
-		put_page(*hpage);
+	if (hpage) {
+		mem_cgroup_uncharge(page_folio(hpage));
+		put_page(hpage);
 	}
 	/* TODO: tracepoints */
+	return result;
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				pgoff_t start, struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2031,15 +2021,16 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, hpage, cc);
+			result = collapse_file(mm, file, start, cc);
 		}
 	}
 
 	/* TODO: tracepoints */
+	return result;
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, pgoff_t start,
+				struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2049,8 +2040,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 }
 #endif
 
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage,
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2064,6 +2054,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 	VM_BUG_ON(!pages);
 	lockdep_assert_held(&khugepaged_mm_lock);
+	*result = SCAN_FAIL;
 
 	if (khugepaged_scan.mm_slot)
 		mm_slot = khugepaged_scan.mm_slot;
@@ -2117,7 +2108,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			goto skip;
 
 		while (khugepaged_scan.address < hend) {
-			int ret;
+			bool mmap_locked = true;
+
 			cond_resched();
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
@@ -2134,20 +2126,28 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage,
-						     cc);
+				mmap_locked = false;
+				*result = khugepaged_scan_file(mm, file, pgoff,
+							       cc);
 				fput(file);
 			} else {
-				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address,
-						hpage, cc);
+				*result = khugepaged_scan_pmd(mm, vma,
+							      khugepaged_scan.address,
+							      &mmap_locked, cc);
 			}
+			if (*result == SCAN_SUCCEED)
+				++khugepaged_pages_collapsed;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-			if (ret)
-				/* we released mmap_lock so break loop */
+			if (!mmap_locked)
+				/*
+				 * We released mmap_lock so break loop.  Note
+				 * that we drop mmap_lock before all hugepage
+				 * allocations, so if allocation fails, we are
+				 * guaranteed to break here and report the
+				 * correct result back to caller.
+				 */
 				goto breakouterloop_mmap_lock;
 			if (progress >= pages)
 				goto breakouterloop;
@@ -2199,15 +2199,15 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 	bool wait = true;
+	int result = SCAN_SUCCEED;
 
 	lru_add_drain_all();
 
 	while (progress < pages) {
-		if (alloc_fail_should_sleep(&hpage, &wait))
+		if (alloc_fail_should_sleep(result, &wait))
 			break;
 
 		cond_resched();
@@ -2221,7 +2221,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage, cc);
+							    &result, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 22:51   ` Yang Shi
  2022-06-04  0:39 ` [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled Zach O'Keefe
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_page_heuristics flag to struct collapse_control that
allows a collapse context to ignore heuristics originally designed to
guide khugepaged:

1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
2) requirement that some pages in region being collapsed be young or
   referenced

This flag is set in khugepaged collapse context to preserve existing
khugepaged behavior.

This flag will be used (unset) when introducing the madvise collapse
context, since there the user presumably has reason to believe the
collapse will be beneficial, and khugepaged heuristics shouldn't tell
the user they are wrong.
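
As a rough sketch of how the two contexts are expected to initialize
the flag (the khugepaged initializer is in the hunk below; the
madvise-side initializer is an assumption based on the MADV_COLLAPSE
patch later in this series, and the variable names are purely
illustrative):

  /* khugepaged: keep all existing heuristics */
  struct collapse_control khugepaged_cc = {
          .enforce_page_heuristics = true,
          .last_target_node = NUMA_NO_NODE,
  };

  /*
   * madvise collapse (later patch): the user explicitly asked for the
   * collapse, so the max_ptes_* limits and the young/referenced
   * requirement are not enforced.
   */
  struct collapse_control madvise_cc = {
          .enforce_page_heuristics = false,
          .last_target_node = NUMA_NO_NODE,
  };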

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 55 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 03e0da0008f1..c3589b3e238d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -87,6 +87,13 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	/*
+	 * Heuristics:
+	 * - khugepaged_max_ptes_[none|swap|shared]
+	 * - require memory to be young / referenced
+	 */
+	bool enforce_page_heuristics;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -604,6 +611,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -617,7 +625,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_page_heuristics)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -637,8 +646,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_page_heuristics && page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -705,9 +714,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			list_add_tail(&page->lru, compound_pagelist);
 next:
 		/* There should be enough young pte to collapse the page */
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_page_heuristics &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 
 		if (pte_write(pteval))
@@ -716,7 +726,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	if (unlikely(!writable)) {
 		result = SCAN_PAGE_RO;
-	} else if (unlikely(!referenced)) {
+	} else if (unlikely(cc->enforce_page_heuristics && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1096,7 +1106,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	result =  __collapse_huge_page_isolate(vma, address, pte,
+	result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
@@ -1185,7 +1195,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->enforce_page_heuristics) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1204,7 +1215,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_page_heuristics)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1234,8 +1246,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_unmap;
 		}
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_page_heuristics &&
+		    page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -1289,14 +1302,17 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_page_heuristics &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 	}
 	if (!writable) {
 		result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+	} else if (cc->enforce_page_heuristics &&
+		   (!referenced ||
+		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1966,7 +1982,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->enforce_page_heuristics &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2017,7 +2034,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->enforce_page_heuristics) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
@@ -2258,6 +2276,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
+		.enforce_page_heuristics = true,
 		.last_target_node = NUMA_NO_NODE,
 		/* .gfp set later  */
 	};
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 23:02   ` Yang Shi
       [not found]   ` <YrzehlUoo2iMMLC2@xz-m1.local>
  2022-06-04  0:39 ` [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_thp_enabled flag to struct collapse_control that allows
a collapse context to ignore constraints imposed by
/sys/kernel/mm/transparent_hugepage/enabled.

This flag is set in khugepaged collapse context to preserve existing
khugepaged behavior.

This flag will be used (unset) when introducing the madvise collapse
context, since the desired THP semantics of MADV_COLLAPSE aren't
coupled to sysfs THP settings.  Most notably, for the purpose of
eventual madvise_collapse(2) support, this allows userspace to trigger
THP collapse on behalf of another process without needing to meddle
with that process's VMA flags or change sysfs THP settings.

For now, limit this flag to /sys/kernel/mm/transparent_hugepage/enabled,
but it can be expanded to include
/sys/kernel/mm/transparent_hugepage/shmem_enabled later.
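
The key mechanism, restated here as a minimal sketch of the
hugepage_vma_revalidate() hunk below: when the context does not enforce
the sysfs "enabled" setting, revalidation pretends VM_HUGEPAGE is set
on the VMA so that hugepage_vma_check() passes even in "madvise" (or
"never") mode, while VM_NOHUGEPAGE still causes the check to fail:

  vma_flags = cc->enforce_thp_enabled ?  vma->vm_flags
                  : vma->vm_flags | VM_HUGEPAGE;
  if (!hugepage_vma_check(vma, vma_flags))
          return SCAN_VMA_CHECK;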

Link: https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c3589b3e238d..4ad04f552347 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ struct collapse_control {
 	 */
 	bool enforce_page_heuristics;
 
+	/* Enforce constraints of
+	 * /sys/kernel/mm/transparent_hugepage/enabled
+	 */
+	bool enforce_thp_enabled;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -893,10 +898,12 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
  */
 
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		struct vm_area_struct **vmap)
+				   struct vm_area_struct **vmap,
+				   struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
 	unsigned long hstart, hend;
+	unsigned long vma_flags;
 
 	if (unlikely(khugepaged_test_exit(mm)))
 		return SCAN_ANY_PROCESS;
@@ -909,7 +916,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	hend = vma->vm_end & HPAGE_PMD_MASK;
 	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
 		return SCAN_ADDRESS_RANGE;
-	if (!hugepage_vma_check(vma, vma->vm_flags))
+
+	/*
+	 * If !cc->enforce_thp_enabled, set VM_HUGEPAGE so that
+	 * hugepage_vma_check() can pass even if
+	 * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
+	 * Note that hugepage_vma_check() doesn't enforce that
+	 * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
+	 * must be set (i.e. "never" mode).
+	 */
+	vma_flags = cc->enforce_thp_enabled ?  vma->vm_flags
+			: vma->vm_flags | VM_HUGEPAGE;
+	if (!hugepage_vma_check(vma, vma_flags))
 		return SCAN_VMA_CHECK;
 	/* Anon VMA expected */
 	if (!vma->anon_vma || !vma_is_anonymous(vma))
@@ -953,7 +971,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					int referenced)
+					int referenced,
+					struct collapse_control *cc)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
@@ -980,7 +999,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
 		if (ret & VM_FAULT_RETRY) {
 			mmap_read_lock(mm);
-			if (hugepage_vma_revalidate(mm, haddr, &vma)) {
+			if (hugepage_vma_revalidate(mm, haddr, &vma, cc)) {
 				/* vma is no longer available, don't continue to swapin */
 				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 				return false;
@@ -1047,7 +1066,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, &vma, cc);
 	if (result) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1066,7 +1085,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * Continuing to collapse causes inconsistency.
 	 */
 	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
-						     pmd, referenced)) {
+						     pmd, referenced, cc)) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -1078,7 +1097,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, &vma, cc);
 	if (result)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2277,6 +2296,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.enforce_page_heuristics = true,
+		.enforce_thp_enabled = true,
 		.last_target_node = NUMA_NO_NODE,
 		/* .gfp set later  */
 	};
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 23:53   ` Yang Shi
  2022-06-04  0:39 ` [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED,
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory with THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is one such
implementation that could benefit from this[2].

Only privately-mapped anon memory is supported for now, but it is
expected that file and shmem support will be added later to support the
use-case of backing executable text by THPs.  Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system,
which might prevent services from serving at their full rated load
after (re)starting.  Tricks like mremap(2)'ing text onto anonymous
memory to immediately realize iTLB performance prevent page sharing and
demand paging, both of which increase steady-state memory footprint.
With MADV_COLLAPSE, we get the best of both worlds: peak upfront
performance and lower RAM footprints.

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.

THP allocation may enter direct reclaim and/or compaction.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc
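
To make the intended usage concrete, here is a minimal, hedged
userspace sketch (not part of the patch): it maps anonymous memory,
faults it in as native pages, and then requests a synchronous collapse
of one PMD-sized region.  The MADV_COLLAPSE fallback define matches the
asm-generic value added below (some arches, e.g. parisc, use a
different number), and the 2M hugepage size is an assumption; on a
kernel without this series the call simply fails with EINVAL.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25
  #endif

  #define HPAGE_SIZE (2UL << 20)  /* assumes 2M PMD-sized THPs */

  int main(void)
  {
          /* Over-allocate so a hugepage-aligned region is guaranteed. */
          char *buf = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          char *hstart;

          if (buf == MAP_FAILED)
                  return 1;

          hstart = (char *)(((uintptr_t)buf + HPAGE_SIZE - 1) &
                            ~(HPAGE_SIZE - 1));
          memset(hstart, 1, HPAGE_SIZE);  /* populate with native pages */

          if (madvise(hstart, HPAGE_SIZE, MADV_COLLAPSE))
                  perror("madvise(MADV_COLLAPSE)");

          munmap(buf, 2 * HPAGE_SIZE);
          return 0;
  }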

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 +++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
 mm/madvise.c                           |   5 +
 8 files changed, 151 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 648cb3ce7099..2ca2f3b41fc8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ad04f552347..073d6bb03b37 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.enforce_page_heuristics = false,
+		.enforce_thp_enabled = false,
+		.last_target_node = NUMA_NO_NODE,
+		.gfp = GFP_TRANSHUGE | __GFP_THISNODE,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, last_fail = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	/* TODO: Support file/shmem */
+	if (!vma->anon_vma || !vma_is_anonymous(vma))
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+
+	/*
+	 * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
+	 * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
+	 * Note that hugepage_vma_check() doesn't enforce that
+	 * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
+	 * must be set (i.e. "never" mode)
+	 */
+	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
+		int result = SCAN_FAIL;
+		bool retry = true;  /* Allow one retry per hugepage */
+retry:
+		if (!mmap_locked) {
+			cond_resched();
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
+			if (result) {
+				last_fail = result;
+				goto out_nolock;
+			}
+		}
+		mmap_assert_locked(mm);
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* Tell caller we dropped mmap_lock */
+
+		switch (result) {
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+			break;
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			last_fail = result;
+			break;
+		case SCAN_PAGE_LRU:
+			if (retry) {
+				lru_add_drain_all();
+				retry = false;
+				goto retry;
+			}
+			fallthrough;
+		default:
+			last_fail = result;
+			/* Other error, exit */
+			goto out_maybelock;
+		}
+	}
+
+out_maybelock:
+	/* Caller expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out_nolock:
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+
+	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
+			: madvise_collapse_errno(last_fail);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 46feb62ce163..eccac2620226 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-06-04  0:39 ` Zach O'Keefe
  2022-06-06 23:56   ` Yang Shi
  2022-06-04  0:40 ` [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:39 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following functions/tracepoints are shared between the khugepaged
and madvise collapse contexts.  Replace the "khugepaged_" prefix with
the generic "hpage_collapse_" prefix in such cases:

khugepaged_test_exit() -> hpage_collapse_test_exit()
khugepaged_scan_abort() -> hpage_collapse_scan_abort()
khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
khugepaged_find_target_node() -> hpage_collapse_find_target_node()
khugepaged_alloc_page() -> hpage_collapse_alloc_page()
huge_memory:mm_khugepaged_scan_pmd ->
	huge_memory:mm_hpage_collapse_scan_pmd

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  2 +-
 mm/khugepaged.c                    | 71 ++++++++++++++++--------------
 2 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 55392bf30a03..fb6c73632ff3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -48,7 +48,7 @@ SCAN_STATUS
 #define EM(a, b)	{a, b},
 #define EMe(a, b)	{a, b}
 
-TRACE_EVENT(mm_khugepaged_scan_pmd,
+TRACE_EVENT(mm_hpage_collapse_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
 		 int referenced, int none_or_zero, int status, int unmapped),
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 073d6bb03b37..119c1bc84af7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -102,7 +102,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() */
+	/* Last target selected in hpage_collapse_find_target_node() */
 	int last_target_node;
 
 	/* gfp used for allocation and memcg charging */
@@ -456,7 +456,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -508,7 +508,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 		return;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return;
@@ -562,11 +562,10 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_lock.
+		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we return all
+		 * pagetables will be destroyed) until khugepaged has finished
+		 * working on the pagetables under the mmap_lock.
 		 */
 		mmap_write_lock(mm);
 		mmap_write_unlock(mm);
@@ -803,7 +802,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
+static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -834,7 +833,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -858,7 +857,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -877,7 +876,7 @@ static bool alloc_fail_should_sleep(int result, bool *wait)
 	return false;
 }
 
-static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
@@ -905,7 +904,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	unsigned long hstart, hend;
 	unsigned long vma_flags;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -962,7 +961,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held,
  * but with mmap_lock held to protect against vma changes.
@@ -1027,9 +1026,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
 			      struct collapse_control *cc)
 {
-	int node = khugepaged_find_target_node(cc);
+	int node = hpage_collapse_find_target_node(cc);
 
-	if (!khugepaged_alloc_page(hpage, cc->gfp, node))
+	if (!hpage_collapse_alloc_page(hpage, cc->gfp, node))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
@@ -1188,9 +1187,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, bool *mmap_locked,
-			       struct collapse_control *cc)
+static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long address, bool *mmap_locked,
+				   struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1282,7 +1282,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1345,8 +1345,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 					    unmapped, cc);
 	}
 out:
-	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, result, unmapped);
+	trace_mm_hpage_collapse_scan_pmd(mm, page, writable, referenced,
+					 none_or_zero, result, unmapped);
 	return result;
 }
 
@@ -1356,7 +1356,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (hpage_collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1530,7 +1530,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1593,7 +1593,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
+			if (!hpage_collapse_test_exit(mm) &&
+			    !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -2020,7 +2021,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2114,7 +2115,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		goto breakouterloop;
 
 	address = khugepaged_scan.address;
@@ -2123,7 +2124,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(hpage_collapse_test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2148,7 +2149,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(hpage_collapse_test_exit(mm)))
 				goto breakouterloop;
 
 			/* reset gfp flags since sysfs settings might change */
@@ -2168,9 +2169,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 							       cc);
 				fput(file);
 			} else {
-				*result = khugepaged_scan_pmd(mm, vma,
-							      khugepaged_scan.address,
-							      &mmap_locked, cc);
+				*result = hpage_collapse_scan_pmd(mm, vma,
+								  khugepaged_scan.address,
+								  &mmap_locked,
+								  cc);
 			}
 			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
@@ -2200,7 +2202,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2482,7 +2484,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		}
 		mmap_assert_locked(mm);
 		memset(cc.node_load, 0, sizeof(cc.node_load));
-		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
+						 &cc);
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-06-04  0:39 ` [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
@ 2022-06-04  0:40 ` Zach O'Keefe
  2022-06-07 19:14   ` Yang Shi
  2022-06-04  0:40 ` [PATCH v6 12/15] selftests/vm: modularize collapse selftests Zach O'Keefe
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.

This is useful for the development of userspace agents that seek to
optimize THP utilization system-wide by using userspace signals to
prioritize what memory is most deserving of being THP-backed.
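
As a hedged sketch of how such an agent might issue a remote collapse
(assumes a kernel with this series applied, recent uapi headers that
define __NR_process_madvise, and either CAP_SYS_ADMIN or a pidfd
referring to the caller's own process; the raw syscall is used since
libc may not wrap process_madvise(2)):

  #define _GNU_SOURCE
  #include <sys/syscall.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25  /* asm-generic value from the earlier patch */
  #endif

  /* Returns bytes advised on success, or -1 with errno set. */
  static long collapse_remote(int pidfd, void *addr, size_t len)
  {
          struct iovec iov = { .iov_base = addr, .iov_len = len };

          return syscall(__NR_process_madvise, pidfd, &iov, 1,
                         MADV_COLLAPSE, 0);
  }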

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index eccac2620226..b19e2f4b924c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1171,13 +1171,15 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1455,7 +1457,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
-- 
2.36.1.255.ge46751e96f-goog




* [PATCH v6 12/15] selftests/vm: modularize collapse selftests
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-06-04  0:40 ` [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-06-04  0:40 ` Zach O'Keefe
  2022-06-04  0:40 ` [PATCH v6 13/15] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize the collapse action of khugepaged collapse selftests by
introducing a struct collapse_context which specifies how to collapse a
given memory range and the expected semantics of the collapse.  This
can be reused later to test other collapse contexts.

Additionally, all tests have logic that checks whether a collapse
occurred by reading /proc/self/smaps and reports if the result differs
from what was expected.  Move this logic into the per-context
->collapse() hook instead of repeating it in every test.
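
As a minimal sketch of what a khugepaged-backed context could look like
under this scheme (the helper and variable names are illustrative and
assume a wait_for_scan()-style helper remains available, which may
differ from the exact code in this patch):

  static void khugepaged_collapse(const char *msg, char *p, bool expect)
  {
          if (wait_for_scan(msg, p))
                  fail("Timeout");
          else if (!!check_huge(p) == expect)
                  success("OK");
          else
                  fail("Fail");
  }

  static struct collapse_context khugepaged_context = {
          .collapse = &khugepaged_collapse,
          .enforce_pte_scan_limits = true,
  };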

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 315 +++++++++++-------------
 1 file changed, 142 insertions(+), 173 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 155120b67a16..24a8715363be 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -23,6 +23,11 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct collapse_context {
+	void (*collapse)(const char *msg, char *p, bool expect);
+	bool enforce_pte_scan_limits;
+};
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -469,38 +474,6 @@ static void validate_memory(int *p, unsigned long start, unsigned long end)
 	}
 }
 
-#define TICK 500000
-static bool wait_for_scan(const char *msg, char *p)
-{
-	int full_scans;
-	int timeout = 6; /* 3 seconds */
-
-	/* Sanity check */
-	if (check_huge(p)) {
-		printf("Unexpected huge page\n");
-		exit(EXIT_FAILURE);
-	}
-
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-
-	/* Wait until the second full_scan completed */
-	full_scans = read_num("khugepaged/full_scans") + 2;
-
-	printf("%s...", msg);
-	while (timeout--) {
-		if (check_huge(p))
-			break;
-		if (read_num("khugepaged/full_scans") >= full_scans)
-			break;
-		printf(".");
-		usleep(TICK);
-	}
-
-	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-
-	return timeout == -1;
-}
-
 static void alloc_at_fault(void)
 {
 	struct settings settings = default_settings;
@@ -528,53 +501,39 @@ static void alloc_at_fault(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(void)
+static void collapse_full(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, hpage_pmd_size);
-	if (wait_for_scan("Collapse fully populated PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse fully populated PTE table", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_empty(void)
+static void collapse_empty(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
-	if (wait_for_scan("Do not collapse empty PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Do not collapse empty PTE table", p, false);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(void)
+static void collapse_single_pte_entry(struct collapse_context *c)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, page_size);
-	if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table with single PTE entry present", p,
+		    true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(void)
+static void collapse_max_ptes_none(struct collapse_context *c)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = default_settings;
@@ -586,28 +545,22 @@ static void collapse_max_ptes_none(void)
 	p = alloc_mapping();
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	if (wait_for_scan("Do not collapse with max_ptes_none exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Maybe collapse with max_ptes_none exceeded", p,
+		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-	if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+	if (c->enforce_pte_scan_limits) {
+		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		c->collapse("Collapse with max_ptes_none PTEs empty", p, true);
+		validate_memory(p, 0,
+				(hpage_pmd_nr - max_ptes_none) * page_size);
+	}
 
 	munmap(p, hpage_pmd_size);
 	write_settings(&default_settings);
 }
 
-static void collapse_swapin_single_pte(void)
+static void collapse_swapin_single_pte(struct collapse_context *c)
 {
 	void *p;
 	p = alloc_mapping();
@@ -625,18 +578,13 @@ static void collapse_swapin_single_pte(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Collapse with swapping in single PTE entry", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse with swapping in single PTE entry", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(void)
+static void collapse_max_ptes_swap(struct collapse_context *c)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
@@ -656,39 +604,34 @@ static void collapse_max_ptes_swap(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Do not collapse with max_ptes_swap exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	c->collapse("Maybe collapse with max_ptes_swap exceeded", p,
+		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
-	printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
-	if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
-		perror("madvise(MADV_PAGEOUT)");
-		exit(EXIT_FAILURE);
-	}
-	if (check_swap(p, max_ptes_swap * page_size)) {
-		success("OK");
-	} else {
-		fail("Fail");
-		goto out;
-	}
+	if (c->enforce_pte_scan_limits) {
+		fill_memory(p, 0, hpage_pmd_size);
+		printf("Swapout %d of %d pages...", max_ptes_swap,
+		       hpage_pmd_nr);
+		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+			perror("madvise(MADV_PAGEOUT)");
+			exit(EXIT_FAILURE);
+		}
+		if (check_swap(p, max_ptes_swap * page_size)) {
+			success("OK");
+		} else {
+			fail("Fail");
+			goto out;
+		}
 
-	if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, hpage_pmd_size);
+		c->collapse("Collapse with max_ptes_swap pages swapped out", p,
+			    true);
+		validate_memory(p, 0, hpage_pmd_size);
+	}
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(void)
+static void collapse_single_pte_entry_compound(struct collapse_context *c)
 {
 	void *p;
 
@@ -710,17 +653,13 @@ static void collapse_single_pte_entry_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table with single PTE mapping compound page", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table with single PTE mapping compound page",
+		    p, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(void)
+static void collapse_full_of_compound(struct collapse_context *c)
 {
 	void *p;
 
@@ -742,17 +681,12 @@ static void collapse_full_of_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table full of compound pages", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(void)
+static void collapse_compound_extreme(struct collapse_context *c)
 {
 	void *p;
 	int i;
@@ -798,18 +732,14 @@ static void collapse_compound_extreme(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of different compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	c->collapse("Collapse PTE table full of different compound pages", p,
+		    true);
 
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork(void)
+static void collapse_fork(struct collapse_context *c)
 {
 	int wstatus;
 	void *p;
@@ -835,13 +765,8 @@ static void collapse_fork(void)
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
-
-		if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Collapse PTE table with single page shared with parent process",
+			    p, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -860,7 +785,7 @@ static void collapse_fork(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(void)
+static void collapse_fork_compound(struct collapse_context *c)
 {
 	int wstatus;
 	void *p;
@@ -896,14 +821,10 @@ static void collapse_fork_compound(void)
 		fill_memory(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
-		if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Collapse PTE table full of compound pages in child",
+			    p, true);
 		write_num("khugepaged/max_ptes_shared",
-				default_settings.khugepaged.max_ptes_shared);
+			  default_settings.khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -922,7 +843,7 @@ static void collapse_fork_compound(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared()
+static void collapse_max_ptes_shared(struct collapse_context *c)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
@@ -957,28 +878,22 @@ static void collapse_max_ptes_shared()
 		else
 			fail("Fail");
 
-		if (wait_for_scan("Do not collapse with max_ptes_shared exceeded", p))
-			fail("Timeout");
-		else if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
-		if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-
-		if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		c->collapse("Maybe collapse with max_ptes_shared exceeded", p,
+			    !c->enforce_pte_scan_limits);
+
+		if (c->enforce_pte_scan_limits) {
+			printf("Trigger CoW on page %d of %d...",
+			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+				    page_size);
+			if (!check_huge(p))
+				success("OK");
+			else
+				fail("Fail");
+
+			c->collapse("Collapse with max_ptes_shared PTEs shared",
+				    p, true);
+		}
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -997,8 +912,57 @@ static void collapse_max_ptes_shared()
 	munmap(p, hpage_pmd_size);
 }
 
+#define TICK 500000
+static bool wait_for_scan(const char *msg, char *p)
+{
+	int full_scans;
+	int timeout = 6; /* 3 seconds */
+
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+
+	/* Wait until the second full_scan completed */
+	full_scans = read_num("khugepaged/full_scans") + 2;
+
+	printf("%s...", msg);
+	while (timeout--) {
+		if (check_huge(p))
+			break;
+		if (read_num("khugepaged/full_scans") >= full_scans)
+			break;
+		printf(".");
+		usleep(TICK);
+	}
+
+	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+
+	return timeout == -1;
+}
+
+static void khugepaged_collapse(const char *msg, char *p, bool expect)
+{
+	if (wait_for_scan(msg, p)) {
+		if (expect)
+			fail("Timeout");
+		else
+			success("OK");
+		return;
+	} else if (check_huge(p) == expect) {
+		success("OK");
+	} else {
+		fail("Fail");
+	}
+}
+
 int main(void)
 {
+	struct collapse_context c;
+
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
@@ -1014,18 +978,23 @@ int main(void)
 	adjust_settings();
 
 	alloc_at_fault();
-	collapse_full();
-	collapse_empty();
-	collapse_single_pte_entry();
-	collapse_max_ptes_none();
-	collapse_swapin_single_pte();
-	collapse_max_ptes_swap();
-	collapse_single_pte_entry_compound();
-	collapse_full_of_compound();
-	collapse_compound_extreme();
-	collapse_fork();
-	collapse_fork_compound();
-	collapse_max_ptes_shared();
+
+	printf("\n*** Testing context: khugepaged ***\n");
+	c.collapse = &khugepaged_collapse;
+	c.enforce_pte_scan_limits = true;
+
+	collapse_full(&c);
+	collapse_empty(&c);
+	collapse_single_pte_entry(&c);
+	collapse_max_ptes_none(&c);
+	collapse_swapin_single_pte(&c);
+	collapse_max_ptes_swap(&c);
+	collapse_single_pte_entry_compound(&c);
+	collapse_full_of_compound(&c);
+	collapse_compound_extreme(&c);
+	collapse_fork(&c);
+	collapse_fork_compound(&c);
+	collapse_max_ptes_shared(&c);
 
 	restore_settings(0);
 }
-- 
2.36.1.255.ge46751e96f-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 13/15] selftests/vm: add MADV_COLLAPSE collapse context to selftests
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-06-04  0:40 ` [PATCH v6 12/15] selftests/vm: modularize collapse selftests Zach O'Keefe
@ 2022-06-04  0:40 ` Zach O'Keefe
  2022-06-04  0:40 ` [PATCH v6 14/15] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
  2022-06-04  0:40 ` [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools Zach O'Keefe
  14 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add madvise collapse context to hugepage collapse selftests.  This
context is tested with /sys/kernel/mm/transparent_hugepage/enabled set
to "never" in order to avoid unwanted interaction with khugepaged during
testing.
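
For reference, the collapse_context plumbing from the previous patch that
this change builds on looks roughly like the following (a sketch inferred
from the callers in these diffs, not a verbatim copy of that patch):

struct collapse_context {
	/* collapse @p and check the result against @expect */
	void (*collapse)(const char *msg, char *p, bool expect);
	/* whether this context honors the khugepaged max_ptes_* limits */
	bool enforce_pte_scan_limits;
};

The madvise context registered below sets enforce_pte_scan_limits to
false, since MADV_COLLAPSE does not enforce those limits.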

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 55 +++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 24a8715363be..5207930b34a4 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -14,6 +14,9 @@
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
 
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
@@ -108,6 +111,7 @@ static struct settings default_settings = {
 };
 
 static struct settings saved_settings;
+static struct settings current_settings;
 static bool skip_settings_restore;
 
 static int exit_status;
@@ -282,6 +286,8 @@ static void write_settings(struct settings *settings)
 	write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
 	write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
 	write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
+
+	current_settings = *settings;
 }
 
 static void restore_settings(int sig)
@@ -912,6 +918,38 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse(const char *msg, char *p, bool expect)
+{
+	int ret;
+	struct settings old_settings = current_settings;
+	struct settings settings = old_settings;
+
+	printf("%s...", msg);
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/*
+	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
+	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
+	 */
+	settings.thp_enabled = THP_NEVER;
+	write_settings(&settings);
+
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (((bool)ret) == expect)
+		fail("Fail: Bad return value");
+	else if (check_huge(p) != expect)
+		fail("Fail: check_huge()");
+	else
+		success("OK");
+
+	write_settings(&old_settings);
+}
+
 #define TICK 500000
 static bool wait_for_scan(const char *msg, char *p)
 {
@@ -996,5 +1034,22 @@ int main(void)
 	collapse_fork_compound(&c);
 	collapse_max_ptes_shared(&c);
 
+	printf("\n*** Testing context: madvise ***\n");
+	c.collapse = &madvise_collapse;
+	c.enforce_pte_scan_limits = false;
+
+	collapse_full(&c);
+	collapse_empty(&c);
+	collapse_single_pte_entry(&c);
+	collapse_max_ptes_none(&c);
+	collapse_swapin_single_pte(&c);
+	collapse_max_ptes_swap(&c);
+	collapse_single_pte_entry_compound(&c);
+	collapse_full_of_compound(&c);
+	collapse_compound_extreme(&c);
+	collapse_fork(&c);
+	collapse_fork_compound(&c);
+	collapse_max_ptes_shared(&c);
+
 	restore_settings(0);
 }
-- 
2.36.1.255.ge46751e96f-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 14/15] selftests/vm: add selftest to verify recollapse of THPs
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (12 preceding siblings ...)
  2022-06-04  0:40 ` [PATCH v6 13/15] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
@ 2022-06-04  0:40 ` Zach O'Keefe
  2022-06-04  0:40 ` [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools Zach O'Keefe
  14 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a selftest specific to the madvise collapse context that tests that
MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is
already pmd-mapped.

This test also verifies that MADV_COLLAPSE can collapse memory into THPs
even in "madvise" THP mode when the memory isn't marked VM_HUGEPAGE.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 31 +++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 5207930b34a4..eeea84b0cd35 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -918,6 +918,36 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse_existing_thps(void)
+{
+	void *p;
+	int err;
+
+	p = alloc_mapping();
+	fill_memory(p, 0, hpage_pmd_size);
+
+	printf("Collapse fully populated PTE table...");
+	/*
+	 * Note that we don't set MADV_HUGEPAGE here, which
+	 * also tests that VM_HUGEPAGE isn't required for
+	 * MADV_COLLAPSE in "madvise" mode.
+	 */
+	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (err == 0 && check_huge(p)) {
+		success("OK");
+		printf("Re-collapse PMD-mapped hugepage");
+		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+		if (err == 0 && check_huge(p))
+			success("OK");
+		else
+			fail("Fail");
+	} else {
+		fail("Fail");
+	}
+	validate_memory(p, 0, hpage_pmd_size);
+	munmap(p, hpage_pmd_size);
+}
+
 static void madvise_collapse(const char *msg, char *p, bool expect)
 {
 	int ret;
@@ -1050,6 +1080,7 @@ int main(void)
 	collapse_fork(&c);
 	collapse_fork_compound(&c);
 	collapse_max_ptes_shared(&c);
+	madvise_collapse_existing_thps();
 
 	restore_settings(0);
 }
-- 
2.36.1.255.ge46751e96f-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools
  2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
                   ` (13 preceding siblings ...)
  2022-06-04  0:40 ` [PATCH v6 14/15] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
@ 2022-06-04  0:40 ` Zach O'Keefe
  2022-06-06 23:58   ` Yang Shi
  14 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-04  0:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

This lets tools translate the MADV_COLLAPSE advice into a human-readable string:

$ tools/perf/trace/beauty/madvise_behavior.sh
static const char *madvise_advices[] = {
        [0] = "NORMAL",
        [1] = "RANDOM",
        [2] = "SEQUENTIAL",
        [3] = "WILLNEED",
        [4] = "DONTNEED",
        [8] = "FREE",
        [9] = "REMOVE",
        [10] = "DONTFORK",
        [11] = "DOFORK",
        [12] = "MERGEABLE",
        [13] = "UNMERGEABLE",
        [14] = "HUGEPAGE",
        [15] = "NOHUGEPAGE",
        [16] = "DONTDUMP",
        [17] = "DODUMP",
        [18] = "WIPEONFORK",
        [19] = "KEEPONFORK",
        [20] = "COLD",
        [21] = "PAGEOUT",
        [22] = "POPULATE_READ",
        [23] = "POPULATE_WRITE",
        [24] = "DONTNEED_LOCKED",
        [25] = "COLLAPSE",
        [100] = "HWPOISON",
        [101] = "SOFT_OFFLINE",
};

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/include/uapi/asm-generic/mman-common.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
-- 
2.36.1.255.ge46751e96f-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-04  0:39 ` [PATCH v6 03/15] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-06-06  2:41   ` kernel test robot
  2022-06-06 16:40       ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: kernel test robot @ 2022-06-06  2:41 UTC (permalink / raw)
  To: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi,
	Zi Yan, linux-mm
  Cc: kbuild-all, Andrea Arcangeli, Andrew Morton,
	Linux Memory Management List, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

Hi Zach,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
        git checkout d87b6065d6050b89930cca0814921aca7c269286
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   mm/khugepaged.c: In function 'khugepaged':
>> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
    2284 | }
         | ^


vim +2284 mm/khugepaged.c

b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261  
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264  	struct mm_slot *mm_slot;
d87b6065d6050b Zach O'Keefe       2022-06-03  2265  	struct collapse_control cc = {
d87b6065d6050b Zach O'Keefe       2022-06-03  2266  		.last_target_node = NUMA_NO_NODE,
d87b6065d6050b Zach O'Keefe       2022-06-03  2267  	};
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268  
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269  	set_freezable();
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270  	set_user_nice(current, MAX_NICE);
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271  
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272  	while (!kthread_should_stop()) {
d87b6065d6050b Zach O'Keefe       2022-06-03  2273  		khugepaged_do_scan(&cc);
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274  		khugepaged_wait_work();
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275  	}
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276  
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277  	spin_lock(&khugepaged_mm_lock);
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278  	mm_slot = khugepaged_scan.mm_slot;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279  	khugepaged_scan.mm_slot = NULL;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280  	if (mm_slot)
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281  		collect_mm_slot(mm_slot);
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282  	spin_unlock(&khugepaged_mm_lock);
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283  	return 0;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06  2:41   ` kernel test robot
@ 2022-06-06 16:40       ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-06 16:40 UTC (permalink / raw)
  To: kernel test robot
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, kbuild-all, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin

On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Zach,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
>         git checkout d87b6065d6050b89930cca0814921aca7c269286
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
>    mm/khugepaged.c: In function 'khugepaged':
> >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>     2284 | }
>          | ^

Thanks lkp@intel.com.

This is due to config with:

CONFIG_FRAME_WARN=2048
CONFIG_NODES_SHIFT=10

Where struct collapse_control has a member int
node_load[MAX_NUMNODES], and we stack allocate one.
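
As a rough illustration of the arithmetic (only node_load and
last_target_node are members visible here; the rest is
back-of-the-envelope):

#define NODES_SHIFT	10			/* from the config above */
#define MAX_NUMNODES	(1 << NODES_SHIFT)	/* 1024 nodes */

struct collapse_control {
	int node_load[MAX_NUMNODES];	/* 1024 * sizeof(int) = 4096 bytes */
	int last_target_node;
	/* ... */
};

A single on-stack instance is already ~4KiB, which lines up with the
4160-byte frame reported above and blows past CONFIG_FRAME_WARN=2048.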

Is this a configuration that needs to be supported? 1024 nodes seems
like a lot, and I'm not sure whether these configs are randomly
generated or representative of real systems.

Thanks,
Zach

>
> vim +2284 mm/khugepaged.c
>
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264      struct mm_slot *mm_slot;
> d87b6065d6050b Zach O'Keefe       2022-06-03  2265      struct collapse_control cc = {
> d87b6065d6050b Zach O'Keefe       2022-06-03  2266              .last_target_node = NUMA_NO_NODE,
> d87b6065d6050b Zach O'Keefe       2022-06-03  2267      };
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269      set_freezable();
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270      set_user_nice(current, MAX_NICE);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272      while (!kthread_should_stop()) {
> d87b6065d6050b Zach O'Keefe       2022-06-03  2273              khugepaged_do_scan(&cc);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274              khugepaged_wait_work();
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275      }
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277      spin_lock(&khugepaged_mm_lock);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278      mm_slot = khugepaged_scan.mm_slot;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279      khugepaged_scan.mm_slot = NULL;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280      if (mm_slot)
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281              collect_mm_slot(mm_slot);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282      spin_unlock(&khugepaged_mm_lock);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283      return 0;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
@ 2022-06-06 16:40       ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-06 16:40 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3906 bytes --]

On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Zach,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp(a)intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
>         git checkout d87b6065d6050b89930cca0814921aca7c269286
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
>    mm/khugepaged.c: In function 'khugepaged':
> >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>     2284 | }
>          | ^

Thanks lkp(a)intel.com.

This is due to config with:

CONFIG_FRAME_WARN=2048
CONFIG_NODES_SHIFT=10

Where struct collapse_control has a member int
node_load[MAX_NUMNODES], and we stack allocate one.

Is this a configuration that needs to be supported? 1024 nodes seems
like a lot and I'm not sure if these configs are randomly generated or
are reminiscent of real systems.

Thanks,
Zach

>
> vim +2284 mm/khugepaged.c
>
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264      struct mm_slot *mm_slot;
> d87b6065d6050b Zach O'Keefe       2022-06-03  2265      struct collapse_control cc = {
> d87b6065d6050b Zach O'Keefe       2022-06-03  2266              .last_target_node = NUMA_NO_NODE,
> d87b6065d6050b Zach O'Keefe       2022-06-03  2267      };
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269      set_freezable();
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270      set_user_nice(current, MAX_NICE);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272      while (!kthread_should_stop()) {
> d87b6065d6050b Zach O'Keefe       2022-06-03  2273              khugepaged_do_scan(&cc);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274              khugepaged_wait_work();
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275      }
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277      spin_lock(&khugepaged_mm_lock);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278      mm_slot = khugepaged_scan.mm_slot;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279      khugepaged_scan.mm_slot = NULL;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280      if (mm_slot)
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281              collect_mm_slot(mm_slot);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282      spin_unlock(&khugepaged_mm_lock);
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283      return 0;
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
> b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
@ 2022-06-06 18:25   ` Yang Shi
  2022-06-29 20:49   ` Peter Xu
  1 sibling, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 18:25 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> From: Yang Shi <shy828301@gmail.com>
>
> The khugepaged has optimization to reduce huge page allocation calls for
> !CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
> the next loop.  CONFIG_NUMA doesn't do so since the next loop may try to
> collapse huge page from a different node, so it doesn't make too much sense
> to carry it.
>
> But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
> before scanning the address space, so it means huge page may be allocated
> even though there is no suitable range for collapsing.  Then the page would
> be just freed if khugepaged already made enough progress.  This could make
> NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y run.  This
> problem actually makes things worse due to the way more pointless THP
> allocations and makes the optimization pointless.
>
> This could be fixed by carrying the huge page across scans, but it will
> complicate the code further and the huge page may be carried
> indefinitely.  But if we take one step back,  the optimization itself seems
> not worth keeping nowadays since:
>   * Not too many users build NUMA=n kernel nowadays even though the kernel is
>     actually running on a non-NUMA machine. Some small devices may run NUMA=n
>     kernel, but I don't think they actually use THP.
>   * Since commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
>     stored on the per-cpu lists"), THP could be cached by pcp.  This actually
>     somehow does the job done by the optimization.
>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Thanks for bringing the patch into the series. You could have my SOB
on this patch.

> ---
>  mm/khugepaged.c | 100 ++++++++----------------------------------------
>  1 file changed, 17 insertions(+), 83 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 476d79360101..cc3d6fb446d5 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -833,29 +833,30 @@ static int khugepaged_find_target_node(void)
>         last_khugepaged_target_node = target_node;
>         return target_node;
>  }
> +#else
> +static int khugepaged_find_target_node(void)
> +{
> +       return 0;
> +}
> +#endif
>
> -static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> +/* Sleep for the first alloc fail, break the loop for the second fail */
> +static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
>  {
>         if (IS_ERR(*hpage)) {
>                 if (!*wait)
> -                       return false;
> +                       return true;
>
>                 *wait = false;
>                 *hpage = NULL;
>                 khugepaged_alloc_sleep();
> -       } else if (*hpage) {
> -               put_page(*hpage);
> -               *hpage = NULL;
>         }
> -
> -       return true;
> +       return false;
>  }
>
>  static struct page *
>  khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  {
> -       VM_BUG_ON_PAGE(*hpage, *hpage);
> -
>         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
>         if (unlikely(!*hpage)) {
>                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> @@ -867,74 +868,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>         count_vm_event(THP_COLLAPSE_ALLOC);
>         return *hpage;
>  }
> -#else
> -static int khugepaged_find_target_node(void)
> -{
> -       return 0;
> -}
> -
> -static inline struct page *alloc_khugepaged_hugepage(void)
> -{
> -       struct page *page;
> -
> -       page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
> -                          HPAGE_PMD_ORDER);
> -       if (page)
> -               prep_transhuge_page(page);
> -       return page;
> -}
> -
> -static struct page *khugepaged_alloc_hugepage(bool *wait)
> -{
> -       struct page *hpage;
> -
> -       do {
> -               hpage = alloc_khugepaged_hugepage();
> -               if (!hpage) {
> -                       count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> -                       if (!*wait)
> -                               return NULL;
> -
> -                       *wait = false;
> -                       khugepaged_alloc_sleep();
> -               } else
> -                       count_vm_event(THP_COLLAPSE_ALLOC);
> -       } while (unlikely(!hpage) && likely(khugepaged_enabled()));
> -
> -       return hpage;
> -}
> -
> -static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> -{
> -       /*
> -        * If the hpage allocated earlier was briefly exposed in page cache
> -        * before collapse_file() failed, it is possible that racing lookups
> -        * have not yet completed, and would then be unpleasantly surprised by
> -        * finding the hpage reused for the same mapping at a different offset.
> -        * Just release the previous allocation if there is any danger of that.
> -        */
> -       if (*hpage && page_count(*hpage) > 1) {
> -               put_page(*hpage);
> -               *hpage = NULL;
> -       }
> -
> -       if (!*hpage)
> -               *hpage = khugepaged_alloc_hugepage(wait);
> -
> -       if (unlikely(!*hpage))
> -               return false;
> -
> -       return true;
> -}
> -
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> -{
> -       VM_BUG_ON(!*hpage);
> -
> -       return  *hpage;
> -}
> -#endif
>
>  /*
>   * If mmap_lock temporarily dropped, revalidate vma
> @@ -1188,8 +1121,10 @@ static void collapse_huge_page(struct mm_struct *mm,
>  out_up_write:
>         mmap_write_unlock(mm);
>  out_nolock:
> -       if (!IS_ERR_OR_NULL(*hpage))
> +       if (!IS_ERR_OR_NULL(*hpage)) {
>                 mem_cgroup_uncharge(page_folio(*hpage));
> +               put_page(*hpage);
> +       }
>         trace_mm_collapse_huge_page(mm, isolated, result);
>         return;
>  }
> @@ -1992,8 +1927,10 @@ static void collapse_file(struct mm_struct *mm,
>         unlock_page(new_page);
>  out:
>         VM_BUG_ON(!list_empty(&pagelist));
> -       if (!IS_ERR_OR_NULL(*hpage))
> +       if (!IS_ERR_OR_NULL(*hpage)) {
>                 mem_cgroup_uncharge(page_folio(*hpage));
> +               put_page(*hpage);
> +       }
>         /* TODO: tracepoints */
>  }
>
> @@ -2243,7 +2180,7 @@ static void khugepaged_do_scan(void)
>         lru_add_drain_all();
>
>         while (progress < pages) {
> -               if (!khugepaged_prealloc_page(&hpage, &wait))
> +               if (alloc_fail_should_sleep(&hpage, &wait))
>                         break;
>
>                 cond_resched();
> @@ -2262,9 +2199,6 @@ static void khugepaged_do_scan(void)
>                         progress = pages;
>                 spin_unlock(&khugepaged_mm_lock);
>         }
> -
> -       if (!IS_ERR_OR_NULL(hpage))
> -               put_page(hpage);
>  }
>
>  static bool khugepaged_should_wakeup(void)
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06 16:40       ` Zach O'Keefe
@ 2022-06-06 20:20         ` Yang Shi
  -1 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 20:20 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: kernel test robot, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, kbuild-all, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin

On Mon, Jun 6, 2022 at 9:40 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Zach,
> >
> > Thank you for the patch! Perhaps something to improve:
> >
> > [auto build test WARNING on akpm-mm/mm-everything]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > reproduce (this is a W=1 build):
> >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> >         # save the config file
> >         mkdir build_dir && cp config build_dir/.config
> >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> >
> > If you fix the issue, kindly add following tag where applicable
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> >    mm/khugepaged.c: In function 'khugepaged':
> > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> >     2284 | }
> >          | ^
>
> Thanks lkp@intel.com.
>
> This is due to config with:
>
> CONFIG_FRAME_WARN=2048
> CONFIG_NODES_SHIFT=10
>
> Where struct collapse_control has a member int
> node_load[MAX_NUMNODES], and we stack allocate one.
>
> Is this a configuration that needs to be supported? 1024 nodes seems
> like a lot and I'm not sure if these configs are randomly generated or
> are reminiscent of real systems.

I don't have a better idea other than moving it out of the
collapse_control struct. You may consider changing node_load to two
dimensions, for example:

node_load[2][MAX_NUMNODES], then define:
enum {
    /* khugepaged */
    COLLAPSE_ASYNC,
    /* MADV_COLLAPSE */
    COLLAPSE_SYNC
}

Then khugepaged and MADV_COLLAPSE each get their own dedicated node_load.
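
Purely as an illustration of the two-dimensional idea (names are made
up, not a concrete proposal):

enum collapse_ctx {
	COLLAPSE_ASYNC,		/* khugepaged */
	COLLAPSE_SYNC,		/* MADV_COLLAPSE */
	NR_COLLAPSE_CTX
};

/* file-scope instead of a per-collapse_control on-stack array */
static int node_load[NR_COLLAPSE_CTX][MAX_NUMNODES];

static void record_node_load(enum collapse_ctx ctx, int nid)
{
	node_load[ctx][nid]++;
}

Each context would then tally into its own row, keeping the large array
off the stack.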

The more aggressive approach may be just killing node_load, but I'm
not sure what impact it may incur.

>
> Thanks,
> Zach
>
> >
> > vim +2284 mm/khugepaged.c
> >
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264      struct mm_slot *mm_slot;
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2265      struct collapse_control cc = {
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2266              .last_target_node = NUMA_NO_NODE,
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2267      };
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269      set_freezable();
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270      set_user_nice(current, MAX_NICE);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272      while (!kthread_should_stop()) {
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2273              khugepaged_do_scan(&cc);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274              khugepaged_wait_work();
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275      }
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277      spin_lock(&khugepaged_mm_lock);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278      mm_slot = khugepaged_scan.mm_slot;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279      khugepaged_scan.mm_slot = NULL;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280      if (mm_slot)
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281              collect_mm_slot(mm_slot);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282      spin_unlock(&khugepaged_mm_lock);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283      return 0;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://01.org/lkp
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
@ 2022-06-06 20:20         ` Yang Shi
  0 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 20:20 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4626 bytes --]

On Mon, Jun 6, 2022 at 9:40 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Zach,
> >
> > Thank you for the patch! Perhaps something to improve:
> >
> > [auto build test WARNING on akpm-mm/mm-everything]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp(a)intel.com/config)
> > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > reproduce (this is a W=1 build):
> >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> >         # save the config file
> >         mkdir build_dir && cp config build_dir/.config
> >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> >
> > If you fix the issue, kindly add following tag where applicable
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> >    mm/khugepaged.c: In function 'khugepaged':
> > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> >     2284 | }
> >          | ^
>
> Thanks lkp(a)intel.com.
>
> This is due to config with:
>
> CONFIG_FRAME_WARN=2048
> CONFIG_NODES_SHIFT=10
>
> Where struct collapse_control has a member int
> node_load[MAX_NUMNODES], and we stack allocate one.
>
> Is this a configuration that needs to be supported? 1024 nodes seems
> like a lot and I'm not sure if these configs are randomly generated or
> are reminiscent of real systems.

I don't have a better idea other than moving it out of the
collapse_control struct. You may consider changing node_load to two
dimensions, for example:

node_load[2][MAX_NUMNODES], then define:
enum {
    /* khugepaged */
    COLLAPSE_ASYNC,
    /* MADV_COLLAPSE */
    COLLAPSE_SYNC
}

Then khugepaged and MADV_COLLAPSE get their dedicated node_load respectively.

The more aggressive approach may be just killing node_load, but I'm
not sure what impact it may incur.

>
> Thanks,
> Zach
>
> >
> > vim +2284 mm/khugepaged.c
> >
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264      struct mm_slot *mm_slot;
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2265      struct collapse_control cc = {
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2266              .last_target_node = NUMA_NO_NODE,
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2267      };
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269      set_freezable();
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270      set_user_nice(current, MAX_NICE);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272      while (!kthread_should_stop()) {
> > d87b6065d6050b Zach O'Keefe       2022-06-03  2273              khugepaged_do_scan(&cc);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274              khugepaged_wait_work();
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275      }
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277      spin_lock(&khugepaged_mm_lock);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278      mm_slot = khugepaged_scan.mm_slot;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279      khugepaged_scan.mm_slot = NULL;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280      if (mm_slot)
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281              collect_mm_slot(mm_slot);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282      spin_unlock(&khugepaged_mm_lock);
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283      return 0;
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
> > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://01.org/lkp
> >

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-04  0:39 ` [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-06-06 20:45   ` Yang Shi
  2022-06-07 16:01     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 20:45 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> When scanning an anon pmd to see if it's eligible for collapse, return
> SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> file-collapse path, since the latter might identify pte-mapped compound
> pages.  This is required by MADV_COLLAPSE which necessarily needs to
> know what hugepage-aligned/sized regions are already pmd-mapped.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/internal.h                      |  1 +
>  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
>  mm/rmap.c                          | 15 ++++++++++++--
>  4 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d651f3437367..55392bf30a03 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -11,6 +11,7 @@
>         EM( SCAN_FAIL,                  "failed")                       \
>         EM( SCAN_SUCCEED,               "succeeded")                    \
>         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
>         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
>         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
>         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> diff --git a/mm/internal.h b/mm/internal.h
> index 6e14749ad1e5..f768c7fae668 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
>  /*
>   * in mm/rmap.c:
>   */
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
>  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>
>  /*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cc3d6fb446d5..7a914ca19e96 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -28,6 +28,7 @@ enum scan_result {
>         SCAN_FAIL,
>         SCAN_SUCCEED,
>         SCAN_PMD_NULL,
> +       SCAN_PMD_MAPPED,
>         SCAN_EXCEED_NONE_PTE,
>         SCAN_EXCEED_SWAP_PTE,
>         SCAN_EXCEED_SHARED_PTE,
> @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         return 0;
>  }
>
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +                                  unsigned long address,
> +                                  pmd_t **pmd)
> +{
> +       pmd_t pmde;
> +
> +       *pmd = mm_find_pmd_raw(mm, address);
> +       if (!*pmd)
> +               return SCAN_PMD_NULL;
> +
> +       pmde = pmd_read_atomic(*pmd);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +       barrier();
> +#endif
> +       if (!pmd_present(pmde))
> +               return SCAN_PMD_NULL;
> +       if (pmd_trans_huge(pmde))
> +               return SCAN_PMD_MAPPED;
> +       if (pmd_bad(pmde))
> +               return SCAN_FAIL;

khugepaged didn't handle pmd_bad before; IIRC it would just return
SCAN_SUCCEED if everything else was good. It is fine to add the check,
but it may be better to return SCAN_PMD_NULL?


> +       return SCAN_SUCCEED;
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -       pmd = mm_find_pmd(mm, address);
> -       if (!pmd) {
> -               result = SCAN_PMD_NULL;
> +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +       if (result != SCAN_SUCCEED)

There are a couple of other callsites of mm_find_pmd(); you may need
to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE,
since khugepaged may collapse the area before MADV_COLLAPSE reacquires
mmap_lock IIUC, and MADV_COLLAPSE does care about this case. It is
fine w/o MADV_COLLAPSE since khugepaged doesn't care whether the range
is PMD-mapped or not.

So it may be better to move this patch right before MADV_COLLAPSE is introduced?

>                 goto out;
> -       }
>
>         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 04fac1af870b..c9979c6ad7a1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>         return vma_address(page, vma);
>  }
>
> -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)

It may be better to add some comments for mm_find_pmd_raw() and mm_find_pmd().

>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd = NULL;
> -       pmd_t pmde;
>
>         pgd = pgd_offset(mm, address);
>         if (!pgd_present(*pgd))
> @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>                 goto out;
>
>         pmd = pmd_offset(pud, address);
> +out:
> +       return pmd;
> +}
> +
> +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pmd_t pmde;
> +       pmd_t *pmd;
> +
> +       pmd = mm_find_pmd_raw(mm, address);
> +       if (!pmd)
> +               goto out;
>         /*
>          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
>          * without holding anon_vma lock for write.  So when looking for a
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-06-04  0:39 ` [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
@ 2022-06-06 20:50   ` Yang Shi
  2022-06-29 21:58   ` Peter Xu
  1 sibling, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 20:50 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The following code is duplicated in collapse_huge_page() and
> collapse_file():
>
>         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>
>         new_page = khugepaged_alloc_page(hpage, gfp, node);
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>                 goto out;
>         }
>
>         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
>                 result = SCAN_CGROUP_CHARGE_FAIL;
>                 goto out;
>         }
>         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
>
> Also, "node" is passed as an argument to both collapse_huge_page() and
> collapse_file() and obtained the same way, via
> khugepaged_find_target_node().
>
> Move all this into a new helper, alloc_charge_hpage(), and remove the
> duplicate code from collapse_huge_page() and collapse_file().  Also,
> simplify khugepaged_alloc_page() by returning a bool indicating
> allocation success instead of a copy of the allocated struct page.
>
> Suggested-by: Peter Xu <peterx@redhat.com>
>
> ---
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/khugepaged.c | 77 ++++++++++++++++++++++---------------------------
>  1 file changed, 34 insertions(+), 43 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 907d0b2bd4bd..38488d114073 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -860,19 +860,18 @@ static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
>         return false;
>  }
>
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  {
>         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
>         if (unlikely(!*hpage)) {
>                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>                 *hpage = ERR_PTR(-ENOMEM);
> -               return NULL;
> +               return false;
>         }
>
>         prep_transhuge_page(*hpage);
>         count_vm_event(THP_COLLAPSE_ALLOC);
> -       return *hpage;
> +       return true;
>  }
>
>  /*
> @@ -995,10 +994,23 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>         return true;
>  }
>
> -static void collapse_huge_page(struct mm_struct *mm,
> -                                  unsigned long address,
> -                                  struct page **hpage,
> -                                  int node, int referenced, int unmapped)
> +static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> +                             struct collapse_control *cc)
> +{
> +       gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +       int node = khugepaged_find_target_node(cc);
> +
> +       if (!khugepaged_alloc_page(hpage, gfp, node))
> +               return SCAN_ALLOC_HUGE_PAGE_FAIL;
> +       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> +               return SCAN_CGROUP_CHARGE_FAIL;
> +       count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
> +       return SCAN_SUCCEED;
> +}
> +
> +static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> +                              struct page **hpage, int referenced,
> +                              int unmapped, struct collapse_control *cc)
>  {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
> @@ -1009,13 +1021,9 @@ static void collapse_huge_page(struct mm_struct *mm,
>         int isolated = 0, result = 0;
>         struct vm_area_struct *vma;
>         struct mmu_notifier_range range;
> -       gfp_t gfp;
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -       /* Only allocate from the target node */
> -       gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> -
>         /*
>          * Before allocating the hugepage, release the mmap_lock read lock.
>          * The allocation can take potentially a long time if it involves
> @@ -1023,17 +1031,12 @@ static void collapse_huge_page(struct mm_struct *mm,
>          * that. We will recheck the vma after taking it again in write mode.
>          */
>         mmap_read_unlock(mm);
> -       new_page = khugepaged_alloc_page(hpage, gfp, node);
> -       if (!new_page) {
> -               result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> -               goto out_nolock;
> -       }
>
> -       if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
> -               result = SCAN_CGROUP_CHARGE_FAIL;
> +       result = alloc_charge_hpage(hpage, mm, cc);
> +       if (result != SCAN_SUCCEED)
>                 goto out_nolock;
> -       }
> -       count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> +
> +       new_page = *hpage;
>
>         mmap_read_lock(mm);
>         result = hugepage_vma_revalidate(mm, address, &vma);
> @@ -1306,10 +1309,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>  out_unmap:
>         pte_unmap_unlock(pte, ptl);
>         if (ret) {
> -               node = khugepaged_find_target_node(cc);
>                 /* collapse_huge_page will return with the mmap_lock released */
> -               collapse_huge_page(mm, address, hpage, node,
> -                               referenced, unmapped);
> +               collapse_huge_page(mm, address, hpage, referenced, unmapped,
> +                                  cc);
>         }
>  out:
>         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -1578,7 +1580,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   * @file: file that collapse on
>   * @start: collapse start address
>   * @hpage: new allocated huge page for collapse
> - * @node: appointed node the new huge page allocate from
> + * @cc: collapse context and scratchpad
>   *
>   * Basic scheme is simple, details are more complex:
>   *  - allocate and lock a new huge page;
> @@ -1595,12 +1597,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   *    + restore gaps in the page cache;
>   *    + unlock and free huge page;
>   */
> -static void collapse_file(struct mm_struct *mm,
> -               struct file *file, pgoff_t start,
> -               struct page **hpage, int node)
> +static void collapse_file(struct mm_struct *mm, struct file *file,
> +                         pgoff_t start, struct page **hpage,
> +                         struct collapse_control *cc)
>  {
>         struct address_space *mapping = file->f_mapping;
> -       gfp_t gfp;
>         struct page *new_page;
>         pgoff_t index, end = start + HPAGE_PMD_NR;
>         LIST_HEAD(pagelist);
> @@ -1612,20 +1613,11 @@ static void collapse_file(struct mm_struct *mm,
>         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
> -       /* Only allocate from the target node */
> -       gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> -
> -       new_page = khugepaged_alloc_page(hpage, gfp, node);
> -       if (!new_page) {
> -               result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> +       result = alloc_charge_hpage(hpage, mm, cc);
> +       if (result != SCAN_SUCCEED)
>                 goto out;
> -       }
>
> -       if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
> -               result = SCAN_CGROUP_CHARGE_FAIL;
> -               goto out;
> -       }
> -       count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> +       new_page = *hpage;
>
>         /*
>          * Ensure we have slots for all the pages in the range.  This is
> @@ -2037,8 +2029,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> -                       node = khugepaged_find_target_node(cc);
> -                       collapse_file(mm, file, start, hpage, node);
> +                       collapse_file(mm, file, start, hpage, cc);
>                 }
>         }
>
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific
  2022-06-04  0:39 ` [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific Zach O'Keefe
@ 2022-06-06 20:58   ` Yang Shi
  2022-06-07 19:56     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 20:58 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add a gfp_t flags member to struct collapse_control that allows contexts
> to specify their own allocation semantics.  This decouples the
> allocation semantics from
> /sys/kernel/mm/transparent_hugepage/khugepaged/defrag.
>
> khugepaged updates this member for every hugepage processed, since the
> sysfs setting might change at any time.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 38488d114073..ba722347bebd 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -92,6 +92,9 @@ struct collapse_control {
>
>         /* Last target selected in khugepaged_find_target_node() */
>         int last_target_node;
> +
> +       /* gfp used for allocation and memcg charging */
> +       gfp_t gfp;
>  };
>
>  /**
> @@ -994,15 +997,14 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>         return true;
>  }
>
> -static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> +static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,

Why did you have to reverse the order of mm and hpage? It seems
pointless and you could save a couple of changed lines.

>                               struct collapse_control *cc)
>  {
> -       gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>         int node = khugepaged_find_target_node(cc);
>
> -       if (!khugepaged_alloc_page(hpage, gfp, node))
> +       if (!khugepaged_alloc_page(hpage, cc->gfp, node))
>                 return SCAN_ALLOC_HUGE_PAGE_FAIL;
> -       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> +       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
>                 return SCAN_CGROUP_CHARGE_FAIL;
>         count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
>         return SCAN_SUCCEED;
> @@ -1032,7 +1034,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          */
>         mmap_read_unlock(mm);
>
> -       result = alloc_charge_hpage(hpage, mm, cc);
> +       result = alloc_charge_hpage(mm, hpage, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out_nolock;
>
> @@ -1613,7 +1615,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
> -       result = alloc_charge_hpage(hpage, mm, cc);
> +       result = alloc_charge_hpage(mm, hpage, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out;
>
> @@ -2037,8 +2039,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>  }
>  #else
>  static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> -                                pgoff_t start, struct page **hpage,
> -                                struct collapse_control *cc)
> +                                pgoff_t start, struct collapse_control *cc)

Why was the !CONFIG_SHMEM version definition changed, but CONFIG_SHMEM
version was not?

>  {
>         BUILD_BUG();
>  }
> @@ -2121,6 +2122,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>                         if (unlikely(khugepaged_test_exit(mm)))
>                                 goto breakouterloop;
>
> +                       /* reset gfp flags since sysfs settings might change */
> +                       cc->gfp = alloc_hugepage_khugepaged_gfpmask() |
> +                                       __GFP_THISNODE;
>                         VM_BUG_ON(khugepaged_scan.address < hstart ||
>                                   khugepaged_scan.address + HPAGE_PMD_SIZE >
>                                   hend);
> @@ -2255,6 +2259,7 @@ static int khugepaged(void *none)
>         struct mm_slot *mm_slot;
>         struct collapse_control cc = {
>                 .last_target_node = NUMA_NO_NODE,
> +               /* .gfp set later  */

Seems pointless to me.

>         };
>
>         set_freezable();
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06 20:20         ` Yang Shi
@ 2022-06-06 21:22           ` Yang Shi
  -1 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 21:22 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: kernel test robot, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, kbuild-all, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin

On Mon, Jun 6, 2022 at 1:20 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 9:40 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Zach,
> > >
> > > Thank you for the patch! Perhaps something to improve:
> > >
> > > [auto build test WARNING on akpm-mm/mm-everything]
> > >
> > > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > > reproduce (this is a W=1 build):
> > >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> > >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> > >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> > >         # save the config file
> > >         mkdir build_dir && cp config build_dir/.config
> > >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> > >
> > > If you fix the issue, kindly add following tag where applicable
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > > All warnings (new ones prefixed by >>):
> > >
> > >    mm/khugepaged.c: In function 'khugepaged':
> > > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > >     2284 | }
> > >          | ^
> >
> > Thanks lkp@intel.com.
> >
> > This is due to config with:
> >
> > CONFIG_FRAME_WARN=2048
> > CONFIG_NODES_SHIFT=10
> >
> > Where struct collapse_control has a member int
> > node_load[MAX_NUMNODES], and we stack allocate one.
> >
> > Is this a configuration that needs to be supported? 1024 nodes seems
> > like a lot and I'm not sure if these configs are randomly generated or
> > are reminiscent of real systems.
>
> I don't have a better idea other than moving it out of the
> collapse_control struct. You may consider changing node_load to two
> dimensions, for example:
>
> node_load[2][MAX_NUMNODES], then define:
> enum {
>     /* khugepaged */
>     COLLAPSE_ASYNC,
>     /* MADV_COLLAPSE */
>     COLLAPSE_SYNC
> }
>
> Then khugepaged and MADV_COLLAPSE get their dedicated node_load respectively.

Sorry, I just realized this won't work for MADV_COLLAPSE since
multiple processes may call it at the same time. We may consider
allocating it dynamically: have node_load be the last element of the
collapse_control struct, then do something like:
for_each_node(node)
    kmalloc(sizeof(int), GFP_KERNEL);

MADV_COLLAPSE or khugepaged could just fail if the allocation fails,
since a THP allocation is unlikely to succeed in that case anyway. But
I'm not sure it is worth the complexity rather than just killing
node_load.
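
As an untested sketch (assuming node_load becomes a flexible array
member sized by nr_node_ids; alloc_collapse_control() is just a made-up
helper name), the allocation could look like:

struct collapse_control {
        /* Last target selected in khugepaged_find_target_node() */
        int last_target_node;

        /* Num pages scanned per node, sized for the actual node count */
        int node_load[];
};

static struct collapse_control *alloc_collapse_control(void)
{
        struct collapse_control *cc;

        cc = kzalloc(struct_size(cc, node_load, nr_node_ids), GFP_KERNEL);
        if (cc)
                cc->last_target_node = NUMA_NO_NODE;
        return cc;
}

khugepaged and each MADV_COLLAPSE caller would then allocate and kfree()
their own cc, and simply bail out on allocation failure as above.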

>
> The more aggressive approach may be just killing node_load, but I'm
> not sure what impact it may incur.
>
> >
> > Thanks,
> > Zach
> >
> > >
> > > vim +2284 mm/khugepaged.c
> > >
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2261
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2262  static int khugepaged(void *none)
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2263  {
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2264      struct mm_slot *mm_slot;
> > > d87b6065d6050b Zach O'Keefe       2022-06-03  2265      struct collapse_control cc = {
> > > d87b6065d6050b Zach O'Keefe       2022-06-03  2266              .last_target_node = NUMA_NO_NODE,
> > > d87b6065d6050b Zach O'Keefe       2022-06-03  2267      };
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2268
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2269      set_freezable();
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2270      set_user_nice(current, MAX_NICE);
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2271
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2272      while (!kthread_should_stop()) {
> > > d87b6065d6050b Zach O'Keefe       2022-06-03  2273              khugepaged_do_scan(&cc);
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2274              khugepaged_wait_work();
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2275      }
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2276
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2277      spin_lock(&khugepaged_mm_lock);
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2278      mm_slot = khugepaged_scan.mm_slot;
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2279      khugepaged_scan.mm_slot = NULL;
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2280      if (mm_slot)
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2281              collect_mm_slot(mm_slot);
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2282      spin_unlock(&khugepaged_mm_lock);
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2283      return 0;
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @2284  }
> > > b46e756f5e4703 Kirill A. Shutemov 2016-07-26  2285
> > >
> > > --
> > > 0-DAY CI Kernel Test Service
> > > https://01.org/lkp
> > >


^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06 16:40       ` Zach O'Keefe
@ 2022-06-06 22:23         ` Andrew Morton
  -1 siblings, 0 replies; 63+ messages in thread
From: Andrew Morton @ 2022-06-06 22:23 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: kernel test robot, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi,
	Zi Yan, linux-mm, kbuild-all, Andrea Arcangeli, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

On Mon, 6 Jun 2022 09:40:20 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Zach,
> >
> > Thank you for the patch! Perhaps something to improve:
> >
> > [auto build test WARNING on akpm-mm/mm-everything]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > reproduce (this is a W=1 build):
> >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> >         # save the config file
> >         mkdir build_dir && cp config build_dir/.config
> >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> >
> > If you fix the issue, kindly add following tag where applicable
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> >    mm/khugepaged.c: In function 'khugepaged':
> > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> >     2284 | }
> >          | ^
> 
> Thanks lkp@intel.com.
> 
> This is due to config with:
> 
> CONFIG_FRAME_WARN=2048
> CONFIG_NODES_SHIFT=10
> 
> Where struct collapse_control has a member int
> node_load[MAX_NUMNODES], and we stack allocate one.
> 
> Is this a configuration that needs to be supported? 1024 nodes seems
> like a lot and I'm not sure if these configs are randomly generated or
> are reminiscent of real systems.

Adding 4k to the stack isn't a good thing to do.  It's trivial to
kmalloc the thing, so why not do that?
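
E.g., something like this (untested, keeping node_load as a fixed-size
array and just moving the whole struct off the stack):

static int khugepaged(void *none)
{
        struct mm_slot *mm_slot;
        struct collapse_control *cc;

        cc = kzalloc(sizeof(*cc), GFP_KERNEL);
        if (!cc)
                return -ENOMEM;
        cc->last_target_node = NUMA_NO_NODE;

        set_freezable();
        set_user_nice(current, MAX_NICE);

        while (!kthread_should_stop()) {
                khugepaged_do_scan(cc);
                khugepaged_wait_work();
        }

        /* ... collect_mm_slot() teardown as before ... */
        kfree(cc);
        return 0;
}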

I'll await some reviewer input (hopefully positive ;)) before merging
this series.


^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-06-04  0:39 ` [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
@ 2022-06-06 22:39   ` Yang Shi
  2022-06-07  0:17     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 22:39 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Pipe enum scan_result codes back through return values of functions
> downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
> inform callers if the operation was successful, and if not, why.
>
> Since khugepaged_scan_pmd()'s return value already has a specific
> meaning (whether mmap_lock was unlocked or not), add a bool* argument
> to khugepaged_scan_pmd() to retrieve this information.
>
> Change khugepaged to take action based on the return values of
> khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
> deep within the collapsing functions themselves.
>
> Remove dependency on error pointers to communicate to khugepaged that
> allocation failed and it should sleep; instead just use the result of
> the scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

A couple of minor nits below...

> ---
>  mm/khugepaged.c | 192 ++++++++++++++++++++++++------------------------
>  1 file changed, 96 insertions(+), 96 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ba722347bebd..03e0da0008f1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -722,13 +722,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 result = SCAN_SUCCEED;
>                 trace_mm_collapse_huge_page_isolate(page, none_or_zero,
>                                                     referenced, writable, result);
> -               return 1;
> +               return SCAN_SUCCEED;

You could do "return result" too.

>         }
>  out:
>         release_pte_pages(pte, _pte, compound_pagelist);
>         trace_mm_collapse_huge_page_isolate(page, none_or_zero,
>                                             referenced, writable, result);
> -       return 0;
> +       return result;
>  }
>
>  static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> @@ -850,14 +850,13 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
>  #endif
>
>  /* Sleep for the first alloc fail, break the loop for the second fail */
> -static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
> +static bool alloc_fail_should_sleep(int result, bool *wait)
>  {
> -       if (IS_ERR(*hpage)) {
> +       if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
>                 if (!*wait)
>                         return true;
>
>                 *wait = false;
> -               *hpage = NULL;
>                 khugepaged_alloc_sleep();
>         }
>         return false;
> @@ -868,7 +867,6 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
>         if (unlikely(!*hpage)) {
>                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> -               *hpage = ERR_PTR(-ENOMEM);
>                 return false;
>         }
>
> @@ -1010,17 +1008,17 @@ static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
>         return SCAN_SUCCEED;
>  }
>
> -static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -                              struct page **hpage, int referenced,
> -                              int unmapped, struct collapse_control *cc)
> +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> +                             int referenced, int unmapped,
> +                             struct collapse_control *cc)
>  {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
>         pte_t *pte;
>         pgtable_t pgtable;
> -       struct page *new_page;
> +       struct page *hpage;
>         spinlock_t *pmd_ptl, *pte_ptl;
> -       int isolated = 0, result = 0;
> +       int result = SCAN_FAIL;
>         struct vm_area_struct *vma;
>         struct mmu_notifier_range range;
>
> @@ -1034,12 +1032,10 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          */
>         mmap_read_unlock(mm);
>
> -       result = alloc_charge_hpage(mm, hpage, cc);
> +       result = alloc_charge_hpage(mm, &hpage, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out_nolock;
>
> -       new_page = *hpage;
> -
>         mmap_read_lock(mm);
>         result = hugepage_vma_revalidate(mm, address, &vma);
>         if (result) {
> @@ -1100,11 +1096,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         mmu_notifier_invalidate_range_end(&range);
>
>         spin_lock(pte_ptl);
> -       isolated = __collapse_huge_page_isolate(vma, address, pte,
> -                       &compound_pagelist);
> +       result =  __collapse_huge_page_isolate(vma, address, pte,
> +                                              &compound_pagelist);
>         spin_unlock(pte_ptl);
>
> -       if (unlikely(!isolated)) {
> +       if (unlikely(result != SCAN_SUCCEED)) {
>                 pte_unmap(pte);
>                 spin_lock(pmd_ptl);
>                 BUG_ON(!pmd_none(*pmd));
> @@ -1116,7 +1112,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                 pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>                 spin_unlock(pmd_ptl);
>                 anon_vma_unlock_write(vma->anon_vma);
> -               result = SCAN_FAIL;
>                 goto out_up_write;
>         }
>
> @@ -1126,8 +1121,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          */
>         anon_vma_unlock_write(vma->anon_vma);
>
> -       __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> -                       &compound_pagelist);
> +       __collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> +                                 &compound_pagelist);
>         pte_unmap(pte);
>         /*
>          * spin_lock() below is not the equivalent of smp_wmb(), but
> @@ -1135,43 +1130,42 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          * avoid the copy_huge_page writes to become visible after
>          * the set_pmd_at() write.
>          */
> -       __SetPageUptodate(new_page);
> +       __SetPageUptodate(hpage);
>         pgtable = pmd_pgtable(_pmd);
>
> -       _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
> +       _pmd = mk_huge_pmd(hpage, vma->vm_page_prot);
>         _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>
>         spin_lock(pmd_ptl);
>         BUG_ON(!pmd_none(*pmd));
> -       page_add_new_anon_rmap(new_page, vma, address);
> -       lru_cache_add_inactive_or_unevictable(new_page, vma);
> +       page_add_new_anon_rmap(hpage, vma, address);
> +       lru_cache_add_inactive_or_unevictable(hpage, vma);
>         pgtable_trans_huge_deposit(mm, pmd, pgtable);
>         set_pmd_at(mm, address, pmd, _pmd);
>         update_mmu_cache_pmd(vma, address, pmd);
>         spin_unlock(pmd_ptl);
>
> -       *hpage = NULL;
> +       hpage = NULL;
>
> -       khugepaged_pages_collapsed++;
>         result = SCAN_SUCCEED;
>  out_up_write:
>         mmap_write_unlock(mm);
>  out_nolock:
> -       if (!IS_ERR_OR_NULL(*hpage)) {
> -               mem_cgroup_uncharge(page_folio(*hpage));
> -               put_page(*hpage);
> +       if (hpage) {
> +               mem_cgroup_uncharge(page_folio(hpage));
> +               put_page(hpage);
>         }
> -       trace_mm_collapse_huge_page(mm, isolated, result);
> -       return;
> +       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> +       return result;
>  }
>
>  static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> -                              unsigned long address, struct page **hpage,
> +                              unsigned long address, bool *mmap_locked,
>                                struct collapse_control *cc)
>  {
>         pmd_t *pmd;
>         pte_t *pte, *_pte;
> -       int ret = 0, result = 0, referenced = 0;
> +       int result = SCAN_FAIL, referenced = 0;
>         int none_or_zero = 0, shared = 0;
>         struct page *page = NULL;
>         unsigned long _address;
> @@ -1306,19 +1300,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> -               ret = 1;
>         }
>  out_unmap:
>         pte_unmap_unlock(pte, ptl);
> -       if (ret) {
> +       if (result == SCAN_SUCCEED) {
>                 /* collapse_huge_page will return with the mmap_lock released */
> -               collapse_huge_page(mm, address, hpage, referenced, unmapped,
> -                                  cc);
> +               *mmap_locked = false;
> +               result = collapse_huge_page(mm, address, referenced,
> +                                           unmapped, cc);

Shall we move "*mmap_locked = false" to after the collapse_huge_page()
call? No functional change, but it seems more consistent.

>         }
>  out:
>         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
>                                      none_or_zero, result, unmapped);
> -       return ret;
> +       return result;
>  }
>
>  static void collect_mm_slot(struct mm_slot *mm_slot)
> @@ -1581,7 +1575,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   * @mm: process address space where collapse happens
>   * @file: file that collapse on
>   * @start: collapse start address
> - * @hpage: new allocated huge page for collapse
>   * @cc: collapse context and scratchpad
>   *
>   * Basic scheme is simple, details are more complex:
> @@ -1599,12 +1592,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   *    + restore gaps in the page cache;
>   *    + unlock and free huge page;
>   */
> -static void collapse_file(struct mm_struct *mm, struct file *file,
> -                         pgoff_t start, struct page **hpage,
> -                         struct collapse_control *cc)
> +static int collapse_file(struct mm_struct *mm, struct file *file,
> +                        pgoff_t start, struct collapse_control *cc)
>  {
>         struct address_space *mapping = file->f_mapping;
> -       struct page *new_page;
> +       struct page *hpage;
>         pgoff_t index, end = start + HPAGE_PMD_NR;
>         LIST_HEAD(pagelist);
>         XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> @@ -1615,12 +1607,10 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
> -       result = alloc_charge_hpage(mm, hpage, cc);
> +       result = alloc_charge_hpage(mm, &hpage, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out;
>
> -       new_page = *hpage;
> -
>         /*
>          * Ensure we have slots for all the pages in the range.  This is
>          * almost certainly a no-op because most of the pages must be present
> @@ -1637,14 +1627,14 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                 }
>         } while (1);
>
> -       __SetPageLocked(new_page);
> +       __SetPageLocked(hpage);
>         if (is_shmem)
> -               __SetPageSwapBacked(new_page);
> -       new_page->index = start;
> -       new_page->mapping = mapping;
> +               __SetPageSwapBacked(hpage);
> +       hpage->index = start;
> +       hpage->mapping = mapping;
>
>         /*
> -        * At this point the new_page is locked and not up-to-date.
> +        * At this point the hpage is locked and not up-to-date.
>          * It's safe to insert it into the page cache, because nobody would
>          * be able to map it or use it in another way until we unlock it.
>          */
> @@ -1672,7 +1662,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                                         result = SCAN_FAIL;
>                                         goto xa_locked;
>                                 }
> -                               xas_store(&xas, new_page);
> +                               xas_store(&xas, hpage);
>                                 nr_none++;
>                                 continue;
>                         }
> @@ -1814,19 +1804,19 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                 list_add_tail(&page->lru, &pagelist);
>
>                 /* Finally, replace with the new page. */
> -               xas_store(&xas, new_page);
> +               xas_store(&xas, hpage);
>                 continue;
>  out_unlock:
>                 unlock_page(page);
>                 put_page(page);
>                 goto xa_unlocked;
>         }
> -       nr = thp_nr_pages(new_page);
> +       nr = thp_nr_pages(hpage);
>
>         if (is_shmem)
> -               __mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
> +               __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
>         else {
> -               __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
> +               __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
>                 filemap_nr_thps_inc(mapping);
>                 /*
>                  * Paired with smp_mb() in do_dentry_open() to ensure
> @@ -1837,21 +1827,21 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                 smp_mb();
>                 if (inode_is_open_for_write(mapping->host)) {
>                         result = SCAN_FAIL;
> -                       __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
> +                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
>                         filemap_nr_thps_dec(mapping);
>                         goto xa_locked;
>                 }
>         }
>
>         if (nr_none) {
> -               __mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
> +               __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
>                 if (is_shmem)
> -                       __mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
> +                       __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
>         }
>
>         /* Join all the small entries into a single multi-index entry */
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> -       xas_store(&xas, new_page);
> +       xas_store(&xas, hpage);
>  xa_locked:
>         xas_unlock_irq(&xas);
>  xa_unlocked:
> @@ -1873,11 +1863,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                 index = start;
>                 list_for_each_entry_safe(page, tmp, &pagelist, lru) {
>                         while (index < page->index) {
> -                               clear_highpage(new_page + (index % HPAGE_PMD_NR));
> +                               clear_highpage(hpage + (index % HPAGE_PMD_NR));
>                                 index++;
>                         }
> -                       copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
> -                                       page);
> +                       copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
> +                                     page);
>                         list_del(&page->lru);
>                         page->mapping = NULL;
>                         page_ref_unfreeze(page, 1);
> @@ -1888,23 +1878,23 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                         index++;
>                 }
>                 while (index < end) {
> -                       clear_highpage(new_page + (index % HPAGE_PMD_NR));
> +                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
>                         index++;
>                 }
>
> -               SetPageUptodate(new_page);
> -               page_ref_add(new_page, HPAGE_PMD_NR - 1);
> +               SetPageUptodate(hpage);
> +               page_ref_add(hpage, HPAGE_PMD_NR - 1);
>                 if (is_shmem)
> -                       set_page_dirty(new_page);
> -               lru_cache_add(new_page);
> +                       set_page_dirty(hpage);
> +               lru_cache_add(hpage);
>
>                 /*
>                  * Remove pte page tables, so we can re-fault the page as huge.
>                  */
>                 retract_page_tables(mapping, start);
> -               *hpage = NULL;
> -
> -               khugepaged_pages_collapsed++;
> +               unlock_page(hpage);
> +               hpage = NULL;
> +               goto out;

Maybe just set hpage to NULL here, then later...

>         } else {
>                 struct page *page;
>
> @@ -1943,22 +1933,22 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
>                 VM_BUG_ON(nr_none);
>                 xas_unlock_irq(&xas);
>
> -               new_page->mapping = NULL;
> +               hpage->mapping = NULL;
>         }
>
> -       unlock_page(new_page);
> +       unlock_page(hpage);

do
if (hpage)
    unlock_page(hpage);

>  out:
>         VM_BUG_ON(!list_empty(&pagelist));
> -       if (!IS_ERR_OR_NULL(*hpage)) {
> -               mem_cgroup_uncharge(page_folio(*hpage));
> -               put_page(*hpage);
> +       if (hpage) {
> +               mem_cgroup_uncharge(page_folio(hpage));
> +               put_page(hpage);
>         }
>         /* TODO: tracepoints */
> +       return result;
>  }
>
> -static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> -                                pgoff_t start, struct page **hpage,
> -                                struct collapse_control *cc)
> +static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> +                               pgoff_t start, struct collapse_control *cc)
>  {
>         struct page *page = NULL;
>         struct address_space *mapping = file->f_mapping;
> @@ -2031,15 +2021,16 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> -                       collapse_file(mm, file, start, hpage, cc);
> +                       result = collapse_file(mm, file, start, cc);
>                 }
>         }
>
>         /* TODO: tracepoints */
> +       return result;
>  }
>  #else
> -static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> -                                pgoff_t start, struct collapse_control *cc)
> +static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, pgoff_t start,
> +                               struct collapse_control *cc)
>  {
>         BUILD_BUG();
>  }
> @@ -2049,8 +2040,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
>  }
>  #endif
>
> -static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> -                                           struct page **hpage,
> +static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                                             struct collapse_control *cc)
>         __releases(&khugepaged_mm_lock)
>         __acquires(&khugepaged_mm_lock)
> @@ -2064,6 +2054,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>
>         VM_BUG_ON(!pages);
>         lockdep_assert_held(&khugepaged_mm_lock);
> +       *result = SCAN_FAIL;
>
>         if (khugepaged_scan.mm_slot)
>                 mm_slot = khugepaged_scan.mm_slot;
> @@ -2117,7 +2108,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>                         goto skip;
>
>                 while (khugepaged_scan.address < hend) {
> -                       int ret;
> +                       bool mmap_locked = true;
> +
>                         cond_resched();
>                         if (unlikely(khugepaged_test_exit(mm)))
>                                 goto breakouterloop;
> @@ -2134,20 +2126,28 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>                                                 khugepaged_scan.address);
>
>                                 mmap_read_unlock(mm);
> -                               ret = 1;
> -                               khugepaged_scan_file(mm, file, pgoff, hpage,
> -                                                    cc);
> +                               mmap_locked = false;
> +                               *result = khugepaged_scan_file(mm, file, pgoff,
> +                                                              cc);
>                                 fput(file);
>                         } else {
> -                               ret = khugepaged_scan_pmd(mm, vma,
> -                                               khugepaged_scan.address,
> -                                               hpage, cc);
> +                               *result = khugepaged_scan_pmd(mm, vma,
> +                                                             khugepaged_scan.address,
> +                                                             &mmap_locked, cc);
>                         }
> +                       if (*result == SCAN_SUCCEED)
> +                               ++khugepaged_pages_collapsed;
>                         /* move to next address */
>                         khugepaged_scan.address += HPAGE_PMD_SIZE;
>                         progress += HPAGE_PMD_NR;
> -                       if (ret)
> -                               /* we released mmap_lock so break loop */
> +                       if (!mmap_locked)
> +                               /*
> +                                * We released mmap_lock so break loop.  Note
> +                                * that we drop mmap_lock before all hugepage
> +                                * allocations, so if allocation fails, we are
> +                                * guaranteed to break here and report the
> +                                * correct result back to caller.
> +                                */
>                                 goto breakouterloop_mmap_lock;
>                         if (progress >= pages)
>                                 goto breakouterloop;
> @@ -2199,15 +2199,15 @@ static int khugepaged_wait_event(void)
>
>  static void khugepaged_do_scan(struct collapse_control *cc)
>  {
> -       struct page *hpage = NULL;
>         unsigned int progress = 0, pass_through_head = 0;
>         unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
>         bool wait = true;
> +       int result = SCAN_SUCCEED;
>
>         lru_add_drain_all();
>
>         while (progress < pages) {
> -               if (alloc_fail_should_sleep(&hpage, &wait))
> +               if (alloc_fail_should_sleep(result, &wait))
>                         break;
>
>                 cond_resched();
> @@ -2221,7 +2221,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
>                 if (khugepaged_has_work() &&
>                     pass_through_head < 2)
>                         progress += khugepaged_scan_mm_slot(pages - progress,
> -                                                           &hpage, cc);
> +                                                           &result, cc);
>                 else
>                         progress = pages;
>                 spin_unlock(&khugepaged_mm_lock);
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics
  2022-06-04  0:39 ` [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics Zach O'Keefe
@ 2022-06-06 22:51   ` Yang Shi
  0 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 22:51 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add enforce_page_heuristics flag to struct collapse_control that allows
> context to ignore heuristics originally designed to guide khugepaged:
>
> 1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
> 2) requirement that some pages in region being collapsed be young or
>    referenced
>
> This flag is set in khugepaged collapse context to preserve existing
> khugepaged behavior.
>
> This flag will be used (unset) when introducing madvise collapse
> context since here, the user presumably has reason to believe the
> collapse will be beneficial and khugepaged heuristics shouldn't tell
> the user they are wrong.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/khugepaged.c | 55 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 37 insertions(+), 18 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 03e0da0008f1..c3589b3e238d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -87,6 +87,13 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>  #define MAX_PTE_MAPPED_THP 8
>
>  struct collapse_control {
> +       /*
> +        * Heuristics:
> +        * - khugepaged_max_ptes_[none|swap|shared]
> +        * - require memory to be young / referenced
> +        */
> +       bool enforce_page_heuristics;
> +
>         /* Num pages scanned per node */
>         int node_load[MAX_NUMNODES];
>
> @@ -604,6 +611,7 @@ static bool is_refcount_suitable(struct page *page)
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                                         unsigned long address,
>                                         pte_t *pte,
> +                                       struct collapse_control *cc,
>                                         struct list_head *compound_pagelist)
>  {
>         struct page *page = NULL;
> @@ -617,7 +625,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 if (pte_none(pteval) || (pte_present(pteval) &&
>                                 is_zero_pfn(pte_pfn(pteval)))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !cc->enforce_page_heuristics)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -637,8 +646,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>                 VM_BUG_ON_PAGE(!PageAnon(page), page);
>
> -               if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +               if (cc->enforce_page_heuristics && page_mapcount(page) > 1 &&
> +                   ++shared > khugepaged_max_ptes_shared) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out;
> @@ -705,9 +714,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         list_add_tail(&page->lru, compound_pagelist);
>  next:
>                 /* There should be enough young pte to collapse the page */
> -               if (pte_young(pteval) ||
> -                   page_is_young(page) || PageReferenced(page) ||
> -                   mmu_notifier_test_young(vma->vm_mm, address))
> +               if (cc->enforce_page_heuristics &&
> +                   (pte_young(pteval) || page_is_young(page) ||
> +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> +                                                                    address)))
>                         referenced++;
>
>                 if (pte_write(pteval))
> @@ -716,7 +726,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>         if (unlikely(!writable)) {
>                 result = SCAN_PAGE_RO;
> -       } else if (unlikely(!referenced)) {
> +       } else if (unlikely(cc->enforce_page_heuristics && !referenced)) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> @@ -1096,7 +1106,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         mmu_notifier_invalidate_range_end(&range);
>
>         spin_lock(pte_ptl);
> -       result =  __collapse_huge_page_isolate(vma, address, pte,
> +       result =  __collapse_huge_page_isolate(vma, address, pte, cc,
>                                                &compound_pagelist);
>         spin_unlock(pte_ptl);
>
> @@ -1185,7 +1195,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>              _pte++, _address += PAGE_SIZE) {
>                 pte_t pteval = *_pte;
>                 if (is_swap_pte(pteval)) {
> -                       if (++unmapped <= khugepaged_max_ptes_swap) {
> +                       if (++unmapped <= khugepaged_max_ptes_swap ||
> +                           !cc->enforce_page_heuristics) {
>                                 /*
>                                  * Always be strict with uffd-wp
>                                  * enabled swap entries.  Please see
> @@ -1204,7 +1215,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                 }
>                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !cc->enforce_page_heuristics)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -1234,8 +1246,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                         goto out_unmap;
>                 }
>
> -               if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +               if (cc->enforce_page_heuristics &&
> +                   page_mapcount(page) > 1 &&
> +                   ++shared > khugepaged_max_ptes_shared) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out_unmap;
> @@ -1289,14 +1302,17 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                         result = SCAN_PAGE_COUNT;
>                         goto out_unmap;
>                 }
> -               if (pte_young(pteval) ||
> -                   page_is_young(page) || PageReferenced(page) ||
> -                   mmu_notifier_test_young(vma->vm_mm, address))
> +               if (cc->enforce_page_heuristics &&
> +                   (pte_young(pteval) || page_is_young(page) ||
> +                    PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
> +                                                                    address)))
>                         referenced++;
>         }
>         if (!writable) {
>                 result = SCAN_PAGE_RO;
> -       } else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
> +       } else if (cc->enforce_page_heuristics &&
> +                  (!referenced ||
> +                   (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> @@ -1966,7 +1982,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                         continue;
>
>                 if (xa_is_value(page)) {
> -                       if (++swap > khugepaged_max_ptes_swap) {
> +                       if (cc->enforce_page_heuristics &&
> +                           ++swap > khugepaged_max_ptes_swap) {
>                                 result = SCAN_EXCEED_SWAP_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>                                 break;
> @@ -2017,7 +2034,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>         rcu_read_unlock();
>
>         if (result == SCAN_SUCCEED) {
> -               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
> +                   cc->enforce_page_heuristics) {
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> @@ -2258,6 +2276,7 @@ static int khugepaged(void *none)
>  {
>         struct mm_slot *mm_slot;
>         struct collapse_control cc = {
> +               .enforce_page_heuristics = true,
>                 .last_target_node = NUMA_NO_NODE,
>                 /* .gfp set later  */
>         };
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled
  2022-06-04  0:39 ` [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled Zach O'Keefe
@ 2022-06-06 23:02   ` Yang Shi
       [not found]   ` <YrzehlUoo2iMMLC2@xz-m1.local>
  1 sibling, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 23:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add enforce_thp_enabled flag to struct collapse_control that allows context
> to ignore constraints imposed by /sys/kernel/transparent_hugepage/enabled.
>
> This flag is set in khugepaged collapse context to preserve existing
> khugepaged behavior.
>
> This flag will be used (unset) when introducing madvise collapse
> context since the desired THP semantics of MADV_COLLAPSE aren't coupled
> to sysfs THP settings.  Most notably, for the purpose of eventual
> madvise_collapse(2) support, this allows userspace to trigger THP collapse
> on behalf of another process, without adding support to meddle with
> the VMA flags of said process, or change sysfs THP settings.
>
> For now, limit this flag to /sys/kernel/transparent_hugepage/enabled,
> but it can be expanded to include
> /sys/kernel/transparent_hugepage/shmem_enabled later.
>
> Link: https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Looks good to me. Reviewed-by: Yang Shi <shy828301@gmail.com>

Just a reminder: I just posted the series
https://lore.kernel.org/linux-mm/20220606214414.736109-1-shy828301@gmail.com/T/#m5dae2dfa4b247f3b3903951dd3a1f0978a927e16,
which changed some logic in hugepage_vma_check(). If your series goes in
after it, you will need some additional tweaks to disregard the sysfs THP
settings.

> ---
>  mm/khugepaged.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c3589b3e238d..4ad04f552347 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ struct collapse_control {
>          */
>         bool enforce_page_heuristics;
>
> +       /* Enforce constraints of
> +        * /sys/kernel/mm/transparent_hugepage/enabled
> +        */
> +       bool enforce_thp_enabled;
> +
>         /* Num pages scanned per node */
>         int node_load[MAX_NUMNODES];
>
> @@ -893,10 +898,12 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>   */
>
>  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> -               struct vm_area_struct **vmap)
> +                                  struct vm_area_struct **vmap,
> +                                  struct collapse_control *cc)
>  {
>         struct vm_area_struct *vma;
>         unsigned long hstart, hend;
> +       unsigned long vma_flags;
>
>         if (unlikely(khugepaged_test_exit(mm)))
>                 return SCAN_ANY_PROCESS;
> @@ -909,7 +916,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         hend = vma->vm_end & HPAGE_PMD_MASK;
>         if (address < hstart || address + HPAGE_PMD_SIZE > hend)
>                 return SCAN_ADDRESS_RANGE;
> -       if (!hugepage_vma_check(vma, vma->vm_flags))
> +
> +       /*
> +        * If !cc->enforce_thp_enabled, set VM_HUGEPAGE so that
> +        * hugepage_vma_check() can pass even if
> +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> +        * Note that hugepage_vma_check() doesn't enforce that
> +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> +        * must be set (i.e. "never" mode).
> +        */
> +       vma_flags = cc->enforce_thp_enabled ?  vma->vm_flags
> +                       : vma->vm_flags | VM_HUGEPAGE;
> +       if (!hugepage_vma_check(vma, vma_flags))
>                 return SCAN_VMA_CHECK;
>         /* Anon VMA expected */
>         if (!vma->anon_vma || !vma_is_anonymous(vma))
> @@ -953,7 +971,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
>  static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>                                         struct vm_area_struct *vma,
>                                         unsigned long haddr, pmd_t *pmd,
> -                                       int referenced)
> +                                       int referenced,
> +                                       struct collapse_control *cc)
>  {
>         int swapped_in = 0;
>         vm_fault_t ret = 0;
> @@ -980,7 +999,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
>                 if (ret & VM_FAULT_RETRY) {
>                         mmap_read_lock(mm);
> -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> +                       if (hugepage_vma_revalidate(mm, haddr, &vma, cc)) {
>                                 /* vma is no longer available, don't continue to swapin */
>                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>                                 return false;
> @@ -1047,7 +1066,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                 goto out_nolock;
>
>         mmap_read_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
>         if (result) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
> @@ -1066,7 +1085,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          * Continuing to collapse causes inconsistency.
>          */
>         if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
> -                                                    pmd, referenced)) {
> +                                                    pmd, referenced, cc)) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
>         }
> @@ -1078,7 +1097,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          * handled by the anon_vma lock + PG_lock.
>          */
>         mmap_write_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, &vma, cc);
>         if (result)
>                 goto out_up_write;
>         /* check if the pmd is still valid */
> @@ -2277,6 +2296,7 @@ static int khugepaged(void *none)
>         struct mm_slot *mm_slot;
>         struct collapse_control cc = {
>                 .enforce_page_heuristics = true,
> +               .enforce_thp_enabled = true,
>                 .last_target_node = NUMA_NO_NODE,
>                 /* .gfp set later  */
>         };
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-04  0:39 ` [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-06-06 23:53   ` Yang Shi
  2022-06-07 22:48     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 23:53 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> This idea was introduced by David Rientjes[1].
>
> Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> synchronous collapse of memory at their own expense.
>
> The benefits of this approach are:
>
> * CPU is charged to the process that wants to spend the cycles for the
>   THP
> * Avoid unpredictable timing of khugepaged collapse
>
> Immediate users of this new functionality are malloc() implementations
> that manage memory in hugepage-sized chunks, but sometimes subrelease
> memory back to the system in native-sized chunks via MADV_DONTNEED,
> zapping the pmd.  Later, when the memory is hot, the implementation
> could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> hugepage coverage and dTLB performance.  TCMalloc is such an
> implementation that could benefit from this[2].
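
(Not part of the patch -- just to illustrate the usage pattern described
above.  A minimal userspace sketch: MADV_COLLAPSE's value is taken from the
uapi hunks below; the 2M hugepage size and the error handling are
assumptions.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25        /* value proposed by this series */
    #endif

    int main(void)
    {
            size_t len = 2UL << 20;                 /* assume 2M PMD-sized THPs */
            char *buf = aligned_alloc(len, len);    /* hugepage-aligned anon buffer */

            if (!buf)
                    return 1;
            memset(buf, 1, len);                    /* fault the range in */
            madvise(buf, 4096, MADV_DONTNEED);      /* subrelease one native page */
            if (madvise(buf, len, MADV_COLLAPSE))   /* later: re-back with a THP */
                    perror("madvise(MADV_COLLAPSE)");
            return 0;
    }
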
>
> Only privately-mapped anon memory is supported for now, but it is
> expected that file and shmem support will be added later to support the
> use-case of backing executable text by THPs.  Current support provided
> by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> which might impair services from serving at their full rated load after
> (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> immediately realize iTLB performance prevent page sharing and demand
> paging, which increases steady state memory footprint.  With
> MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> and lower RAM footprints.
>
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE.
>
> THP allocation may enter direct reclaim and/or compaction.
>
> [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h     |   2 +
>  arch/mips/include/uapi/asm/mman.h      |   2 +
>  arch/parisc/include/uapi/asm/mman.h    |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   2 +
>  include/linux/huge_mm.h                |  12 +++
>  include/uapi/asm-generic/mman-common.h |   2 +
>  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
>  mm/madvise.c                           |   5 +
>  8 files changed, 151 insertions(+)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 4aa996423b0d..763929e814e9 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -76,6 +76,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 1be428663c10..c6e1fc77c996 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -103,6 +103,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index a7ea3204a5fa..22133a6a506e 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -70,6 +70,8 @@
>  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
>
> +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> +
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 7966a58af472..1ff0c858544f 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -111,6 +111,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 648cb3ce7099..2ca2f3b41fc8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                      int advice);
> +int madvise_collapse(struct vm_area_struct *vma,
> +                    struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                            unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>         BUG();
>         return 0;
>  }
> +
> +static inline int madvise_collapse(struct vm_area_struct *vma,
> +                                  struct vm_area_struct **prev,
> +                                  unsigned long start, unsigned long end)
> +{
> +       BUG();
> +       return 0;

I wish -ENOSYS could have been returned, but it seems madvise()
doesn't support this return value.

> +}
> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                          unsigned long start,
>                                          unsigned long end,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ad04f552347..073d6bb03b37 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
>                 set_recommended_min_free_kbytes();
>         mutex_unlock(&khugepaged_mutex);
>  }
> +
> +static int madvise_collapse_errno(enum scan_result r)
> +{
> +       switch (r) {
> +       case SCAN_PMD_NULL:
> +       case SCAN_ADDRESS_RANGE:
> +       case SCAN_VMA_NULL:
> +       case SCAN_PTE_NON_PRESENT:
> +       case SCAN_PAGE_NULL:
> +               /*
> +                * Addresses in the specified range are not currently mapped,
> +                * or are outside the AS of the process.
> +                */
> +               return -ENOMEM;
> +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +       case SCAN_CGROUP_CHARGE_FAIL:
> +               /* A kernel resource was temporarily unavailable. */
> +               return -EAGAIN;

I thought this should return -ENOMEM too.
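
Something like this (just a sketch of the suggested change):

        case SCAN_ALLOC_HUGE_PAGE_FAIL:
        case SCAN_CGROUP_CHARGE_FAIL:
                /* THP allocation or memcg charge failed: report out of memory */
                return -ENOMEM;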

> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end)
> +{
> +       struct collapse_control cc = {
> +               .enforce_page_heuristics = false,
> +               .enforce_thp_enabled = false,
> +               .last_target_node = NUMA_NO_NODE,
> +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> +       };
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long hstart, hend, addr;
> +       int thps = 0, last_fail = SCAN_FAIL;
> +       bool mmap_locked = true;
> +
> +       BUG_ON(vma->vm_start > start);
> +       BUG_ON(vma->vm_end < end);
> +
> +       *prev = vma;
> +
> +       /* TODO: Support file/shmem */
> +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> +               return -EINVAL;
> +
> +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> +       hend = end & HPAGE_PMD_MASK;
> +
> +       /*
> +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> +        * Note that hugepage_vma_check() doesn't enforce that
> +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> +        * must be set (i.e. "never" mode)
> +        */
> +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))

hugepage_vma_check() doesn't check the vma size, so MADV_COLLAPSE may be
run on an unsuitable vma.  hugepage_vma_revalidate(), called by
khugepaged_scan_pmd(), may eventually catch that, but it is a huge waste
of effort.  So it is better to check the vma size upfront.
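
Something along these lines, right after hstart/hend are computed in
madvise_collapse() (sketch only):

        hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
        hend = end & HPAGE_PMD_MASK;
        if (hstart >= hend)     /* no hugepage-aligned/sized region in the range */
                return -EINVAL;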

BTW, my series moved the vma size check into hugepage_vma_check(), so if
your series can be based on top of that, you get this check for free.

> +               return -EINVAL;
> +
> +       mmgrab(mm);
> +       lru_add_drain();
> +
> +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> +               int result = SCAN_FAIL;
> +               bool retry = true;  /* Allow one retry per hugepage */
> +retry:
> +               if (!mmap_locked) {
> +                       cond_resched();
> +                       mmap_read_lock(mm);
> +                       mmap_locked = true;
> +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);

How about making hugepage_vma_revalidate() return SCAN_SUCCEED on success
too?  It seems more consistent.
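
I.e. something like this (sketch), so the call site reads like the other
scan paths:

        result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
        if (result != SCAN_SUCCEED) {
                last_fail = result;
                goto out_nolock;
        }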

> +                       if (result) {
> +                               last_fail = result;
> +                               goto out_nolock;
> +                       }
> +               }
> +               mmap_assert_locked(mm);
> +               memset(cc.node_load, 0, sizeof(cc.node_load));
> +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> +               if (!mmap_locked)
> +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> +
> +               switch (result) {
> +               case SCAN_SUCCEED:
> +               case SCAN_PMD_MAPPED:
> +                       ++thps;
> +                       break;
> +               /* Whitelisted set of results where continuing OK */
> +               case SCAN_PMD_NULL:
> +               case SCAN_PTE_NON_PRESENT:
> +               case SCAN_PTE_UFFD_WP:
> +               case SCAN_PAGE_RO:
> +               case SCAN_LACK_REFERENCED_PAGE:
> +               case SCAN_PAGE_NULL:
> +               case SCAN_PAGE_COUNT:
> +               case SCAN_PAGE_LOCK:
> +               case SCAN_PAGE_COMPOUND:
> +                       last_fail = result;
> +                       break;
> +               case SCAN_PAGE_LRU:
> +                       if (retry) {
> +                               lru_add_drain_all();
> +                               retry = false;
> +                               goto retry;

I'm not sure whether the retry logic is necessary or not, do you have
any data about how retry improves the success rate? You could just
replace lru_add_drain() to lru_add_drain_all() and remove the retry
logic IMHO. I'd prefer to keep it simple at the moment personally.

> +                       }
> +                       fallthrough;
> +               default:
> +                       last_fail = result;
> +                       /* Other error, exit */
> +                       goto out_maybelock;
> +               }
> +       }
> +
> +out_maybelock:
> +       /* Caller expects us to hold mmap_lock on return */
> +       if (!mmap_locked)
> +               mmap_read_lock(mm);
> +out_nolock:
> +       mmap_assert_locked(mm);
> +       mmdrop(mm);
> +
> +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> +                       : madvise_collapse_errno(last_fail);
> +}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 46feb62ce163..eccac2620226 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_FREE:
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
> +       case MADV_COLLAPSE:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_COLLAPSE:
> +               return madvise_collapse(vma, prev, start, end);
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
> +       case MADV_COLLAPSE:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06 22:23         ` Andrew Morton
@ 2022-06-06 23:53           ` Yang Shi
  -1 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-06 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zach O'Keefe, kernel test robot, Alex Shi, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Michal Hocko, Pasha Tatashin,
	Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Zi Yan, Linux MM, kbuild-all, Andrea Arcangeli, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

On Mon, Jun 6, 2022 at 3:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 6 Jun 2022 09:40:20 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
>
> > On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Zach,
> > >
> > > Thank you for the patch! Perhaps something to improve:
> > >
> > > [auto build test WARNING on akpm-mm/mm-everything]
> > >
> > > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > > reproduce (this is a W=1 build):
> > >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> > >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> > >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> > >         # save the config file
> > >         mkdir build_dir && cp config build_dir/.config
> > >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> > >
> > > If you fix the issue, kindly add following tag where applicable
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > > All warnings (new ones prefixed by >>):
> > >
> > >    mm/khugepaged.c: In function 'khugepaged':
> > > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > >     2284 | }
> > >          | ^
> >
> > Thanks lkp@intel.com.
> >
> > This is due to config with:
> >
> > CONFIG_FRAME_WARN=2048
> > CONFIG_NODES_SHIFT=10
> >
> > Where struct collapse_control has a member int
> > node_load[MAX_NUMNODES], and we stack allocate one.
> >
> > Is this a configuration that needs to be supported? 1024 nodes seems
> > like a lot and I'm not sure if these configs are randomly generated or
> > are reminiscent of real systems.
>
> Adding 4k to the stack isn't a good thing to do.  It's trivial to
> kmalloc the thing, so why not do that?

Thanks, Andrew. Yeah, I just suggested that too.
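
For reference, with CONFIG_NODES_SHIFT=10 the node_load[] array alone is
(1 << 10) * sizeof(int) = 4096 bytes, which is where most of that frame size
comes from.  A rough sketch of the kmalloc'd version (untested):

        struct collapse_control *cc;

        cc = kmalloc(sizeof(*cc), GFP_KERNEL);
        if (!cc)
                return -ENOMEM;
        cc->enforce_page_heuristics = true;
        cc->last_target_node = NUMA_NO_NODE;
        /* ... khugepaged main loop as before, passing cc ... */
        kfree(cc);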

>
> I'll await some reviewer input (hopefully positive ;)) before merging
> this series.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions
  2022-06-04  0:39 ` [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
@ 2022-06-06 23:56   ` Yang Shi
  2022-06-07  0:31     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 23:56 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The following functions/tracepoints are shared between khugepaged and
> madvise collapse contexts.  Replace the "khugepaged_" prefix with
> generic "hpage_collapse_" prefix in such cases:
>
> khugepaged_test_exit() -> hpage_collapse_test_exit()
> khugepaged_scan_abort() -> hpage_collapse_scan_abort()
> khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
> khugepaged_find_target_node() -> hpage_collapse_find_target_node()
> khugepaged_alloc_page() -> hpage_collapse_alloc_page()
> huge_memory:mm_khugepaged_scan_pmd ->
>         huge_memory:mm_hpage_collapse_scan_pmd
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  include/trace/events/huge_memory.h |  2 +-
>  mm/khugepaged.c                    | 71 ++++++++++++++++--------------
>  2 files changed, 38 insertions(+), 35 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 55392bf30a03..fb6c73632ff3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -48,7 +48,7 @@ SCAN_STATUS
>  #define EM(a, b)       {a, b},
>  #define EMe(a, b)      {a, b}
>
> -TRACE_EVENT(mm_khugepaged_scan_pmd,
> +TRACE_EVENT(mm_hpage_collapse_scan_pmd,

You may not want to change the name of the tracepoint since it is part of
the kernel ABI.  Otherwise the patch looks good to me.
Reviewed-by: Yang Shi <shy828301@gmail.com>
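
I.e. a sketch of what keeping the ABI-visible event name could look like on
top of this patch:

        -TRACE_EVENT(mm_hpage_collapse_scan_pmd,
        +TRACE_EVENT(mm_khugepaged_scan_pmd,

with the call site still using trace_mm_khugepaged_scan_pmd(), while only
the C scan function itself is renamed.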

>
>         TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
>                  int referenced, int none_or_zero, int status, int unmapped),
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 073d6bb03b37..119c1bc84af7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -102,7 +102,7 @@ struct collapse_control {
>         /* Num pages scanned per node */
>         int node_load[MAX_NUMNODES];
>
> -       /* Last target selected in khugepaged_find_target_node() */
> +       /* Last target selected in hpage_collapse_find_target_node() */
>         int last_target_node;
>
>         /* gfp used for allocation and memcg charging */
> @@ -456,7 +456,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
>         hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
>  }
>
> -static inline int khugepaged_test_exit(struct mm_struct *mm)
> +static inline int hpage_collapse_test_exit(struct mm_struct *mm)
>  {
>         return atomic_read(&mm->mm_users) == 0;
>  }
> @@ -508,7 +508,7 @@ void __khugepaged_enter(struct mm_struct *mm)
>                 return;
>
>         /* __khugepaged_exit() must not run from under us */
> -       VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> +       VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
>         if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
>                 free_mm_slot(mm_slot);
>                 return;
> @@ -562,11 +562,10 @@ void __khugepaged_exit(struct mm_struct *mm)
>         } else if (mm_slot) {
>                 /*
>                  * This is required to serialize against
> -                * khugepaged_test_exit() (which is guaranteed to run
> -                * under mmap sem read mode). Stop here (after we
> -                * return all pagetables will be destroyed) until
> -                * khugepaged has finished working on the pagetables
> -                * under the mmap_lock.
> +                * hpage_collapse_test_exit() (which is guaranteed to run
> +                * under mmap sem read mode). Stop here (after we return all
> +                * pagetables will be destroyed) until khugepaged has finished
> +                * working on the pagetables under the mmap_lock.
>                  */
>                 mmap_write_lock(mm);
>                 mmap_write_unlock(mm);
> @@ -803,7 +802,7 @@ static void khugepaged_alloc_sleep(void)
>         remove_wait_queue(&khugepaged_wait, &wait);
>  }
>
> -static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> +static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
>  {
>         int i;
>
> @@ -834,7 +833,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  }
>
>  #ifdef CONFIG_NUMA
> -static int khugepaged_find_target_node(struct collapse_control *cc)
> +static int hpage_collapse_find_target_node(struct collapse_control *cc)
>  {
>         int nid, target_node = 0, max_value = 0;
>
> @@ -858,7 +857,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
>         return target_node;
>  }
>  #else
> -static int khugepaged_find_target_node(struct collapse_control *cc)
> +static int hpage_collapse_find_target_node(struct collapse_control *cc)
>  {
>         return 0;
>  }
> @@ -877,7 +876,7 @@ static bool alloc_fail_should_sleep(int result, bool *wait)
>         return false;
>  }
>
> -static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  {
>         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
>         if (unlikely(!*hpage)) {
> @@ -905,7 +904,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         unsigned long hstart, hend;
>         unsigned long vma_flags;
>
> -       if (unlikely(khugepaged_test_exit(mm)))
> +       if (unlikely(hpage_collapse_test_exit(mm)))
>                 return SCAN_ANY_PROCESS;
>
>         *vmap = vma = find_vma(mm, address);
> @@ -962,7 +961,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
>
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
> - * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
>   *
>   * Called and returns without pte mapped or spinlocks held,
>   * but with mmap_lock held to protect against vma changes.
> @@ -1027,9 +1026,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
>                               struct collapse_control *cc)
>  {
> -       int node = khugepaged_find_target_node(cc);
> +       int node = hpage_collapse_find_target_node(cc);
>
> -       if (!khugepaged_alloc_page(hpage, cc->gfp, node))
> +       if (!hpage_collapse_alloc_page(hpage, cc->gfp, node))
>                 return SCAN_ALLOC_HUGE_PAGE_FAIL;
>         if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
>                 return SCAN_CGROUP_CHARGE_FAIL;
> @@ -1188,9 +1187,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         return result;
>  }
>
> -static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> -                              unsigned long address, bool *mmap_locked,
> -                              struct collapse_control *cc)
> +static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> +                                  struct vm_area_struct *vma,
> +                                  unsigned long address, bool *mmap_locked,
> +                                  struct collapse_control *cc)
>  {
>         pmd_t *pmd;
>         pte_t *pte, *_pte;
> @@ -1282,7 +1282,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                  * hit record.
>                  */
>                 node = page_to_nid(page);
> -               if (khugepaged_scan_abort(node, cc)) {
> +               if (hpage_collapse_scan_abort(node, cc)) {
>                         result = SCAN_SCAN_ABORT;
>                         goto out_unmap;
>                 }
> @@ -1345,8 +1345,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                                             unmapped, cc);
>         }
>  out:
> -       trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> -                                    none_or_zero, result, unmapped);
> +       trace_mm_hpage_collapse_scan_pmd(mm, page, writable, referenced,
> +                                        none_or_zero, result, unmapped);
>         return result;
>  }
>
> @@ -1356,7 +1356,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
>
>         lockdep_assert_held(&khugepaged_mm_lock);
>
> -       if (khugepaged_test_exit(mm)) {
> +       if (hpage_collapse_test_exit(mm)) {
>                 /* free mm_slot */
>                 hash_del(&mm_slot->hash);
>                 list_del(&mm_slot->mm_node);
> @@ -1530,7 +1530,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
>         if (!mmap_write_trylock(mm))
>                 return;
>
> -       if (unlikely(khugepaged_test_exit(mm)))
> +       if (unlikely(hpage_collapse_test_exit(mm)))
>                 goto out;
>
>         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> @@ -1593,7 +1593,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                          * it'll always mapped in small page size for uffd-wp
>                          * registered ranges.
>                          */
> -                       if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
> +                       if (!hpage_collapse_test_exit(mm) &&
> +                           !userfaultfd_wp(vma))
>                                 collapse_and_free_pmd(mm, vma, addr, pmd);
>                         mmap_write_unlock(mm);
>                 } else {
> @@ -2020,7 +2021,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                 }
>
>                 node = page_to_nid(page);
> -               if (khugepaged_scan_abort(node, cc)) {
> +               if (hpage_collapse_scan_abort(node, cc)) {
>                         result = SCAN_SCAN_ABORT;
>                         break;
>                 }
> @@ -2114,7 +2115,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                 goto breakouterloop_mmap_lock;
>
>         progress++;
> -       if (unlikely(khugepaged_test_exit(mm)))
> +       if (unlikely(hpage_collapse_test_exit(mm)))
>                 goto breakouterloop;
>
>         address = khugepaged_scan.address;
> @@ -2123,7 +2124,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                 unsigned long hstart, hend;
>
>                 cond_resched();
> -               if (unlikely(khugepaged_test_exit(mm))) {
> +               if (unlikely(hpage_collapse_test_exit(mm))) {
>                         progress++;
>                         break;
>                 }
> @@ -2148,7 +2149,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                         bool mmap_locked = true;
>
>                         cond_resched();
> -                       if (unlikely(khugepaged_test_exit(mm)))
> +                       if (unlikely(hpage_collapse_test_exit(mm)))
>                                 goto breakouterloop;
>
>                         /* reset gfp flags since sysfs settings might change */
> @@ -2168,9 +2169,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                                                                cc);
>                                 fput(file);
>                         } else {
> -                               *result = khugepaged_scan_pmd(mm, vma,
> -                                                             khugepaged_scan.address,
> -                                                             &mmap_locked, cc);
> +                               *result = hpage_collapse_scan_pmd(mm, vma,
> +                                                                 khugepaged_scan.address,
> +                                                                 &mmap_locked,
> +                                                                 cc);
>                         }
>                         if (*result == SCAN_SUCCEED)
>                                 ++khugepaged_pages_collapsed;
> @@ -2200,7 +2202,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>          * Release the current mm_slot if this mm is about to die, or
>          * if we scanned all vmas of this mm.
>          */
> -       if (khugepaged_test_exit(mm) || !vma) {
> +       if (hpage_collapse_test_exit(mm) || !vma) {
>                 /*
>                  * Make sure that if mm_users is reaching zero while
>                  * khugepaged runs here, khugepaged_exit will find
> @@ -2482,7 +2484,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                 }
>                 mmap_assert_locked(mm);
>                 memset(cc.node_load, 0, sizeof(cc.node_load));
> -               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> +               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> +                                                &cc);
>                 if (!mmap_locked)
>                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
>
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools
  2022-06-04  0:40 ` [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools Zach O'Keefe
@ 2022-06-06 23:58   ` Yang Shi
  2022-06-07  0:24     ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-06 23:58 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Tools able to translate MADV_COLLAPSE advice to human readable string:
>
> $ tools/perf/trace/beauty/madvise_behavior.sh
> static const char *madvise_advices[] = {
>         [0] = "NORMAL",
>         [1] = "RANDOM",
>         [2] = "SEQUENTIAL",
>         [3] = "WILLNEED",
>         [4] = "DONTNEED",
>         [8] = "FREE",
>         [9] = "REMOVE",
>         [10] = "DONTFORK",
>         [11] = "DOFORK",
>         [12] = "MERGEABLE",
>         [13] = "UNMERGEABLE",
>         [14] = "HUGEPAGE",
>         [15] = "NOHUGEPAGE",
>         [16] = "DONTDUMP",
>         [17] = "DODUMP",
>         [18] = "WIPEONFORK",
>         [19] = "KEEPONFORK",
>         [20] = "COLD",
>         [21] = "PAGEOUT",
>         [22] = "POPULATE_READ",
>         [23] = "POPULATE_WRITE",
>         [24] = "DONTNEED_LOCKED",
>         [25] = "COLLAPSE",
>         [100] = "HWPOISON",
>         [101] = "SOFT_OFFLINE",
> };
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  tools/include/uapi/asm-generic/mman-common.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */

I think this patch could be squashed into patch #9?

> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-06-06 22:39   ` Yang Shi
@ 2022-06-07  0:17     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07  0:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 3:40 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Pipe enum scan_result codes back through return values of functions
> > downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
> > inform callers if the operation was successful, and if not, why.
> >
> > Since khugepaged_scan_pmd()'s return value already has a specific
> > meaning (whether mmap_lock was unlocked or not), add a bool* argument
> > to khugepaged_scan_pmd() to retrieve this information.
> >
> > Change khugepaged to take action based on the return values of
> > khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
> > deep within the collapsing functions themselves.
> >
> > Remove dependency on error pointers to communicate to khugepaged that
> > allocation failed and it should sleep; instead just use the result of
> > the scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>
> Reviewed-by: Yang Shi <shy828301@gmail.com>

Hey Yang,

Thanks for taking the time to review this series. Very much appreciated!

> A couple of minor nits below...
>
> > ---
> >  mm/khugepaged.c | 192 ++++++++++++++++++++++++------------------------
> >  1 file changed, 96 insertions(+), 96 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ba722347bebd..03e0da0008f1 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -722,13 +722,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >                 result = SCAN_SUCCEED;
> >                 trace_mm_collapse_huge_page_isolate(page, none_or_zero,
> >                                                     referenced, writable, result);
> > -               return 1;
> > +               return SCAN_SUCCEED;
>
> You could do "return result" too.

Thanks for catching. Done.

> >         }
> >  out:
> >         release_pte_pages(pte, _pte, compound_pagelist);
> >         trace_mm_collapse_huge_page_isolate(page, none_or_zero,
> >                                             referenced, writable, result);
> > -       return 0;
> > +       return result;
> >  }
> >
> >  static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > @@ -850,14 +850,13 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> >  #endif
> >
> >  /* Sleep for the first alloc fail, break the loop for the second fail */
> > -static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
> > +static bool alloc_fail_should_sleep(int result, bool *wait)
> >  {
> > -       if (IS_ERR(*hpage)) {
> > +       if (result == SCAN_ALLOC_HUGE_PAGE_FAIL) {
> >                 if (!*wait)
> >                         return true;
> >
> >                 *wait = false;
> > -               *hpage = NULL;
> >                 khugepaged_alloc_sleep();
> >         }
> >         return false;
> > @@ -868,7 +867,6 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> >         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> >         if (unlikely(!*hpage)) {
> >                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> > -               *hpage = ERR_PTR(-ENOMEM);
> >                 return false;
> >         }
> >
> > @@ -1010,17 +1008,17 @@ static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
> >         return SCAN_SUCCEED;
> >  }
> >
> > -static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -                              struct page **hpage, int referenced,
> > -                              int unmapped, struct collapse_control *cc)
> > +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > +                             int referenced, int unmapped,
> > +                             struct collapse_control *cc)
> >  {
> >         LIST_HEAD(compound_pagelist);
> >         pmd_t *pmd, _pmd;
> >         pte_t *pte;
> >         pgtable_t pgtable;
> > -       struct page *new_page;
> > +       struct page *hpage;
> >         spinlock_t *pmd_ptl, *pte_ptl;
> > -       int isolated = 0, result = 0;
> > +       int result = SCAN_FAIL;
> >         struct vm_area_struct *vma;
> >         struct mmu_notifier_range range;
> >
> > @@ -1034,12 +1032,10 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          */
> >         mmap_read_unlock(mm);
> >
> > -       result = alloc_charge_hpage(mm, hpage, cc);
> > +       result = alloc_charge_hpage(mm, &hpage, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out_nolock;
> >
> > -       new_page = *hpage;
> > -
> >         mmap_read_lock(mm);
> >         result = hugepage_vma_revalidate(mm, address, &vma);
> >         if (result) {
> > @@ -1100,11 +1096,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         mmu_notifier_invalidate_range_end(&range);
> >
> >         spin_lock(pte_ptl);
> > -       isolated = __collapse_huge_page_isolate(vma, address, pte,
> > -                       &compound_pagelist);
> > +       result =  __collapse_huge_page_isolate(vma, address, pte,
> > +                                              &compound_pagelist);
> >         spin_unlock(pte_ptl);
> >
> > -       if (unlikely(!isolated)) {
> > +       if (unlikely(result != SCAN_SUCCEED)) {
> >                 pte_unmap(pte);
> >                 spin_lock(pmd_ptl);
> >                 BUG_ON(!pmd_none(*pmd));
> > @@ -1116,7 +1112,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                 pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >                 spin_unlock(pmd_ptl);
> >                 anon_vma_unlock_write(vma->anon_vma);
> > -               result = SCAN_FAIL;
> >                 goto out_up_write;
> >         }
> >
> > @@ -1126,8 +1121,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          */
> >         anon_vma_unlock_write(vma->anon_vma);
> >
> > -       __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > -                       &compound_pagelist);
> > +       __collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> > +                                 &compound_pagelist);
> >         pte_unmap(pte);
> >         /*
> >          * spin_lock() below is not the equivalent of smp_wmb(), but
> > @@ -1135,43 +1130,42 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          * avoid the copy_huge_page writes to become visible after
> >          * the set_pmd_at() write.
> >          */
> > -       __SetPageUptodate(new_page);
> > +       __SetPageUptodate(hpage);
> >         pgtable = pmd_pgtable(_pmd);
> >
> > -       _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
> > +       _pmd = mk_huge_pmd(hpage, vma->vm_page_prot);
> >         _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> >
> >         spin_lock(pmd_ptl);
> >         BUG_ON(!pmd_none(*pmd));
> > -       page_add_new_anon_rmap(new_page, vma, address);
> > -       lru_cache_add_inactive_or_unevictable(new_page, vma);
> > +       page_add_new_anon_rmap(hpage, vma, address);
> > +       lru_cache_add_inactive_or_unevictable(hpage, vma);
> >         pgtable_trans_huge_deposit(mm, pmd, pgtable);
> >         set_pmd_at(mm, address, pmd, _pmd);
> >         update_mmu_cache_pmd(vma, address, pmd);
> >         spin_unlock(pmd_ptl);
> >
> > -       *hpage = NULL;
> > +       hpage = NULL;
> >
> > -       khugepaged_pages_collapsed++;
> >         result = SCAN_SUCCEED;
> >  out_up_write:
> >         mmap_write_unlock(mm);
> >  out_nolock:
> > -       if (!IS_ERR_OR_NULL(*hpage)) {
> > -               mem_cgroup_uncharge(page_folio(*hpage));
> > -               put_page(*hpage);
> > +       if (hpage) {
> > +               mem_cgroup_uncharge(page_folio(hpage));
> > +               put_page(hpage);
> >         }
> > -       trace_mm_collapse_huge_page(mm, isolated, result);
> > -       return;
> > +       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > +       return result;
> >  }
> >
> >  static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > -                              unsigned long address, struct page **hpage,
> > +                              unsigned long address, bool *mmap_locked,
> >                                struct collapse_control *cc)
> >  {
> >         pmd_t *pmd;
> >         pte_t *pte, *_pte;
> > -       int ret = 0, result = 0, referenced = 0;
> > +       int result = SCAN_FAIL, referenced = 0;
> >         int none_or_zero = 0, shared = 0;
> >         struct page *page = NULL;
> >         unsigned long _address;
> > @@ -1306,19 +1300,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                 result = SCAN_LACK_REFERENCED_PAGE;
> >         } else {
> >                 result = SCAN_SUCCEED;
> > -               ret = 1;
> >         }
> >  out_unmap:
> >         pte_unmap_unlock(pte, ptl);
> > -       if (ret) {
> > +       if (result == SCAN_SUCCEED) {
> >                 /* collapse_huge_page will return with the mmap_lock released */
> > -               collapse_huge_page(mm, address, hpage, referenced, unmapped,
> > -                                  cc);
> > +               *mmap_locked = false;
> > +               result = collapse_huge_page(mm, address, referenced,
> > +                                           unmapped, cc);
>
> Shall move "*mmap_locked = false" after collapse_huge_page() call? No
> functional change, but seems more consistent.

Done.
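
For reference, I'd expect that call site to end up looking roughly
like the below (sketch only, not the exact v7 hunk):

	if (result == SCAN_SUCCEED) {
		/* collapse_huge_page() returns with the mmap_lock released */
		result = collapse_huge_page(mm, address, referenced,
					    unmapped, cc);
		*mmap_locked = false;
	}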

> >         }
> >  out:
> >         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> >                                      none_or_zero, result, unmapped);
> > -       return ret;
> > +       return result;
> >  }
> >
> >  static void collect_mm_slot(struct mm_slot *mm_slot)
> > @@ -1581,7 +1575,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >   * @mm: process address space where collapse happens
> >   * @file: file that collapse on
> >   * @start: collapse start address
> > - * @hpage: new allocated huge page for collapse
> >   * @cc: collapse context and scratchpad
> >   *
> >   * Basic scheme is simple, details are more complex:
> > @@ -1599,12 +1592,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >   *    + restore gaps in the page cache;
> >   *    + unlock and free huge page;
> >   */
> > -static void collapse_file(struct mm_struct *mm, struct file *file,
> > -                         pgoff_t start, struct page **hpage,
> > -                         struct collapse_control *cc)
> > +static int collapse_file(struct mm_struct *mm, struct file *file,
> > +                        pgoff_t start, struct collapse_control *cc)
> >  {
> >         struct address_space *mapping = file->f_mapping;
> > -       struct page *new_page;
> > +       struct page *hpage;
> >         pgoff_t index, end = start + HPAGE_PMD_NR;
> >         LIST_HEAD(pagelist);
> >         XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > @@ -1615,12 +1607,10 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> >         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> >
> > -       result = alloc_charge_hpage(mm, hpage, cc);
> > +       result = alloc_charge_hpage(mm, &hpage, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out;
> >
> > -       new_page = *hpage;
> > -
> >         /*
> >          * Ensure we have slots for all the pages in the range.  This is
> >          * almost certainly a no-op because most of the pages must be present
> > @@ -1637,14 +1627,14 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                 }
> >         } while (1);
> >
> > -       __SetPageLocked(new_page);
> > +       __SetPageLocked(hpage);
> >         if (is_shmem)
> > -               __SetPageSwapBacked(new_page);
> > -       new_page->index = start;
> > -       new_page->mapping = mapping;
> > +               __SetPageSwapBacked(hpage);
> > +       hpage->index = start;
> > +       hpage->mapping = mapping;
> >
> >         /*
> > -        * At this point the new_page is locked and not up-to-date.
> > +        * At this point the hpage is locked and not up-to-date.
> >          * It's safe to insert it into the page cache, because nobody would
> >          * be able to map it or use it in another way until we unlock it.
> >          */
> > @@ -1672,7 +1662,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                                         result = SCAN_FAIL;
> >                                         goto xa_locked;
> >                                 }
> > -                               xas_store(&xas, new_page);
> > +                               xas_store(&xas, hpage);
> >                                 nr_none++;
> >                                 continue;
> >                         }
> > @@ -1814,19 +1804,19 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                 list_add_tail(&page->lru, &pagelist);
> >
> >                 /* Finally, replace with the new page. */
> > -               xas_store(&xas, new_page);
> > +               xas_store(&xas, hpage);
> >                 continue;
> >  out_unlock:
> >                 unlock_page(page);
> >                 put_page(page);
> >                 goto xa_unlocked;
> >         }
> > -       nr = thp_nr_pages(new_page);
> > +       nr = thp_nr_pages(hpage);
> >
> >         if (is_shmem)
> > -               __mod_lruvec_page_state(new_page, NR_SHMEM_THPS, nr);
> > +               __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> >         else {
> > -               __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
> > +               __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> >                 filemap_nr_thps_inc(mapping);
> >                 /*
> >                  * Paired with smp_mb() in do_dentry_open() to ensure
> > @@ -1837,21 +1827,21 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                 smp_mb();
> >                 if (inode_is_open_for_write(mapping->host)) {
> >                         result = SCAN_FAIL;
> > -                       __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
> > +                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
> >                         filemap_nr_thps_dec(mapping);
> >                         goto xa_locked;
> >                 }
> >         }
> >
> >         if (nr_none) {
> > -               __mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
> > +               __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> >                 if (is_shmem)
> > -                       __mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
> > +                       __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> >         }
> >
> >         /* Join all the small entries into a single multi-index entry */
> >         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > -       xas_store(&xas, new_page);
> > +       xas_store(&xas, hpage);
> >  xa_locked:
> >         xas_unlock_irq(&xas);
> >  xa_unlocked:
> > @@ -1873,11 +1863,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                 index = start;
> >                 list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> >                         while (index < page->index) {
> > -                               clear_highpage(new_page + (index % HPAGE_PMD_NR));
> > +                               clear_highpage(hpage + (index % HPAGE_PMD_NR));
> >                                 index++;
> >                         }
> > -                       copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
> > -                                       page);
> > +                       copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
> > +                                     page);
> >                         list_del(&page->lru);
> >                         page->mapping = NULL;
> >                         page_ref_unfreeze(page, 1);
> > @@ -1888,23 +1878,23 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                         index++;
> >                 }
> >                 while (index < end) {
> > -                       clear_highpage(new_page + (index % HPAGE_PMD_NR));
> > +                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
> >                         index++;
> >                 }
> >
> > -               SetPageUptodate(new_page);
> > -               page_ref_add(new_page, HPAGE_PMD_NR - 1);
> > +               SetPageUptodate(hpage);
> > +               page_ref_add(hpage, HPAGE_PMD_NR - 1);
> >                 if (is_shmem)
> > -                       set_page_dirty(new_page);
> > -               lru_cache_add(new_page);
> > +                       set_page_dirty(hpage);
> > +               lru_cache_add(hpage);
> >
> >                 /*
> >                  * Remove pte page tables, so we can re-fault the page as huge.
> >                  */
> >                 retract_page_tables(mapping, start);
> > -               *hpage = NULL;
> > -
> > -               khugepaged_pages_collapsed++;
> > +               unlock_page(hpage);
> > +               hpage = NULL;
> > +               goto out;
>
> Maybe just set hpage to NULL here, then later...
>
> >         } else {
> >                 struct page *page;
> >
> > @@ -1943,22 +1933,22 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >                 VM_BUG_ON(nr_none);
> >                 xas_unlock_irq(&xas);
> >
> > -               new_page->mapping = NULL;
> > +               hpage->mapping = NULL;
> >         }
> >
> > -       unlock_page(new_page);
> > +       unlock_page(hpage);
>
> do
> if (hpage)
>     unlock_page(hpage);

I don't have a strong preference here. Done.
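
With that, the tail of collapse_file() would read something like the
below (sketch, assuming the success path keeps its unlock_page()
before clearing hpage):

	if (result == SCAN_SUCCEED) {
		/* ... page cache now points at hpage ... */
		retract_page_tables(mapping, start);
		unlock_page(hpage);
		hpage = NULL;
	} else {
		/* ... roll back page cache changes ... */
		hpage->mapping = NULL;
	}

	if (hpage)
		unlock_page(hpage);
out: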


> >  out:
> >         VM_BUG_ON(!list_empty(&pagelist));
> > -       if (!IS_ERR_OR_NULL(*hpage)) {
> > -               mem_cgroup_uncharge(page_folio(*hpage));
> > -               put_page(*hpage);
> > +       if (hpage) {
> > +               mem_cgroup_uncharge(page_folio(hpage));
> > +               put_page(hpage);
> >         }
> >         /* TODO: tracepoints */
> > +       return result;
> >  }
> >
> > -static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > -                                pgoff_t start, struct page **hpage,
> > -                                struct collapse_control *cc)
> > +static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > +                               pgoff_t start, struct collapse_control *cc)
> >  {
> >         struct page *page = NULL;
> >         struct address_space *mapping = file->f_mapping;
> > @@ -2031,15 +2021,16 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                         result = SCAN_EXCEED_NONE_PTE;
> >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                 } else {
> > -                       collapse_file(mm, file, start, hpage, cc);
> > +                       result = collapse_file(mm, file, start, cc);
> >                 }
> >         }
> >
> >         /* TODO: tracepoints */
> > +       return result;
> >  }
> >  #else
> > -static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > -                                pgoff_t start, struct collapse_control *cc)
> > +static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, pgoff_t start,
> > +                               struct collapse_control *cc)
> >  {
> >         BUILD_BUG();
> >  }
> > @@ -2049,8 +2040,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
> >  }
> >  #endif
> >
> > -static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > -                                           struct page **hpage,
> > +static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                             struct collapse_control *cc)
> >         __releases(&khugepaged_mm_lock)
> >         __acquires(&khugepaged_mm_lock)
> > @@ -2064,6 +2054,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >
> >         VM_BUG_ON(!pages);
> >         lockdep_assert_held(&khugepaged_mm_lock);
> > +       *result = SCAN_FAIL;
> >
> >         if (khugepaged_scan.mm_slot)
> >                 mm_slot = khugepaged_scan.mm_slot;
> > @@ -2117,7 +2108,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >                         goto skip;
> >
> >                 while (khugepaged_scan.address < hend) {
> > -                       int ret;
> > +                       bool mmap_locked = true;
> > +
> >                         cond_resched();
> >                         if (unlikely(khugepaged_test_exit(mm)))
> >                                 goto breakouterloop;
> > @@ -2134,20 +2126,28 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >                                                 khugepaged_scan.address);
> >
> >                                 mmap_read_unlock(mm);
> > -                               ret = 1;
> > -                               khugepaged_scan_file(mm, file, pgoff, hpage,
> > -                                                    cc);
> > +                               mmap_locked = false;
> > +                               *result = khugepaged_scan_file(mm, file, pgoff,
> > +                                                              cc);
> >                                 fput(file);
> >                         } else {
> > -                               ret = khugepaged_scan_pmd(mm, vma,
> > -                                               khugepaged_scan.address,
> > -                                               hpage, cc);
> > +                               *result = khugepaged_scan_pmd(mm, vma,
> > +                                                             khugepaged_scan.address,
> > +                                                             &mmap_locked, cc);
> >                         }
> > +                       if (*result == SCAN_SUCCEED)
> > +                               ++khugepaged_pages_collapsed;
> >                         /* move to next address */
> >                         khugepaged_scan.address += HPAGE_PMD_SIZE;
> >                         progress += HPAGE_PMD_NR;
> > -                       if (ret)
> > -                               /* we released mmap_lock so break loop */
> > +                       if (!mmap_locked)
> > +                               /*
> > +                                * We released mmap_lock so break loop.  Note
> > +                                * that we drop mmap_lock before all hugepage
> > +                                * allocations, so if allocation fails, we are
> > +                                * guaranteed to break here and report the
> > +                                * correct result back to caller.
> > +                                */
> >                                 goto breakouterloop_mmap_lock;
> >                         if (progress >= pages)
> >                                 goto breakouterloop;
> > @@ -2199,15 +2199,15 @@ static int khugepaged_wait_event(void)
> >
> >  static void khugepaged_do_scan(struct collapse_control *cc)
> >  {
> > -       struct page *hpage = NULL;
> >         unsigned int progress = 0, pass_through_head = 0;
> >         unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
> >         bool wait = true;
> > +       int result = SCAN_SUCCEED;
> >
> >         lru_add_drain_all();
> >
> >         while (progress < pages) {
> > -               if (alloc_fail_should_sleep(&hpage, &wait))
> > +               if (alloc_fail_should_sleep(result, &wait))
> >                         break;
> >
> >                 cond_resched();
> > @@ -2221,7 +2221,7 @@ static void khugepaged_do_scan(struct collapse_control *cc)
> >                 if (khugepaged_has_work() &&
> >                     pass_through_head < 2)
> >                         progress += khugepaged_scan_mm_slot(pages - progress,
> > -                                                           &hpage, cc);
> > +                                                           &result, cc);
> >                 else
> >                         progress = pages;
> >                 spin_unlock(&khugepaged_mm_lock);
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools
  2022-06-06 23:58   ` Yang Shi
@ 2022-06-07  0:24     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07  0:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 4:58 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Tools able to translate MADV_COLLAPSE advice to human readable string:
> >
> > $ tools/perf/trace/beauty/madvise_behavior.sh
> > static const char *madvise_advices[] = {
> >         [0] = "NORMAL",
> >         [1] = "RANDOM",
> >         [2] = "SEQUENTIAL",
> >         [3] = "WILLNEED",
> >         [4] = "DONTNEED",
> >         [8] = "FREE",
> >         [9] = "REMOVE",
> >         [10] = "DONTFORK",
> >         [11] = "DOFORK",
> >         [12] = "MERGEABLE",
> >         [13] = "UNMERGEABLE",
> >         [14] = "HUGEPAGE",
> >         [15] = "NOHUGEPAGE",
> >         [16] = "DONTDUMP",
> >         [17] = "DODUMP",
> >         [18] = "WIPEONFORK",
> >         [19] = "KEEPONFORK",
> >         [20] = "COLD",
> >         [21] = "PAGEOUT",
> >         [22] = "POPULATE_READ",
> >         [23] = "POPULATE_WRITE",
> >         [24] = "DONTNEED_LOCKED",
> >         [25] = "COLLAPSE",
> >         [100] = "HWPOISON",
> >         [101] = "SOFT_OFFLINE",
> > };
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  tools/include/uapi/asm-generic/mman-common.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> I think this patch could be squashed into patch #9?

Sure, SGTM.  Was just trying to follow what has been done with e.g.
MADV_DONTNEED_LOCKED; but I see no reason not to squash. Done.

> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions
  2022-06-06 23:56   ` Yang Shi
@ 2022-06-07  0:31     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07  0:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 4:56 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > The following functions/tracepoints are shared between khugepaged and
> > madvise collapse contexts.  Replace the "khugepaged_" prefix with
> > generic "hpage_collapse_" prefix in such cases:
> >
> > khugepaged_test_exit() -> hpage_collapse_test_exit()
> > khugepaged_scan_abort() -> hpage_collapse_scan_abort()
> > khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
> > khugepaged_find_target_node() -> hpage_collapse_find_target_node()
> > khugepaged_alloc_page() -> hpage_collapse_alloc_page()
> > huge_memory:mm_khugepaged_scan_pmd ->
> >         huge_memory:mm_hpage_collapse_scan_pmd
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  include/trace/events/huge_memory.h |  2 +-
> >  mm/khugepaged.c                    | 71 ++++++++++++++++--------------
> >  2 files changed, 38 insertions(+), 35 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 55392bf30a03..fb6c73632ff3 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -48,7 +48,7 @@ SCAN_STATUS
> >  #define EM(a, b)       {a, b},
> >  #define EMe(a, b)      {a, b}
> >
> > -TRACE_EVENT(mm_khugepaged_scan_pmd,
> > +TRACE_EVENT(mm_hpage_collapse_scan_pmd,
>
> You may not want to change the name of the tracepoint since it is a
> part of kernel ABI. Otherwise the patch looks good to me.
> Reviewed-by: Yang Shi <shy828301@gmail.com>

Thanks for the review, Yang. Yes, this is something I debated / was
unsure about. For the sake of erring on the safer side, I'll remove
the tracepoint renaming. Thanks for voicing your concerns.

> >
> >         TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
> >                  int referenced, int none_or_zero, int status, int unmapped),
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 073d6bb03b37..119c1bc84af7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -102,7 +102,7 @@ struct collapse_control {
> >         /* Num pages scanned per node */
> >         int node_load[MAX_NUMNODES];
> >
> > -       /* Last target selected in khugepaged_find_target_node() */
> > +       /* Last target selected in hpage_collapse_find_target_node() */
> >         int last_target_node;
> >
> >         /* gfp used for allocation and memcg charging */
> > @@ -456,7 +456,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
> >         hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
> >  }
> >
> > -static inline int khugepaged_test_exit(struct mm_struct *mm)
> > +static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> >  {
> >         return atomic_read(&mm->mm_users) == 0;
> >  }
> > @@ -508,7 +508,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> >                 return;
> >
> >         /* __khugepaged_exit() must not run from under us */
> > -       VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> > +       VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> >         if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
> >                 free_mm_slot(mm_slot);
> >                 return;
> > @@ -562,11 +562,10 @@ void __khugepaged_exit(struct mm_struct *mm)
> >         } else if (mm_slot) {
> >                 /*
> >                  * This is required to serialize against
> > -                * khugepaged_test_exit() (which is guaranteed to run
> > -                * under mmap sem read mode). Stop here (after we
> > -                * return all pagetables will be destroyed) until
> > -                * khugepaged has finished working on the pagetables
> > -                * under the mmap_lock.
> > +                * hpage_collapse_test_exit() (which is guaranteed to run
> > +                * under mmap sem read mode). Stop here (after we return all
> > +                * pagetables will be destroyed) until khugepaged has finished
> > +                * working on the pagetables under the mmap_lock.
> >                  */
> >                 mmap_write_lock(mm);
> >                 mmap_write_unlock(mm);
> > @@ -803,7 +802,7 @@ static void khugepaged_alloc_sleep(void)
> >         remove_wait_queue(&khugepaged_wait, &wait);
> >  }
> >
> > -static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> > +static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >         int i;
> >
> > @@ -834,7 +833,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > -static int khugepaged_find_target_node(struct collapse_control *cc)
> > +static int hpage_collapse_find_target_node(struct collapse_control *cc)
> >  {
> >         int nid, target_node = 0, max_value = 0;
> >
> > @@ -858,7 +857,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
> >         return target_node;
> >  }
> >  #else
> > -static int khugepaged_find_target_node(struct collapse_control *cc)
> > +static int hpage_collapse_find_target_node(struct collapse_control *cc)
> >  {
> >         return 0;
> >  }
> > @@ -877,7 +876,7 @@ static bool alloc_fail_should_sleep(int result, bool *wait)
> >         return false;
> >  }
> >
> > -static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > +static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
> >  {
> >         *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> >         if (unlikely(!*hpage)) {
> > @@ -905,7 +904,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         unsigned long hstart, hend;
> >         unsigned long vma_flags;
> >
> > -       if (unlikely(khugepaged_test_exit(mm)))
> > +       if (unlikely(hpage_collapse_test_exit(mm)))
> >                 return SCAN_ANY_PROCESS;
> >
> >         *vmap = vma = find_vma(mm, address);
> > @@ -962,7 +961,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> > - * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > + * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> >   *
> >   * Called and returns without pte mapped or spinlocks held,
> >   * but with mmap_lock held to protect against vma changes.
> > @@ -1027,9 +1026,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >  static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
> >                               struct collapse_control *cc)
> >  {
> > -       int node = khugepaged_find_target_node(cc);
> > +       int node = hpage_collapse_find_target_node(cc);
> >
> > -       if (!khugepaged_alloc_page(hpage, cc->gfp, node))
> > +       if (!hpage_collapse_alloc_page(hpage, cc->gfp, node))
> >                 return SCAN_ALLOC_HUGE_PAGE_FAIL;
> >         if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
> >                 return SCAN_CGROUP_CHARGE_FAIL;
> > @@ -1188,9 +1187,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         return result;
> >  }
> >
> > -static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > -                              unsigned long address, bool *mmap_locked,
> > -                              struct collapse_control *cc)
> > +static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > +                                  struct vm_area_struct *vma,
> > +                                  unsigned long address, bool *mmap_locked,
> > +                                  struct collapse_control *cc)
> >  {
> >         pmd_t *pmd;
> >         pte_t *pte, *_pte;
> > @@ -1282,7 +1282,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                  * hit record.
> >                  */
> >                 node = page_to_nid(page);
> > -               if (khugepaged_scan_abort(node, cc)) {
> > +               if (hpage_collapse_scan_abort(node, cc)) {
> >                         result = SCAN_SCAN_ABORT;
> >                         goto out_unmap;
> >                 }
> > @@ -1345,8 +1345,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                                             unmapped, cc);
> >         }
> >  out:
> > -       trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > -                                    none_or_zero, result, unmapped);
> > +       trace_mm_hpage_collapse_scan_pmd(mm, page, writable, referenced,
> > +                                        none_or_zero, result, unmapped);
> >         return result;
> >  }
> >
> > @@ -1356,7 +1356,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
> >
> >         lockdep_assert_held(&khugepaged_mm_lock);
> >
> > -       if (khugepaged_test_exit(mm)) {
> > +       if (hpage_collapse_test_exit(mm)) {
> >                 /* free mm_slot */
> >                 hash_del(&mm_slot->hash);
> >                 list_del(&mm_slot->mm_node);
> > @@ -1530,7 +1530,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
> >         if (!mmap_write_trylock(mm))
> >                 return;
> >
> > -       if (unlikely(khugepaged_test_exit(mm)))
> > +       if (unlikely(hpage_collapse_test_exit(mm)))
> >                 goto out;
> >
> >         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> > @@ -1593,7 +1593,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >                          * it'll always mapped in small page size for uffd-wp
> >                          * registered ranges.
> >                          */
> > -                       if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
> > +                       if (!hpage_collapse_test_exit(mm) &&
> > +                           !userfaultfd_wp(vma))
> >                                 collapse_and_free_pmd(mm, vma, addr, pmd);
> >                         mmap_write_unlock(mm);
> >                 } else {
> > @@ -2020,7 +2021,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                 }
> >
> >                 node = page_to_nid(page);
> > -               if (khugepaged_scan_abort(node, cc)) {
> > +               if (hpage_collapse_scan_abort(node, cc)) {
> >                         result = SCAN_SCAN_ABORT;
> >                         break;
> >                 }
> > @@ -2114,7 +2115,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                 goto breakouterloop_mmap_lock;
> >
> >         progress++;
> > -       if (unlikely(khugepaged_test_exit(mm)))
> > +       if (unlikely(hpage_collapse_test_exit(mm)))
> >                 goto breakouterloop;
> >
> >         address = khugepaged_scan.address;
> > @@ -2123,7 +2124,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                 unsigned long hstart, hend;
> >
> >                 cond_resched();
> > -               if (unlikely(khugepaged_test_exit(mm))) {
> > +               if (unlikely(hpage_collapse_test_exit(mm))) {
> >                         progress++;
> >                         break;
> >                 }
> > @@ -2148,7 +2149,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                         bool mmap_locked = true;
> >
> >                         cond_resched();
> > -                       if (unlikely(khugepaged_test_exit(mm)))
> > +                       if (unlikely(hpage_collapse_test_exit(mm)))
> >                                 goto breakouterloop;
> >
> >                         /* reset gfp flags since sysfs settings might change */
> > @@ -2168,9 +2169,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                                                cc);
> >                                 fput(file);
> >                         } else {
> > -                               *result = khugepaged_scan_pmd(mm, vma,
> > -                                                             khugepaged_scan.address,
> > -                                                             &mmap_locked, cc);
> > +                               *result = hpage_collapse_scan_pmd(mm, vma,
> > +                                                                 khugepaged_scan.address,
> > +                                                                 &mmap_locked,
> > +                                                                 cc);
> >                         }
> >                         if (*result == SCAN_SUCCEED)
> >                                 ++khugepaged_pages_collapsed;
> > @@ -2200,7 +2202,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >          * Release the current mm_slot if this mm is about to die, or
> >          * if we scanned all vmas of this mm.
> >          */
> > -       if (khugepaged_test_exit(mm) || !vma) {
> > +       if (hpage_collapse_test_exit(mm) || !vma) {
> >                 /*
> >                  * Make sure that if mm_users is reaching zero while
> >                  * khugepaged runs here, khugepaged_exit will find
> > @@ -2482,7 +2484,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                 }
> >                 mmap_assert_locked(mm);
> >                 memset(cc.node_load, 0, sizeof(cc.node_load));
> > -               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > +               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > +                                                &cc);
> >                 if (!mmap_locked)
> >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> >
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-06 20:45   ` Yang Shi
@ 2022-06-07 16:01     ` Zach O'Keefe
  2022-06-07 19:32       ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07 16:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 1:46 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > When scanning an anon pmd to see if it's eligible for collapse, return
> > SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > file-collapse path, since the latter might identify pte-mapped compound
> > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > know what hugepage-aligned/sized regions are already pmd-mapped.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  include/trace/events/huge_memory.h |  1 +
> >  mm/internal.h                      |  1 +
> >  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
> >  mm/rmap.c                          | 15 ++++++++++++--
> >  4 files changed, 43 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index d651f3437367..55392bf30a03 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -11,6 +11,7 @@
> >         EM( SCAN_FAIL,                  "failed")                       \
> >         EM( SCAN_SUCCEED,               "succeeded")                    \
> >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6e14749ad1e5..f768c7fae668 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> >  /*
> >   * in mm/rmap.c:
> >   */
> > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >
> >  /*
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index cc3d6fb446d5..7a914ca19e96 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -28,6 +28,7 @@ enum scan_result {
> >         SCAN_FAIL,
> >         SCAN_SUCCEED,
> >         SCAN_PMD_NULL,
> > +       SCAN_PMD_MAPPED,
> >         SCAN_EXCEED_NONE_PTE,
> >         SCAN_EXCEED_SWAP_PTE,
> >         SCAN_EXCEED_SHARED_PTE,
> > @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         return 0;
> >  }
> >
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +                                  unsigned long address,
> > +                                  pmd_t **pmd)
> > +{
> > +       pmd_t pmde;
> > +
> > +       *pmd = mm_find_pmd_raw(mm, address);
> > +       if (!*pmd)
> > +               return SCAN_PMD_NULL;
> > +
> > +       pmde = pmd_read_atomic(*pmd);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +       barrier();
> > +#endif
> > +       if (!pmd_present(pmde))
> > +               return SCAN_PMD_NULL;
> > +       if (pmd_trans_huge(pmde))
> > +               return SCAN_PMD_MAPPED;
> > +       if (pmd_bad(pmde))
> > +               return SCAN_FAIL;
>
> khugepaged doesn't handle pmd_bad before, IIRC it may just return
> SCAN_SUCCEED if everything else is good? It is fine to add it, but it
> may be better to return SCAN_PMD_NULL?

Correct, pmd_bad() wasn't handled before. I actually don't know how a
bad pmd might arise in the wild (would love to actually know this),
but I don't see the check hurting (might be overly conservative
though).  Conversely, I'm not sure where things go astray currently if
the pmd is bad. Guess it depends in what way the flags are mutated.
Returning SCAN_PMD_NULL SGTM.
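
i.e. the tail of find_pmd_or_thp_or_none() would become:

	if (!pmd_present(pmde))
		return SCAN_PMD_NULL;
	if (pmd_trans_huge(pmde))
		return SCAN_PMD_MAPPED;
	if (pmd_bad(pmde))
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;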

>
> > +       return SCAN_SUCCEED;
> > +}
> > +
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >
> >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -       pmd = mm_find_pmd(mm, address);
> > -       if (!pmd) {
> > -               result = SCAN_PMD_NULL;
> > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +       if (result != SCAN_SUCCEED)
>
> There are a couple of other callsites for mm_find_pmd(); you may need
> to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE,
> since khugepaged may collapse the area before MADV_COLLAPSE
> reacquires mmap_lock IIUC, and MADV_COLLAPSE does care about this
> case. It is fine w/o MADV_COLLAPSE since khugepaged doesn't care if
> it is PMD mapped or not.

Ya, I was just questioning the same thing after responding above - at
least w.r.t whether the pmd_bad() also needs to be in these callsites
(check for pmd mapping, as you mention, I think is definitely
necessary). Thanks for catching this!

> So it may be better to move this patch right before MADV_COLLAPSE is introduced?

I think this should be ok - I'll give it a try at least.

Again, thank you for taking the time to thoroughly review this.

Best,
Zach

> >                 goto out;
> > -       }
> >
> >         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 04fac1af870b..c9979c6ad7a1 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> >         return vma_address(page, vma);
> >  }
> >
> > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
>
> May be better to have some notes for mm_find_pmd_raw() and mm_find_pmd().
>
> >  {
> >         pgd_t *pgd;
> >         p4d_t *p4d;
> >         pud_t *pud;
> >         pmd_t *pmd = NULL;
> > -       pmd_t pmde;
> >
> >         pgd = pgd_offset(mm, address);
> >         if (!pgd_present(*pgd))
> > @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >                 goto out;
> >
> >         pmd = pmd_offset(pud, address);
> > +out:
> > +       return pmd;
> > +}
> > +
> > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > +{
> > +       pmd_t pmde;
> > +       pmd_t *pmd;
> > +
> > +       pmd = mm_find_pmd_raw(mm, address);
> > +       if (!pmd)
> > +               goto out;
> >         /*
> >          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> >          * without holding anon_vma lock for write.  So when looking for a
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-06-04  0:40 ` [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-06-07 19:14   ` Yang Shi
  0 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-07 19:14 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of its own memory.

It is fine to me. But I'd like to hear more from other folks.
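
For context, calling this from a userspace agent would look roughly
like the sketch below (illustrative only, not part of the series;
there's no libc wrapper, so it goes through syscall(2), and
MADV_COLLAPSE's value is taken from this series):

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* from this series */
#endif

/*
 * Ask the kernel to collapse one hugepage-aligned range of a target
 * process.  @pidfd comes from pidfd_open(2); @addr and @len are
 * assumed to be hugepage-aligned.  Requires CAP_SYS_ADMIN unless the
 * target is the caller itself.
 */
static long collapse_range(int pidfd, void *addr, size_t len)
{
	struct iovec iov = {
		.iov_base = addr,
		.iov_len  = len,
	};

	return syscall(__NR_process_madvise, pidfd, &iov, 1,
		       MADV_COLLAPSE, 0);
}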

>
> This is useful for the development of userspace agents that seek to
> optimize THP utilization system-wide by using userspace signals to
> prioritize what memory is most deserving of being THP-backed.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/madvise.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index eccac2620226..b19e2f4b924c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1171,13 +1171,15 @@ madvise_behavior_valid(int behavior)
>  }
>
>  static bool
> -process_madvise_behavior_valid(int behavior)
> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>  {
>         switch (behavior) {
>         case MADV_COLD:
>         case MADV_PAGEOUT:
>         case MADV_WILLNEED:
>                 return true;
> +       case MADV_COLLAPSE:
> +               return task == current || capable(CAP_SYS_ADMIN);
>         default:
>                 return false;
>         }
> @@ -1455,7 +1457,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>                 goto free_iov;
>         }
>
> -       if (!process_madvise_behavior_valid(behavior)) {
> +       if (!process_madvise_behavior_valid(behavior, task)) {
>                 ret = -EINVAL;
>                 goto release_task;
>         }
> --
> 2.36.1.255.ge46751e96f-goog
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-07 16:01     ` Zach O'Keefe
@ 2022-06-07 19:32       ` Zach O'Keefe
  2022-06-07 21:27         ` Yang Shi
  0 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07 19:32 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jun 7, 2022 at 9:01 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Jun 6, 2022 at 1:46 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> > > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > > file-collapse path, since the latter might identify pte-mapped compound
> > > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > > know what hugepage-aligned/sized regions are already pmd-mapped.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  include/trace/events/huge_memory.h |  1 +
> > >  mm/internal.h                      |  1 +
> > >  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
> > >  mm/rmap.c                          | 15 ++++++++++++--
> > >  4 files changed, 43 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > index d651f3437367..55392bf30a03 100644
> > > --- a/include/trace/events/huge_memory.h
> > > +++ b/include/trace/events/huge_memory.h
> > > @@ -11,6 +11,7 @@
> > >         EM( SCAN_FAIL,                  "failed")                       \
> > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 6e14749ad1e5..f768c7fae668 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> > >  /*
> > >   * in mm/rmap.c:
> > >   */
> > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> > >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > >
> > >  /*
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index cc3d6fb446d5..7a914ca19e96 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -28,6 +28,7 @@ enum scan_result {
> > >         SCAN_FAIL,
> > >         SCAN_SUCCEED,
> > >         SCAN_PMD_NULL,
> > > +       SCAN_PMD_MAPPED,
> > >         SCAN_EXCEED_NONE_PTE,
> > >         SCAN_EXCEED_SWAP_PTE,
> > >         SCAN_EXCEED_SHARED_PTE,
> > > @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >         return 0;
> > >  }
> > >
> > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +                                  unsigned long address,
> > > +                                  pmd_t **pmd)
> > > +{
> > > +       pmd_t pmde;
> > > +
> > > +       *pmd = mm_find_pmd_raw(mm, address);
> > > +       if (!*pmd)
> > > +               return SCAN_PMD_NULL;
> > > +
> > > +       pmde = pmd_read_atomic(*pmd);
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +       barrier();
> > > +#endif
> > > +       if (!pmd_present(pmde))
> > > +               return SCAN_PMD_NULL;
> > > +       if (pmd_trans_huge(pmde))
> > > +               return SCAN_PMD_MAPPED;
> > > +       if (pmd_bad(pmde))
> > > +               return SCAN_FAIL;
> >
> > khugepaged doesn't handle pmd_bad before, IIRC it may just return
> > SCAN_SUCCEED if everything else is good? It is fine to add it, but it
> > may be better to return SCAN_PMD_NULL?
>
> Correct, pmd_bad() wasn't handled before. I actually don't know how a
> bad pmd might arise in the wild (would love to actually know this),
> but I don't see the check hurting (might be overly conservative
> though).  Conversely, I'm not sure where things go astray currently if
> the pmd is bad. Guess it depends in what way the flags are mutated.
> Returning SCAN_PMD_NULL SGTM.
>
> >
> > > +       return SCAN_SUCCEED;
> > > +}
> > > +
> > >  /*
> > >   * Bring missing pages in from swap, to complete THP collapse.
> > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >
> > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > -       pmd = mm_find_pmd(mm, address);
> > > -       if (!pmd) {
> > > -               result = SCAN_PMD_NULL;
> > > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > +       if (result != SCAN_SUCCEED)
> >
> > There are a couple of other callsites for mm_find_pmd(); you may need
> > to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE,
> > since khugepaged may collapse the area before MADV_COLLAPSE
> > reacquires mmap_lock IIUC, and MADV_COLLAPSE does care about this
> > case. It is fine w/o MADV_COLLAPSE since khugepaged doesn't care if
> > it is PMD mapped or not.
>
> Ya, I was just questioning the same thing after responding above - at
> least w.r.t whether the pmd_bad() also needs to be in these callsites
> (check for pmd mapping, as you mention, I think is definitely
> necessary). Thanks for catching this!
>
> > So it may be better to move this patch right before MADV_COLLAPSE is introduced?
>
> I think this should be ok - I'll give it a try at least.
>
> Again, thank you for taking the time to thoroughly review this.
>
> Best,
> Zach
>
> > >                 goto out;
> > > -       }
> > >
> > >         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 04fac1af870b..c9979c6ad7a1 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> > >         return vma_address(page, vma);
> > >  }
> > >
> > > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
> >
> > May be better to have some notes for mm_find_pmd_raw() and mm_find_pmd().
> >

Agreed. Looking over this code again, there are only 3 users of mm_find_pmd():

1) khugepaged
2) ksm (replace_page())
3) split_huge_pmd_address()

Once khugepaged codepaths care about THP-pmds, ksm is the only
remaining user that really wants a pte-mapping pmd.

I've gone and consolidated the open-coded code in
split_huge_pmd_address() to use the mm_find_pmd_raw().

I've also done a name switch:

mm_find_pmd() -> mm_find_pte_pmd()
mm_find_pmd_raw() -> mm_find_pmd()

This basically reverts mm_find_pmd() to its pre commit f72e7dcdd252
("mm: let mm_find_pmd fix buggy race with THP fault")
behavior, and special cases (what will be, after MADV_COLLAPSE file
support) the only remaining callsite which *doesn't* care about
THP-pmds (ksm). The naming here is a little more meaningful than
"*raw", and IMHO more readable.


> > >  {
> > >         pgd_t *pgd;
> > >         p4d_t *p4d;
> > >         pud_t *pud;
> > >         pmd_t *pmd = NULL;
> > > -       pmd_t pmde;
> > >
> > >         pgd = pgd_offset(mm, address);
> > >         if (!pgd_present(*pgd))
> > > @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > >                 goto out;
> > >
> > >         pmd = pmd_offset(pud, address);
> > > +out:
> > > +       return pmd;
> > > +}
> > > +
> > > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > +{
> > > +       pmd_t pmde;
> > > +       pmd_t *pmd;
> > > +
> > > +       pmd = mm_find_pmd_raw(mm, address);
> > > +       if (!pmd)
> > > +               goto out;
> > >         /*
> > >          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > >          * without holding anon_vma lock for write.  So when looking for a
> > > --
> > > 2.36.1.255.ge46751e96f-goog
> > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific
  2022-06-06 20:58   ` Yang Shi
@ 2022-06-07 19:56     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07 19:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 1:59 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add a gfp_t flags member to struct collapse_control that allows contexts
> > to specify their own allocation semantics.  This decouples the
> > allocation semantics from
> > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag.
> >
> > khugepaged updates this member for every hugepage processed, since the
> > sysfs setting might change at any time.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 21 +++++++++++++--------
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 38488d114073..ba722347bebd 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -92,6 +92,9 @@ struct collapse_control {
> >
> >         /* Last target selected in khugepaged_find_target_node() */
> >         int last_target_node;
> > +
> > +       /* gfp used for allocation and memcg charging */
> > +       gfp_t gfp;
> >  };
> >
> >  /**
> > @@ -994,15 +997,14 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >         return true;
> >  }
> >
> > -static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> > +static int alloc_charge_hpage(struct mm_struct *mm, struct page **hpage,
>
> Why did you have to reverse the order of mm and hpage? It seems
> pointless and you could save a couple of changed lines.

At some point many versions ago I must have thought it was cleaner to
be consistent with struct *mm as 1st arg, but I agree the change is
unnecessary and have removed it.

> >                               struct collapse_control *cc)
> >  {
> > -       gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >         int node = khugepaged_find_target_node(cc);
> >
> > -       if (!khugepaged_alloc_page(hpage, gfp, node))
> > +       if (!khugepaged_alloc_page(hpage, cc->gfp, node))
> >                 return SCAN_ALLOC_HUGE_PAGE_FAIL;
> > -       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> > +       if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, cc->gfp)))
> >                 return SCAN_CGROUP_CHARGE_FAIL;
> >         count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
> >         return SCAN_SUCCEED;
> > @@ -1032,7 +1034,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          */
> >         mmap_read_unlock(mm);
> >
> > -       result = alloc_charge_hpage(hpage, mm, cc);
> > +       result = alloc_charge_hpage(mm, hpage, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out_nolock;
> >
> > @@ -1613,7 +1615,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
> >         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> >         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> >
> > -       result = alloc_charge_hpage(hpage, mm, cc);
> > +       result = alloc_charge_hpage(mm, hpage, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out;
> >
> > @@ -2037,8 +2039,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >  }
> >  #else
> >  static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > -                                pgoff_t start, struct page **hpage,
> > -                                struct collapse_control *cc)
> > +                                pgoff_t start, struct collapse_control *cc)
>
> Why was the !CONFIG_SHMEM version definition changed, but CONFIG_SHMEM
> version was not?

Definitely an error - sorry about that. Surprised this wasn't caught
earlier - must be because the signatures are eventually made equal
when struct **hpage is dropped. Have fixed it.

> >  {
> >         BUILD_BUG();
> >  }
> > @@ -2121,6 +2122,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >                         if (unlikely(khugepaged_test_exit(mm)))
> >                                 goto breakouterloop;
> >
> > +                       /* reset gfp flags since sysfs settings might change */
> > +                       cc->gfp = alloc_hugepage_khugepaged_gfpmask() |
> > +                                       __GFP_THISNODE;
> >                         VM_BUG_ON(khugepaged_scan.address < hstart ||
> >                                   khugepaged_scan.address + HPAGE_PMD_SIZE >
> >                                   hend);
> > @@ -2255,6 +2259,7 @@ static int khugepaged(void *none)
> >         struct mm_slot *mm_slot;
> >         struct collapse_control cc = {
> >                 .last_target_node = NUMA_NO_NODE,
> > +               /* .gfp set later  */
>
> Seems pointless to me.

Ack. Removed.

> >         };
> >
> >         set_freezable();
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-07 19:32       ` Zach O'Keefe
@ 2022-06-07 21:27         ` Yang Shi
  2022-06-08  0:27           ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-07 21:27 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jun 7, 2022 at 12:33 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Jun 7, 2022 at 9:01 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 1:46 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > > SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> > > > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > > > file-collapse path, since the latter might identify pte-mapped compound
> > > > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > > > know what hugepage-aligned/sized regions are already pmd-mapped.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  include/trace/events/huge_memory.h |  1 +
> > > >  mm/internal.h                      |  1 +
> > > >  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
> > > >  mm/rmap.c                          | 15 ++++++++++++--
> > > >  4 files changed, 43 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > index d651f3437367..55392bf30a03 100644
> > > > --- a/include/trace/events/huge_memory.h
> > > > +++ b/include/trace/events/huge_memory.h
> > > > @@ -11,6 +11,7 @@
> > > >         EM( SCAN_FAIL,                  "failed")                       \
> > > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 6e14749ad1e5..f768c7fae668 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> > > >  /*
> > > >   * in mm/rmap.c:
> > > >   */
> > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> > > >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > > >
> > > >  /*
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index cc3d6fb446d5..7a914ca19e96 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -28,6 +28,7 @@ enum scan_result {
> > > >         SCAN_FAIL,
> > > >         SCAN_SUCCEED,
> > > >         SCAN_PMD_NULL,
> > > > +       SCAN_PMD_MAPPED,
> > > >         SCAN_EXCEED_NONE_PTE,
> > > >         SCAN_EXCEED_SWAP_PTE,
> > > >         SCAN_EXCEED_SHARED_PTE,
> > > > @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > >         return 0;
> > > >  }
> > > >
> > > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > > +                                  unsigned long address,
> > > > +                                  pmd_t **pmd)
> > > > +{
> > > > +       pmd_t pmde;
> > > > +
> > > > +       *pmd = mm_find_pmd_raw(mm, address);
> > > > +       if (!*pmd)
> > > > +               return SCAN_PMD_NULL;
> > > > +
> > > > +       pmde = pmd_read_atomic(*pmd);
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > > +       barrier();
> > > > +#endif
> > > > +       if (!pmd_present(pmde))
> > > > +               return SCAN_PMD_NULL;
> > > > +       if (pmd_trans_huge(pmde))
> > > > +               return SCAN_PMD_MAPPED;
> > > > +       if (pmd_bad(pmde))
> > > > +               return SCAN_FAIL;
> > >
> > > khugepaged doesn't handle pmd_bad before, IIRC it may just return
> > > SCAN_SUCCEED if everything else is good? It is fine to add it, but it
> > > may be better to return SCAN_PMD_NULL?
> >
> > Correct, pmd_bad() wasn't handled before. I actually don't know how a
> > bad pmd might arise in the wild (would love to actually know this),
> > but I don't see the check hurting (might be overly conservative
> > though).  Conversely, I'm not sure where things go astray currently if
> > the pmd is bad. Guess it depends in what way the flags are mutated.
> > Returning SCAN_PMD_NULL SGTM.
> >
> > >
> > > > +       return SCAN_SUCCEED;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Bring missing pages in from swap, to complete THP collapse.
> > > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > > @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > >
> > > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > >
> > > > -       pmd = mm_find_pmd(mm, address);
> > > > -       if (!pmd) {
> > > > -               result = SCAN_PMD_NULL;
> > > > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > +       if (result != SCAN_SUCCEED)
> > >
> > > There are a couple of other callsites for mm_find_pmd(), you may need
> > > to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE
> > > since khugepaged may collapse the area before MADV_COLLAPSE
> > > reacquiring mmap_lock IIUC and MADV_COLLAPSE does care about this case. It
> > > is fine w/o MADV_COLLAPSE since khugepaged doesn't care if it is PMD
> > > mapped or not.
> >
> > Ya, I was just questioning the same thing after responding above - at
> > least w.r.t whether the pmd_bad() also needs to be in these callsites
> > (check for pmd mapping, as you mention, I think is definitely
> > necessary). Thanks for catching this!
> >
> > > So it may be better to move this patch right before MADV_COLLAPSE is introduced?
> >
> > I think this should be ok - I'll give it a try at least.
> >
> > Again, thank you for taking the time to thoroughly review this.
> >
> > Best,
> > Zach
> >
> > > >                 goto out;
> > > > -       }
> > > >
> > > >         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> > > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 04fac1af870b..c9979c6ad7a1 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> > > >         return vma_address(page, vma);
> > > >  }
> > > >
> > > > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
> > >
> > > May be better to have some notes for mm_find_pmd_raw() and mm_find_pmd().
> > >
>
> Agreed. Looking over this code again, there are only 3 users of mm_find_pmd():
>
> 1) khugepaged
> 2) ksm (replace_page())
> 3) split_huge_pmd_address()
>
> Once khugepaged codepaths care about THP-pmds, ksm is the only
> remaining user that really wants a pte-mapping pmd.
>
> I've gone and consolidated the open-coded code in
> split_huge_pmd_address() to use the mm_find_pmd_raw().
>
> I've also done a name switch:
>
> mm_find_pmd() -> mm_find_pte_pmd()
> mm_find_pmd_raw() -> mm_find_pmd()

If ksm is the only user of *current* mm_find_pmd(), I think you should
be able to open code it w/o introducing mm_find_pte_pmd() and revert
mm_find_pmd() to its *old* behavior.
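
Roughly what I have in mind - just a sketch, with the helper name and
placement made up here, reusing pmd_read_atomic() as quoted elsewhere
in this thread:

/*
 * Hypothetical open-coded check in mm/ksm.c's replace_page(); after
 * the revert, mm_find_pmd() is again the bare page-table walk.
 */
static pmd_t *replace_page_get_pmd(struct mm_struct *mm, unsigned long addr)
{
        pmd_t *pmd = mm_find_pmd(mm, addr);     /* bare walk */
        pmd_t pmde;

        if (!pmd)
                return NULL;

        pmde = pmd_read_atomic(pmd);
        /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
        barrier();
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return NULL;    /* ksm only wants a pmd that maps ptes */

        return pmd;
}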

>
> This basically reverts mm_find_pmd() to its pre commit f72e7dcdd252
> ("mm: let mm_find_pmd fix buggy race with THP fault")
> behavior, and special cases (what will be, after MADV_COLLAPSE file
> support) the only remaining callsite which *doesn't* care about
> THP-pmds (ksm). The naming here is a little more meaningful than
> "*raw", and IMHO more readable.
>
>
> > > >  {
> > > >         pgd_t *pgd;
> > > >         p4d_t *p4d;
> > > >         pud_t *pud;
> > > >         pmd_t *pmd = NULL;
> > > > -       pmd_t pmde;
> > > >
> > > >         pgd = pgd_offset(mm, address);
> > > >         if (!pgd_present(*pgd))
> > > > @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > >                 goto out;
> > > >
> > > >         pmd = pmd_offset(pud, address);
> > > > +out:
> > > > +       return pmd;
> > > > +}
> > > > +
> > > > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > +{
> > > > +       pmd_t pmde;
> > > > +       pmd_t *pmd;
> > > > +
> > > > +       pmd = mm_find_pmd_raw(mm, address);
> > > > +       if (!pmd)
> > > > +               goto out;
> > > >         /*
> > > >          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > > >          * without holding anon_vma lock for write.  So when looking for a
> > > > --
> > > > 2.36.1.255.ge46751e96f-goog
> > > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-06 23:53   ` Yang Shi
@ 2022-06-07 22:48     ` Zach O'Keefe
  2022-06-08  0:39       ` Yang Shi
  0 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-07 22:48 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > This idea was introduced by David Rientjes[1].
> >
> > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > synchronous collapse of memory at their own expense.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the
> >   THP
> > * Avoid unpredictable timing of khugepaged collapse
> >
> > An immediate user of this new functionality are malloc() implementations
> > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > zapping the pmd.  Later, when the memory is hot, the implementation
> > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > hugepage coverage and dTLB performance.  TCMalloc is such an
> > implementation that could benefit from this[2].
> >
> > Only privately-mapped anon memory is supported for now, but it is
> > expected that file and shmem support will be added later to support the
> > use-case of backing executable text by THPs.  Current support provided
> > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > which might impair services from serving at their full rated load after
> > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > immediately realize iTLB performance prevents page sharing and demand
> > paging, both of which increase steady state memory footprint.  With
> > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > and lower RAM footprints.
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE.
> >
> > THP allocation may enter direct reclaim and/or compaction.
> >
> > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> >
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> >  arch/mips/include/uapi/asm/mman.h      |   2 +
> >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> >  include/linux/huge_mm.h                |  12 +++
> >  include/uapi/asm-generic/mman-common.h |   2 +
> >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> >  mm/madvise.c                           |   5 +
> >  8 files changed, 151 insertions(+)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 4aa996423b0d..763929e814e9 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -76,6 +76,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index 1be428663c10..c6e1fc77c996 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -103,6 +103,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index a7ea3204a5fa..22133a6a506e 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -70,6 +70,8 @@
> >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> >
> > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > +
> >  #define MADV_HWPOISON     100          /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 7966a58af472..1ff0c858544f 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -111,6 +111,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 648cb3ce7099..2ca2f3b41fc8 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >
> >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +                    struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >         BUG();
> >         return 0;
> >  }
> > +
> > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > +                                  struct vm_area_struct **prev,
> > +                                  unsigned long start, unsigned long end)
> > +{
> > +       BUG();
> > +       return 0;
>
> I wish -ENOSYS could have been returned, but it seems madvise()
> doesn't support this return value.

This is somewhat tangential, but I agree that ENOSYS (or some other
errno - ENOSYS makes the most sense to me after EINVAL, though ENOTSUP
is another candidate) should be anointed the dedicated return value
for "madvise mode not supported". Ran into this recently when wanting
some form of feature detection for MADV_COLLAPSE, where EINVAL is
overloaded (including "madvise mode not supported"). Happy to move
this forward if others agree.
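
To illustrate the pain point, a hypothetical userspace probe (the
MADV_COLLAPSE value is the one proposed by this series; the fallback
define is only for headers that don't have it yet):

#include <sys/mman.h>
#include <errno.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* value proposed by this series */
#endif

/*
 * Returns 1 if the kernel clearly knows MADV_COLLAPSE, 0 if we can't
 * tell: EINVAL is overloaded, so "madvise mode not supported" is
 * indistinguishable from "bad arguments" without a dedicated errno.
 */
static int madv_collapse_known(void *addr, size_t len)
{
        if (!madvise(addr, len, MADV_COLLAPSE))
                return 1;
        return errno == EINVAL ? 0 : 1;
}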

> > +}
> > +
> >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >                                          unsigned long start,
> >                                          unsigned long end,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4ad04f552347..073d6bb03b37 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> >                 set_recommended_min_free_kbytes();
> >         mutex_unlock(&khugepaged_mutex);
> >  }
> > +
> > +static int madvise_collapse_errno(enum scan_result r)
> > +{
> > +       switch (r) {
> > +       case SCAN_PMD_NULL:
> > +       case SCAN_ADDRESS_RANGE:
> > +       case SCAN_VMA_NULL:
> > +       case SCAN_PTE_NON_PRESENT:
> > +       case SCAN_PAGE_NULL:
> > +               /*
> > +                * Addresses in the specified range are not currently mapped,
> > +                * or are outside the AS of the process.
> > +                */
> > +               return -ENOMEM;
> > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +       case SCAN_CGROUP_CHARGE_FAIL:
> > +               /* A kernel resource was temporarily unavailable. */
> > +               return -EAGAIN;
>
> I thought this should return -ENOMEM too.

Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?

At least going by the comment above do_madvise(), and by the man
pages, the ENOMEM description ("Addresses in the specified range are
not currently mapped, or are outside the address space of the
process.") doesn't really apply here (though I don't know if "A kernel
resource was temporarily unavailable" applies any better).

That said, should we differentiate between allocation and charging
failure? At least in the case of a userspace agent using
process_madvise(2) to collapse memory on behalf of others, knowing
"this memcg is at its limit" vs "no THPs available" would be valuable.
Maybe the former should be EBUSY?
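
I.e., something along these lines - purely a sketch of that suggestion
(not what v6 does), splitting the two cases in madvise_collapse_errno():

        case SCAN_ALLOC_HUGE_PAGE_FAIL:
                /* No THP could be allocated right now. */
                return -EAGAIN;
        case SCAN_CGROUP_CHARGE_FAIL:
                /* The memcg is at its limit. */
                return -EBUSY;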

> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end)
> > +{
> > +       struct collapse_control cc = {
> > +               .enforce_page_heuristics = false,
> > +               .enforce_thp_enabled = false,
> > +               .last_target_node = NUMA_NO_NODE,
> > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > +       };
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long hstart, hend, addr;
> > +       int thps = 0, last_fail = SCAN_FAIL;
> > +       bool mmap_locked = true;
> > +
> > +       BUG_ON(vma->vm_start > start);
> > +       BUG_ON(vma->vm_end < end);
> > +
> > +       *prev = vma;
> > +
> > +       /* TODO: Support file/shmem */
> > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > +               return -EINVAL;
> > +
> > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > +       hend = end & HPAGE_PMD_MASK;
> > +
> > +       /*
> > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > +        * Note that hugepage_vma_check() doesn't enforce that
> > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > +        * must be set (i.e. "never" mode)
> > +        */
> > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
>
> hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> running for an unsuitable vma, hugepage_vma_revalidate() called by
> khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> of effort. So, it is better to check vma size upfront.

This actually does check the vma size, but it's subtle. hstart and
hend are clamped to the first/last hugepage-aligned address covered by
[start,end], which are themselves contained in
vma->vm_start/vma->vm_end, respectively. We then check that
addr = hstart < hend; so if the main loop passes the first check, we
know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=
vma->vm_end. Agreed that we might be doing mmgrab() and
lru_add_drain() needlessly though.
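
Concretely, with made-up addresses and a 2M HPAGE_PMD_SIZE:

/*
 * vma spans [0x200000, 0x700000); caller passes start = 0x250000 and
 * end = 0x680000 (the BUG_ON()s above guarantee both lie inside the vma).
 *
 *   hstart = (0x250000 + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK = 0x400000
 *   hend   =  0x680000 & HPAGE_PMD_MASK                    = 0x600000
 *
 * hstart < hend, and [0x400000, 0x600000) lies fully inside the vma,
 * so every loop iteration covers a hugepage-aligned range the vma
 * maps; if the clamped window is empty, the loop body never runs.
 */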

> BTW, my series moved the vma size check in hugepage_vma_check(), so if
> your series could be based on top of that, you get that for free.

I'll try rebasing on top of your series, thank you!

> > +               return -EINVAL;
> > +
> > +       mmgrab(mm);
> > +       lru_add_drain();
> > +
> > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > +               int result = SCAN_FAIL;
> > +               bool retry = true;  /* Allow one retry per hugepage */
> > +retry:
> > +               if (!mmap_locked) {
> > +                       cond_resched();
> > +                       mmap_read_lock(mm);
"> > +                       mmap_locked = true;
> > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
>
> How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> It seems more consistent.

Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
enum scan_result codes back to callers"

> > +                       if (result) {
> > +                               last_fail = result;
> > +                               goto out_nolock;
> > +                       }
> > +               }
> > +               mmap_assert_locked(mm);
> > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > +               if (!mmap_locked)
> > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > +
> > +               switch (result) {
> > +               case SCAN_SUCCEED:
> > +               case SCAN_PMD_MAPPED:
> > +                       ++thps;
> > +                       break;
> > +               /* Whitelisted set of results where continuing OK */
> > +               case SCAN_PMD_NULL:
> > +               case SCAN_PTE_NON_PRESENT:
> > +               case SCAN_PTE_UFFD_WP:
> > +               case SCAN_PAGE_RO:
> > +               case SCAN_LACK_REFERENCED_PAGE:
> > +               case SCAN_PAGE_NULL:
> > +               case SCAN_PAGE_COUNT:
> > +               case SCAN_PAGE_LOCK:
> > +               case SCAN_PAGE_COMPOUND:
> > +                       last_fail = result;
> > +                       break;
> > +               case SCAN_PAGE_LRU:
> > +                       if (retry) {
> > +                               lru_add_drain_all();
> > +                               retry = false;
> > +                               goto retry;
>
> I'm not sure whether the retry logic is necessary or not, do you have
> any data about how retry improves the success rate? You could just
> replace lru_add_drain() to lru_add_drain_all() and remove the retry
> logic IMHO. I'd prefer to keep it simple at the moment personally.

To be transparent, I've only had success hitting this logic on small
VMs under selftests.  That said, it does happen, and I can't imagine
this hurting, especially on larger systems + tasks using lots of mem.
Originally, I didn't plan to do this, but as things shook out and we
had SCAN_PAGE_LRU so readily available, it seemed like we got this for
free.

> > +                       }
> > +                       fallthrough;
> > +               default:
> > +                       last_fail = result;
> > +                       /* Other error, exit */
> > +                       goto out_maybelock;
> > +               }
> > +       }
> > +
> > +out_maybelock:
> > +       /* Caller expects us to hold mmap_lock on return */
> > +       if (!mmap_locked)
> > +               mmap_read_lock(mm);
> > +out_nolock:
> > +       mmap_assert_locked(mm);
> > +       mmdrop(mm);
> > +
> > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > +                       : madvise_collapse_errno(last_fail);
> > +}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 46feb62ce163..eccac2620226 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_FREE:
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> > +       case MADV_COLLAPSE:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_COLLAPSE:
> > +               return madvise_collapse(vma, prev, start, end);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> > +       case MADV_COLLAPSE:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > --
> > 2.36.1.255.ge46751e96f-goog
> >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-06-07 21:27         ` Yang Shi
@ 2022-06-08  0:27           ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-08  0:27 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jun 7, 2022 at 2:27 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Jun 7, 2022 at 12:33 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Tue, Jun 7, 2022 at 9:01 AM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > On Mon, Jun 6, 2022 at 1:46 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > > > SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> > > > > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > > > > file-collapse path, since the latter might identify pte-mapped compound
> > > > > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > > > > know what hugepage-aligned/sized regions are already pmd-mapped.
> > > > >
> > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > ---
> > > > >  include/trace/events/huge_memory.h |  1 +
> > > > >  mm/internal.h                      |  1 +
> > > > >  mm/khugepaged.c                    | 32 ++++++++++++++++++++++++++----
> > > > >  mm/rmap.c                          | 15 ++++++++++++--
> > > > >  4 files changed, 43 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > index d651f3437367..55392bf30a03 100644
> > > > > --- a/include/trace/events/huge_memory.h
> > > > > +++ b/include/trace/events/huge_memory.h
> > > > > @@ -11,6 +11,7 @@
> > > > >         EM( SCAN_FAIL,                  "failed")                       \
> > > > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > > > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > > > +       EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > > > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > > > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > > >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > index 6e14749ad1e5..f768c7fae668 100644
> > > > > --- a/mm/internal.h
> > > > > +++ b/mm/internal.h
> > > > > @@ -188,6 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> > > > >  /*
> > > > >   * in mm/rmap.c:
> > > > >   */
> > > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> > > > >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> > > > >
> > > > >  /*
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index cc3d6fb446d5..7a914ca19e96 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -28,6 +28,7 @@ enum scan_result {
> > > > >         SCAN_FAIL,
> > > > >         SCAN_SUCCEED,
> > > > >         SCAN_PMD_NULL,
> > > > > +       SCAN_PMD_MAPPED,
> > > > >         SCAN_EXCEED_NONE_PTE,
> > > > >         SCAN_EXCEED_SWAP_PTE,
> > > > >         SCAN_EXCEED_SHARED_PTE,
> > > > > @@ -901,6 +902,31 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > > > +                                  unsigned long address,
> > > > > +                                  pmd_t **pmd)
> > > > > +{
> > > > > +       pmd_t pmde;
> > > > > +
> > > > > +       *pmd = mm_find_pmd_raw(mm, address);
> > > > > +       if (!*pmd)
> > > > > +               return SCAN_PMD_NULL;
> > > > > +
> > > > > +       pmde = pmd_read_atomic(*pmd);
> > > > > +
> > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > > > +       barrier();
> > > > > +#endif
> > > > > +       if (!pmd_present(pmde))
> > > > > +               return SCAN_PMD_NULL;
> > > > > +       if (pmd_trans_huge(pmde))
> > > > > +               return SCAN_PMD_MAPPED;
> > > > > +       if (pmd_bad(pmde))
> > > > > +               return SCAN_FAIL;
> > > >
> > > > khugepaged doesn't handle pmd_bad before, IIRC it may just return
> > > > SCAN_SUCCEED if everything else is good? It is fine to add it, but it
> > > > may be better to return SCAN_PMD_NULL?
> > >
> > > Correct, pmd_bad() wasn't handled before. I actually don't know how a
> > > bad pmd might arise in the wild (would love to actually know this),
> > > but I don't see the check hurting (might be overly conservative
> > > though).  Conversely, I'm not sure where things go astray currently if
> > > the pmd is bad. Guess it depends in what way the flags are mutated.
> > > Returning SCAN_PMD_NULL SGTM.
> > >
> > > >
> > > > > +       return SCAN_SUCCEED;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Bring missing pages in from swap, to complete THP collapse.
> > > > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > > > @@ -1146,11 +1172,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > > >
> > > > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > > >
> > > > > -       pmd = mm_find_pmd(mm, address);
> > > > > -       if (!pmd) {
> > > > > -               result = SCAN_PMD_NULL;
> > > > > +       result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > > > +       if (result != SCAN_SUCCEED)
> > > >
> > > > There are a couple of other callsites for mm_find_pmd(), you may need
> > > > to change all of them to find_pmd_or_thp_or_none() for MADV_COLLAPSE
> > > > since khugepaged may collapse the area before MADV_COLLAPSE
> > > > reacquiring mmap_lock IIUC and MADV_COLLAPSE does care about this case. It
> > > > is fine w/o MADV_COLLAPSE since khugepaged doesn't care if it is PMD
> > > > mapped or not.
> > >
> > > Ya, I was just questioning the same thing after responding above - at
> > > least w.r.t whether the pmd_bad() also needs to be in these callsites
> > > (check for pmd mapping, as you mention, I think is definitely
> > > necessary). Thanks for catching this!
> > >
> > > > So it may be better to move this patch right before MADV_COLLAPSE is introduced?
> > >
> > > I think this should be ok - I'll give it a try at least.
> > >
> > > Again, thank you for taking the time to thoroughly review this.
> > >
> > > Best,
> > > Zach
> > >
> > > > >                 goto out;
> > > > > -       }
> > > > >
> > > > >         memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> > > > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 04fac1af870b..c9979c6ad7a1 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -767,13 +767,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> > > > >         return vma_address(page, vma);
> > > > >  }
> > > > >
> > > > > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
> > > >
> > > > May be better to have some notes for mm_find_pmd_raw() and mm_find_pmd().
> > > >
> >
> > Agreed. Looking over this code again, there are only 3 users of mm_find_pmd():
> >
> > 1) khugepaged
> > 2) ksm (replace_page())
> > 3) split_huge_pmd_address()
> >
> > Once khugepaged codepaths care about THP-pmds, ksm is the only
> > remaining user that really wants a pte-mapping pmd.
> >
> > I've gone and consolidated the open-coded code in
> > split_huge_pmd_address() to use the mm_find_pmd_raw().
> >
> > I've also done a name switch:
> >
> > mm_find_pmd() -> mm_find_pte_pmd()
> > mm_find_pmd_raw() -> mm_find_pmd()
>
> If ksm is the only user of *current* mm_find_pmd(), I think you should
> be able to open code it w/o introducing mm_find_pte_pmd() and revert
> mm_find_pmd() to its *old* behavior.

SGTM. Tried it out and it looks fine. Thanks for the suggestion.

> >
> > This basically reverts mm_find_pmd() to its pre commit f72e7dcdd252
> > ("mm: let mm_find_pmd fix buggy race with THP fault")
> > behavior, and special cases (what will be, after MADV_COLLAPSE file
> > support) the only remaining callsite which *doesn't* care about
> > THP-pmds (ksm). The naming here is a little more meaningful than
> > "*raw", and IMHO more readable.
> >
> >
> > > > >  {
> > > > >         pgd_t *pgd;
> > > > >         p4d_t *p4d;
> > > > >         pud_t *pud;
> > > > >         pmd_t *pmd = NULL;
> > > > > -       pmd_t pmde;
> > > > >
> > > > >         pgd = pgd_offset(mm, address);
> > > > >         if (!pgd_present(*pgd))
> > > > > @@ -788,6 +787,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > >                 goto out;
> > > > >
> > > > >         pmd = pmd_offset(pud, address);
> > > > > +out:
> > > > > +       return pmd;
> > > > > +}
> > > > > +
> > > > > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > > > > +{
> > > > > +       pmd_t pmde;
> > > > > +       pmd_t *pmd;
> > > > > +
> > > > > +       pmd = mm_find_pmd_raw(mm, address);
> > > > > +       if (!pmd)
> > > > > +               goto out;
> > > > >         /*
> > > > >          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> > > > >          * without holding anon_vma lock for write.  So when looking for a
> > > > > --
> > > > > 2.36.1.255.ge46751e96f-goog
> > > > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-07 22:48     ` Zach O'Keefe
@ 2022-06-08  0:39       ` Yang Shi
  2022-06-09 17:35         ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-08  0:39 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jun 7, 2022 at 3:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > This idea was introduced by David Rientjes[1].
> > >
> > > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > > synchronous collapse of memory at their own expense.
> > >
> > > The benefits of this approach are:
> > >
> > > * CPU is charged to the process that wants to spend the cycles for the
> > >   THP
> > > * Avoid unpredictable timing of khugepaged collapse
> > >
> > > An immediate user of this new functionality are malloc() implementations
> > > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > > zapping the pmd.  Later, when the memory is hot, the implementation
> > > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > > hugepage coverage and dTLB performance.  TCMalloc is such an
> > > implementation that could benefit from this[2].
> > >
> > > Only privately-mapped anon memory is supported for now, but it is
> > > expected that file and shmem support will be added later to support the
> > > use-case of backing executable text by THPs.  Current support provided
> > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > > which might impair services from serving at their full rated load after
> > > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > > immediately realize iTLB performance prevents page sharing and demand
> > > paging, both of which increase steady state memory footprint.  With
> > > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > > and lower RAM footprints.
> > >
> > > This call is independent of the system-wide THP sysfs settings, but will
> > > fail for memory marked VM_NOHUGEPAGE.
> > >
> > > THP allocation may enter direct reclaim and/or compaction.
> > >
> > > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> > >
> > > Suggested-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> > >  arch/mips/include/uapi/asm/mman.h      |   2 +
> > >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> > >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> > >  include/linux/huge_mm.h                |  12 +++
> > >  include/uapi/asm-generic/mman-common.h |   2 +
> > >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> > >  mm/madvise.c                           |   5 +
> > >  8 files changed, 151 insertions(+)
> > >
> > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > index 4aa996423b0d..763929e814e9 100644
> > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > @@ -76,6 +76,8 @@
> > >
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > >
> > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > index 1be428663c10..c6e1fc77c996 100644
> > > --- a/arch/mips/include/uapi/asm/mman.h
> > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > @@ -103,6 +103,8 @@
> > >
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > >
> > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > index a7ea3204a5fa..22133a6a506e 100644
> > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > @@ -70,6 +70,8 @@
> > >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> > >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> > >
> > > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > > +
> > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > >
> > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > index 7966a58af472..1ff0c858544f 100644
> > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > @@ -111,6 +111,8 @@
> > >
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 648cb3ce7099..2ca2f3b41fc8 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > >
> > >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > >                      int advice);
> > > +int madvise_collapse(struct vm_area_struct *vma,
> > > +                    struct vm_area_struct **prev,
> > > +                    unsigned long start, unsigned long end);
> > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > >                            unsigned long end, long adjust_next);
> > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > >         BUG();
> > >         return 0;
> > >  }
> > > +
> > > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > > +                                  struct vm_area_struct **prev,
> > > +                                  unsigned long start, unsigned long end)
> > > +{
> > > +       BUG();
> > > +       return 0;
> >
> > I wish -ENOSYS could have been returned, but it seems madvise()
> > doesn't support this return value.
>
> This is somewhat tangential, but I agree that ENOSYS (or some other
> errno, but ENOSYS makes most sense to me, after EINVAL, (ENOTSUP?))
> should be anointed the dedicated return value for "madvise mode not
> supported". Ran into this recently when wanting some form of feature
> detection for MADV_COLLAPSE where EINVAL is overloaded  (including
> madvise mode not supported). Happy to move this forward if others
> agree.

I did a quick test by calling MADV_HUGEPAGE on a !THP kernel: madvise()
actually returns -EINVAL from madvise_behavior_valid(), so
madvise_collapse() won't be called at all. So the madvise_collapse()
stub is basically only there to keep the !THP build happy.

I think we could just return -EINVAL.
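
For reference, a userspace sketch of that quick test (the assert on
EINVAL only holds on a kernel built without CONFIG_TRANSPARENT_HUGEPAGE;
the mapping size is arbitrary):

#include <sys/mman.h>
#include <errno.h>
#include <assert.h>

int main(void)
{
        unsigned long len = 2UL << 20;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        assert(p != MAP_FAILED);
        /* Rejected by madvise_behavior_valid() before any THP code runs. */
        assert(madvise(p, len, MADV_HUGEPAGE) == -1 && errno == EINVAL);
        return 0;
}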

>
> > > +}
> > > +
> > >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > >                                          unsigned long start,
> > >                                          unsigned long end,
> > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > > --- a/include/uapi/asm-generic/mman-common.h
> > > +++ b/include/uapi/asm-generic/mman-common.h
> > > @@ -77,6 +77,8 @@
> > >
> > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > >
> > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > +
> > >  /* compatibility flags */
> > >  #define MAP_FILE       0
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 4ad04f552347..073d6bb03b37 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> > >                 set_recommended_min_free_kbytes();
> > >         mutex_unlock(&khugepaged_mutex);
> > >  }
> > > +
> > > +static int madvise_collapse_errno(enum scan_result r)
> > > +{
> > > +       switch (r) {
> > > +       case SCAN_PMD_NULL:
> > > +       case SCAN_ADDRESS_RANGE:
> > > +       case SCAN_VMA_NULL:
> > > +       case SCAN_PTE_NON_PRESENT:
> > > +       case SCAN_PAGE_NULL:
> > > +               /*
> > > +                * Addresses in the specified range are not currently mapped,
> > > +                * or are outside the AS of the process.
> > > +                */
> > > +               return -ENOMEM;
> > > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > > +       case SCAN_CGROUP_CHARGE_FAIL:
> > > +               /* A kernel resource was temporarily unavailable. */
> > > +               return -EAGAIN;
> >
> > I thought this should return -ENOMEM too.
>
> Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?

No, I mean both.

>
> At least going by the comment above do_madvise(), and in the man
> pages, for ENOMEM: "Addresses in the specified range are not currently
> mapped, or are outside the address space of the process." doesn't
> really apply here (though I don't know if "A kernel resource was
> temporarily unavailable" applies any better).

Yes, the man page does say so. But IIRC some MADV_ operations do
return -ENOMEM for memory allocation failure, for example,
MADV_POPULATE_READ/WRITE. Typically the man pages don't cover all
cases.

>
> That said, should we differentiate between allocation and charging
> failure? At least in the case of a userspace agent using
> process_madvise(2) to collapse memory on behalf of others, knowing
> "this memcg is at its limit" vs "no THPs available" would be valuable.
> Maybe the former should be EBUSY?

IMHO we don't have to differentiate allocation and charging.

>
> > > +       default:
> > > +               return -EINVAL;
> > > +       }
> > > +}
> > > +
> > > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > +                    unsigned long start, unsigned long end)
> > > +{
> > > +       struct collapse_control cc = {
> > > +               .enforce_page_heuristics = false,
> > > +               .enforce_thp_enabled = false,
> > > +               .last_target_node = NUMA_NO_NODE,
> > > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > > +       };
> > > +       struct mm_struct *mm = vma->vm_mm;
> > > +       unsigned long hstart, hend, addr;
> > > +       int thps = 0, last_fail = SCAN_FAIL;
> > > +       bool mmap_locked = true;
> > > +
> > > +       BUG_ON(vma->vm_start > start);
> > > +       BUG_ON(vma->vm_end < end);
> > > +
> > > +       *prev = vma;
> > > +
> > > +       /* TODO: Support file/shmem */
> > > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > +               return -EINVAL;
> > > +
> > > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > > +       hend = end & HPAGE_PMD_MASK;
> > > +
> > > +       /*
> > > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > +        * Note that hugepage_vma_check() doesn't enforce that
> > > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > +        * must be set (i.e. "never" mode)
> > > +        */
> > > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> >
> > hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> > running for a unsuitable vma, hugepage_vma_revalidate() called by
> > khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> > of effort. So, it is better to check vma size upfront.
>
> This actually does check the vma size, but it's subtle. hstart and
> hend are clamped to the first/last
> hugepage-aligned address covered by [start,end], which are themselves
> contained in vma->vm_start/vma->vm_end, respectively. We then check
> that addr = hstart < hend ; so if the main loop passes the first
> check, we know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=

Aha, yes, I overlooked that.
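
Spelling it out with made-up addresses (2M hugepages), say the range
passed in is [0x1ff000, 0x601000):

    hstart = (0x1ff000 + 0x1fffff) & ~0x1fffff = 0x200000
    hend   =  0x601000             & ~0x1fffff = 0x600000

so the loop visits addr = 0x200000 and 0x400000, and each
[addr, addr + 2M) lies fully inside the VMA. For anything smaller than
one aligned hugepage, e.g. [0x1ff000, 0x3ff000), hstart == hend ==
0x200000 and the loop body never runs, which is the implicit size check.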

> vma->vm_end. Agreed that we might be doing mmgrab() and
> lru_add_drain() needlessly though.

Yeah

>
> > BTW, my series moved the vma size check in hugepage_vma_check(), so if
> > your series could be based on top of that, you get that for free.
>
> I'll try rebasing on top of your series, thank you!

You don't have to do it right now. I don't know what series will be
merged to mm tree first. Just a heads up.

>
> > > +               return -EINVAL;
> > > +
> > > +       mmgrab(mm);
> > > +       lru_add_drain();
> > > +
> > > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > > +               int result = SCAN_FAIL;
> > > +               bool retry = true;  /* Allow one retry per hugepage */
> > > +retry:
> > > +               if (!mmap_locked) {
> > > +                       cond_resched();
> > > +                       mmap_read_lock(mm);
> "> > +                       mmap_locked = true;
> > > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
> >
> > How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> > It seems more consistent.
>
> Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
> enum scan_result codes back to callers"
>
> > > +                       if (result) {
> > > +                               last_fail = result;
> > > +                               goto out_nolock;
> > > +                       }
> > > +               }
> > > +               mmap_assert_locked(mm);
> > > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > > +               if (!mmap_locked)
> > > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > +
> > > +               switch (result) {
> > > +               case SCAN_SUCCEED:
> > > +               case SCAN_PMD_MAPPED:
> > > +                       ++thps;
> > > +                       break;
> > > +               /* Whitelisted set of results where continuing OK */
> > > +               case SCAN_PMD_NULL:
> > > +               case SCAN_PTE_NON_PRESENT:
> > > +               case SCAN_PTE_UFFD_WP:
> > > +               case SCAN_PAGE_RO:
> > > +               case SCAN_LACK_REFERENCED_PAGE:
> > > +               case SCAN_PAGE_NULL:
> > > +               case SCAN_PAGE_COUNT:
> > > +               case SCAN_PAGE_LOCK:
> > > +               case SCAN_PAGE_COMPOUND:
> > > +                       last_fail = result;
> > > +                       break;
> > > +               case SCAN_PAGE_LRU:
> > > +                       if (retry) {
> > > +                               lru_add_drain_all();
> > > +                               retry = false;
> > > +                               goto retry;
> >
> > I'm not sure whether the retry logic is necessary or not, do you have
> > any data about how retry improves the success rate? You could just
> > replace lru_add_drain() to lru_add_drain_all() and remove the retry
> > logic IMHO. I'd prefer to keep it simple at the moment personally.
>
> Transparently, I've only had success hitting this logic on small vms
> under selftests.  That said, it does happen, and I can't imagine this
> hurting, especially on larger systems + tasks using lots of mem.
> Originally, I didn't plan to do this, but as things shook out and we
> had SCAN_PAGE_LRU so readily available, it seemed like we got this for
> free.

"small vms" mean small virtual machines?

When the logic is hit, does lru_add_drain_all() help to improve the
success rate?

I don't mean this hurts anything. I'm just thinking about whether the
extra complexity is worth it or not. And calling lru_add_drain_all()
while holding mmap_lock might have some scalability issues since
draining lru for all is not cheap.


>
> > > +                       }
> > > +                       fallthrough;
> > > +               default:
> > > +                       last_fail = result;
> > > +                       /* Other error, exit */
> > > +                       goto out_maybelock;
> > > +               }
> > > +       }
> > > +
> > > +out_maybelock:
> > > +       /* Caller expects us to hold mmap_lock on return */
> > > +       if (!mmap_locked)
> > > +               mmap_read_lock(mm);
> > > +out_nolock:
> > > +       mmap_assert_locked(mm);
> > > +       mmdrop(mm);
> > > +
> > > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > > +                       : madvise_collapse_errno(last_fail);
> > > +}
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 46feb62ce163..eccac2620226 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> > >         case MADV_FREE:
> > >         case MADV_POPULATE_READ:
> > >         case MADV_POPULATE_WRITE:
> > > +       case MADV_COLLAPSE:
> > >                 return 0;
> > >         default:
> > >                 /* be safe, default to 1. list exceptions explicitly */
> > > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > >                 if (error)
> > >                         goto out;
> > >                 break;
> > > +       case MADV_COLLAPSE:
> > > +               return madvise_collapse(vma, prev, start, end);
> > >         }
> > >
> > >         anon_name = anon_vma_name(vma);
> > > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >         case MADV_HUGEPAGE:
> > >         case MADV_NOHUGEPAGE:
> > > +       case MADV_COLLAPSE:
> > >  #endif
> > >         case MADV_DONTDUMP:
> > >         case MADV_DODUMP:
> > > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > >   *             transparent huge pages so the existing pages will not be
> > >   *             coalesced into THP and new pages will not be allocated as THP.
> > > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > >   *             from being included in its core dump.
> > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > --
> > > 2.36.1.255.ge46751e96f-goog
> > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-06 23:53           ` Yang Shi
@ 2022-06-08  0:42             ` Zach O'Keefe
  -1 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-08  0:42 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, kernel test robot, Alex Shi, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Michal Hocko, Pasha Tatashin,
	Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Zi Yan, Linux MM, kbuild-all, Andrea Arcangeli, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

On Mon, Jun 6, 2022 at 4:54 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 3:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Mon, 6 Jun 2022 09:40:20 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> >
> > > On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> > > >
> > > > Hi Zach,
> > > >
> > > > Thank you for the patch! Perhaps something to improve:
> > > >
> > > > [auto build test WARNING on akpm-mm/mm-everything]
> > > >
> > > > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > > > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > > > reproduce (this is a W=1 build):
> > > >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> > > >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> > > >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> > > >         # save the config file
> > > >         mkdir build_dir && cp config build_dir/.config
> > > >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> > > >
> > > > If you fix the issue, kindly add following tag where applicable
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > >
> > > > All warnings (new ones prefixed by >>):
> > > >
> > > >    mm/khugepaged.c: In function 'khugepaged':
> > > > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > > >     2284 | }
> > > >          | ^
> > >
> > > Thanks lkp@intel.com.
> > >
> > > This is due to config with:
> > >
> > > CONFIG_FRAME_WARN=2048
> > > CONFIG_NODES_SHIFT=10
> > >
> > > Where struct collapse_control has a member int
> > > node_load[MAX_NUMNODES], and we stack allocate one.
> > >
> > > Is this a configuration that needs to be supported? 1024 nodes seems
> > > like a lot and I'm not sure if these configs are randomly generated or
> > > are reminiscent of real systems.
> >
> > Adding 4k to the stack isn't a good thing to do.  It's trivial to
> > kmalloc the thing, so why not do that?
>
> Thanks, Andrew. Yeah, I just suggested that too.

Thanks Yang / Andrew for taking the time to voice your suggestions.

I'll go ahead and just kmalloc() the thing and fail if we can't.
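
(For scale: CONFIG_NODES_SHIFT=10 makes MAX_NUMNODES 1 << 10 = 1024, so
the on-stack node_load array alone is 1024 * sizeof(int) = 4096 bytes,
which accounts for essentially the entire 4160-byte frame the robot is
warning about.)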

Yang, is there a reason to kmalloc() the entire struct
collapse_control with trailing flex array vs stack allocating the
struct collapse_control + kmalloc()'ing the node_load array?


> >
> > I'll await some reviewer input (hopefully positive ;)) before merging
> > this series.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-08  0:42             ` Zach O'Keefe
@ 2022-06-08  1:00               ` Yang Shi
  -1 siblings, 0 replies; 63+ messages in thread
From: Yang Shi @ 2022-06-08  1:00 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Andrew Morton, kernel test robot, Alex Shi, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Michal Hocko, Pasha Tatashin,
	Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Zi Yan, Linux MM, kbuild-all, Andrea Arcangeli, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

On Tue, Jun 7, 2022 at 5:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Jun 6, 2022 at 4:54 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 3:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Mon, 6 Jun 2022 09:40:20 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> > >
> > > > On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> > > > >
> > > > > Hi Zach,
> > > > >
> > > > > Thank you for the patch! Perhaps something to improve:
> > > > >
> > > > > [auto build test WARNING on akpm-mm/mm-everything]
> > > > >
> > > > > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > > > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > > > > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > > > > reproduce (this is a W=1 build):
> > > > >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> > > > >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> > > > >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > > >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> > > > >         # save the config file
> > > > >         mkdir build_dir && cp config build_dir/.config
> > > > >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> > > > >
> > > > > If you fix the issue, kindly add following tag where applicable
> > > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > >
> > > > > All warnings (new ones prefixed by >>):
> > > > >
> > > > >    mm/khugepaged.c: In function 'khugepaged':
> > > > > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > > > >     2284 | }
> > > > >          | ^
> > > >
> > > > Thanks lkp@intel.com.
> > > >
> > > > This is due to config with:
> > > >
> > > > CONFIG_FRAME_WARN=2048
> > > > CONFIG_NODES_SHIFT=10
> > > >
> > > > Where struct collapse_control has a member int
> > > > node_load[MAX_NUMNODES], and we stack allocate one.
> > > >
> > > > Is this a configuration that needs to be supported? 1024 nodes seems
> > > > like a lot and I'm not sure if these configs are randomly generated or
> > > > are reminiscent of real systems.
> > >
> > > Adding 4k to the stack isn't a good thing to do.  It's trivial to
> > > kmalloc the thing, so why not do that?
> >
> > Thanks, Andrew. Yeah, I just suggested that too.
>
> Thanks Yang / Andrew for taking the time to voice your suggestions.
>
> I'll go ahead and just kmalloc() the thing and fail if we can't.
>
> Yang, is there a reason to kmalloc() the entire struct
> collapse_control with trailing flex array vs stack allocating the
> struct collapse_control + kmalloc()'ing the node_load array?

I don't think those two have too much difference. I don't have a
strong preference personally. However you could choose:

Define collapse_control as:
struct collapse_control {
    xxx;
    ...
    int node_load[MAX_NUMNODES];
}
Then you could kmalloc the whole struct.

Or it could be defined as:
struct collapse_control {
    xxx;
    ...
    int *node_load;
}
In this way you could allocate collapse_control on stack or by
kmalloc, then kmalloc node_load for all possible nodes instead of
MAX_NUMNODES. This may have a better success rate since you do
kmalloc much less memory (typically the number of possible nodes is
much less than MAX_NUMNODES), but it may not be worth it since the
error handling path is more complicated and it may not make too much
difference.

The first choice is definitely much simpler, you may want to try that first.
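
Roughly what I mean for the first option, just as a sketch (the fields
besides node_load are the ones from your series; the helper name here is
made up):

struct collapse_control {
	bool enforce_page_heuristics;
	bool enforce_thp_enabled;
	int last_target_node;
	gfp_t gfp;
	int node_load[MAX_NUMNODES];
};

/* khugepaged keeps a static instance, so no new failure path there */
static struct collapse_control khugepaged_collapse_control = {
	.last_target_node = NUMA_NO_NODE,
};

/* madvise(MADV_COLLAPSE) allocates one per call and can fail cleanly */
static struct collapse_control *alloc_collapse_control(gfp_t gfp)
{
	struct collapse_control *cc = kmalloc(sizeof(*cc), GFP_KERNEL);

	if (!cc)
		return NULL;
	cc->enforce_page_heuristics = false;
	cc->enforce_thp_enabled = false;
	cc->last_target_node = NUMA_NO_NODE;
	cc->gfp = gfp;
	return cc;
}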

>
>
> > >
> > > I'll await some reviewer input (hopefully positive ;)) before merging
> > > this series.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 03/15] mm/khugepaged: add struct collapse_control
  2022-06-08  1:00               ` Yang Shi
@ 2022-06-08  1:06                 ` Zach O'Keefe
  -1 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-08  1:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, kernel test robot, Alex Shi, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Michal Hocko, Pasha Tatashin,
	Peter Xu, Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Zi Yan, Linux MM, kbuild-all, Andrea Arcangeli, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin

On Tue, Jun 7, 2022 at 6:00 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Jun 7, 2022 at 5:43 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 4:54 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Mon, Jun 6, 2022 at 3:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Mon, 6 Jun 2022 09:40:20 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:
> > > >
> > > > > On Sun, Jun 5, 2022 at 7:42 PM kernel test robot <lkp@intel.com> wrote:
> > > > > >
> > > > > > Hi Zach,
> > > > > >
> > > > > > Thank you for the patch! Perhaps something to improve:
> > > > > >
> > > > > > [auto build test WARNING on akpm-mm/mm-everything]
> > > > > >
> > > > > > url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > > > > > config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20220606/202206060911.I8rRqGwC-lkp@intel.com/config)
> > > > > > compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
> > > > > > reproduce (this is a W=1 build):
> > > > > >         # https://github.com/intel-lab-lkp/linux/commit/d87b6065d6050b89930cca0814921aca7c269286
> > > > > >         git remote add linux-review https://github.com/intel-lab-lkp/linux
> > > > > >         git fetch --no-tags linux-review Zach-O-Keefe/mm-userspace-hugepage-collapse/20220606-012953
> > > > > >         git checkout d87b6065d6050b89930cca0814921aca7c269286
> > > > > >         # save the config file
> > > > > >         mkdir build_dir && cp config build_dir/.config
> > > > > >         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> > > > > >
> > > > > > If you fix the issue, kindly add following tag where applicable
> > > > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > > >
> > > > > > All warnings (new ones prefixed by >>):
> > > > > >
> > > > > >    mm/khugepaged.c: In function 'khugepaged':
> > > > > > >> mm/khugepaged.c:2284:1: warning: the frame size of 4160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> > > > > >     2284 | }
> > > > > >          | ^
> > > > >
> > > > > Thanks lkp@intel.com.
> > > > >
> > > > > This is due to config with:
> > > > >
> > > > > CONFIG_FRAME_WARN=2048
> > > > > CONFIG_NODES_SHIFT=10
> > > > >
> > > > > Where struct collapse_control has a member int
> > > > > node_load[MAX_NUMNODES], and we stack allocate one.
> > > > >
> > > > > Is this a configuration that needs to be supported? 1024 nodes seems
> > > > > like a lot and I'm not sure if these configs are randomly generated or
> > > > > are reminiscent of real systems.
> > > >
> > > > Adding 4k to the stack isn't a good thing to do.  It's trivial to
> > > > kmalloc the thing, so why not do that?
> > >
> > > Thanks, Andrew. Yeah, I just suggested that too.
> >
> > Thanks Yang / Andrew for taking the time to voice your suggestions.
> >
> > I'll go ahead and just kmalloc() the thing and fail if we can't.
> >
> > Yang, is there a reason to kmalloc() the entire struct
> > collapse_control with trailing flex array vs stack allocating the
> > struct collapse_control + kmalloc()'ing the node_load array?
>
> I don't think those two have too much difference. I don't have a
> strong preference personally. However you could choose:
>
> Define collapse_control as:
> struct collapse_control {
>     xxx;
>     ...
>     int node_load[MAX_NUMNODES];
> }
> Then you could kmalloc the whole struct.
>
> Or it could be defined as:
> struct collapse_control {
>     xxx;
>     ...
>     int *node_load;
> }
> In this way you could allocate collapse_control on stack or by
> kmalloc, then kmalloc node_load for all possible nodes instead of
> MAX_NUMNODES. This may have a better success rate since you do
> kmalloc much less memory (typically the number of possible nodes is
> much less than MAX_NUMNODES), but it may not be worth it since the
> error handling path is more complicated and it may not make too much
> difference.
>
> The first choice is definitely much simpler, you may want to try that first.

Thanks for the suggestion. The first approach also has the benefit that
one can be statically allocated for khugepaged, which simplifies the
error paths there. I'll try that.

Again, thanks for taking the time to review and help out / suggest
improvements :)

Best,
Zach

> >
> >
> > > >
> > > > I'll await some reviewer input (hopefully positive ;)) before merging
> > > > this series.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-08  0:39       ` Yang Shi
@ 2022-06-09 17:35         ` Zach O'Keefe
  2022-06-09 18:51           ` Yang Shi
  0 siblings, 1 reply; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-09 17:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Tue, Jun 7, 2022 at 5:39 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Jun 7, 2022 at 3:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > This idea was introduced by David Rientjes[1].
> > > >
> > > > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > > > synchronous collapse of memory at their own expense.
> > > >
> > > > The benefits of this approach are:
> > > >
> > > > * CPU is charged to the process that wants to spend the cycles for the
> > > >   THP
> > > > * Avoid unpredictable timing of khugepaged collapse
> > > >
> > > > An immediate user of this new functionality are malloc() implementations
> > > > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > > > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > > > zapping the pmd.  Later, when the memory is hot, the implementation
> > > > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > > > hugepage coverage and dTLB performance.  TCMalloc is such an
> > > > implementation that could benefit from this[2].
> > > >
> > > > Only privately-mapped anon memory is supported for now, but it is
> > > > expected that file and shmem support will be added later to support the
> > > > use-case of backing executable text by THPs.  Current support provided
> > > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > > > which might impair services from serving at their full rated load after
> > > > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > > > immediately realize iTLB performance prevents page sharing and demand
> > > > paging, both of which increase steady state memory footprint.  With
> > > > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > > > and lower RAM footprints.
> > > >
> > > > This call is independent of the system-wide THP sysfs settings, but will
> > > > fail for memory marked VM_NOHUGEPAGE.
> > > >
> > > > THP allocation may enter direct reclaim and/or compaction.
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > > > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> > > >
> > > > Suggested-by: David Rientjes <rientjes@google.com>
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> > > >  arch/mips/include/uapi/asm/mman.h      |   2 +
> > > >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> > > >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> > > >  include/linux/huge_mm.h                |  12 +++
> > > >  include/uapi/asm-generic/mman-common.h |   2 +
> > > >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> > > >  mm/madvise.c                           |   5 +
> > > >  8 files changed, 151 insertions(+)
> > > >
> > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > > index 4aa996423b0d..763929e814e9 100644
> > > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > > @@ -76,6 +76,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > > index 1be428663c10..c6e1fc77c996 100644
> > > > --- a/arch/mips/include/uapi/asm/mman.h
> > > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > > @@ -103,6 +103,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > > index a7ea3204a5fa..22133a6a506e 100644
> > > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > > @@ -70,6 +70,8 @@
> > > >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> > > >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> > > >
> > > > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > > > +
> > > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > > >
> > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > > index 7966a58af472..1ff0c858544f 100644
> > > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > > @@ -111,6 +111,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 648cb3ce7099..2ca2f3b41fc8 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > > > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > > >
> > > >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > > >                      int advice);
> > > > +int madvise_collapse(struct vm_area_struct *vma,
> > > > +                    struct vm_area_struct **prev,
> > > > +                    unsigned long start, unsigned long end);
> > > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > > >                            unsigned long end, long adjust_next);
> > > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > > >         BUG();
> > > >         return 0;
> > > >  }
> > > > +
> > > > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > > > +                                  struct vm_area_struct **prev,
> > > > +                                  unsigned long start, unsigned long end)
> > > > +{
> > > > +       BUG();
> > > > +       return 0;
> > >
> > > I wish -ENOSYS could have been returned, but it seems madvise()
> > > doesn't support this return value.
> >
> > This is somewhat tangential, but I agree that ENOSYS (or some other
> > errno, but ENOSYS makes most sense to me, after EINVAL, (ENOTSUP?))
> > should be anointed the dedicated return value for "madvise mode not
> > supported". Ran into this recently when wanting some form of feature
> > detection for MADV_COLLAPSE where EINVAL is overloaded  (including
> > madvise mode not supported). Happy to move this forward if others
> > agree.
>
> I did a quick test by calling MADV_HUGEPAGE on a !THP kernel; madvise()
> actually returns -EINVAL from madvise_behavior_valid(), so
> madvise_collapse() won't be called at all. So madvise_collapse() is
> basically just there to keep the !THP build happy.

Ya, exactly. I was thinking -ENOTSUP could be used in
place of -EINVAL in the madvise_behavior_valid() path to tell callers
of madvise(2) whether a given madvise mode is supported or not. At the
moment an -EINVAL return could mean a number of different things. Anyways
- that's a side conversation.
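
To make the ambiguity concrete, a userspace probe can't do much better
than something like this today (rough sketch only; the MADV_COLLAPSE
value is just the one proposed in the generic uapi header):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumes the proposed asm-generic value */
#endif

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD-sized region on x86_64 */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);		/* fault the range in first */

	if (!madvise(p, len, MADV_COLLAPSE))
		puts("MADV_COLLAPSE worked (or range was already pmd-mapped)");
	else if (errno == EINVAL)
		/* Could be "advice not supported", !THP, or just bad args. */
		puts("EINVAL - can't tell 'unsupported' from 'bad request'");
	else
		printf("supported, but collapse failed: %s\n", strerror(errno));

	munmap(p, len);
	return 0;
}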

> I think we could just return -EINVAL.

That sounds fine - as you mention, it's code that shouldn't be called
anyway and is just there to satisfy !THP. I was just basing it off
hugepage_madvise().

> >
> > > > +}
> > > > +
> > > >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > > >                                          unsigned long start,
> > > >                                          unsigned long end,
> > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > @@ -77,6 +77,8 @@
> > > >
> > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > >
> > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > +
> > > >  /* compatibility flags */
> > > >  #define MAP_FILE       0
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 4ad04f552347..073d6bb03b37 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> > > >                 set_recommended_min_free_kbytes();
> > > >         mutex_unlock(&khugepaged_mutex);
> > > >  }
> > > > +
> > > > +static int madvise_collapse_errno(enum scan_result r)
> > > > +{
> > > > +       switch (r) {
> > > > +       case SCAN_PMD_NULL:
> > > > +       case SCAN_ADDRESS_RANGE:
> > > > +       case SCAN_VMA_NULL:
> > > > +       case SCAN_PTE_NON_PRESENT:
> > > > +       case SCAN_PAGE_NULL:
> > > > +               /*
> > > > +                * Addresses in the specified range are not currently mapped,
> > > > +                * or are outside the AS of the process.
> > > > +                */
> > > > +               return -ENOMEM;
> > > > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > > > +       case SCAN_CGROUP_CHARGE_FAIL:
> > > > +               /* A kernel resource was temporarily unavailable. */
> > > > +               return -EAGAIN;
> > >
> > > I thought this should return -ENOMEM too.
> >
> > Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?
>
> No, I mean both.
>
> >
> > At least going by the comment above do_madvise(), and in the man
> > pages, for ENOMEM: "Addresses in the specified range are not currently
> > mapped, or are outside the address space of the process." doesn't
> > really apply here (though I don't know if "A kernel resource was
> > temporarily unavailable" applies any better).
>
> Yes, the man page does say so. But IIRC some MADV_ operations do
> return -ENOMEM for memory allocation failure, for example,
> MADV_POPULATE_READ/WRITE. Typically the man pages don't cover all
> cases.

Good point, I missed that MADV_POPULATE_READ/WRITE doesn't go through the
-ENOMEM -> -EAGAIN remapping at the bottom of madvise_vma_behavior().

> >
> > That said, should we differentiate between allocation and charging
> > failure? At least in the case of a userspace agent using
> > process_madvise(2) to collapse memory on behalf of others, knowing
> > "this memcg is at its limit" vs "no THPs available" would be valuable.
> > Maybe the former should be EBUSY?
>
> IMHO we don't have to differentiate allocation and charging.

After some consideration (thanks for starting this discussion and
prompting me to do so), I do think it's very valuable for callers to
know when THP allocation fails, and that an errno should be reserved
for that. The caller needs to distinguish a generic error specific to
the memory being collapsed from a THP allocation failure, to help
guide next actions: fall back to another strategy, sleep, MADV_DONTNEED /
free memory elsewhere, etc.

As a concrete example, process init code that tries to back certain
segments of text by hugepages. Some existing strategies folks use for
this are CONFIG_READ_ONLY_THP_FOR_FS + khugepaged, and anon mremap(2)
tricks. CONFIG_READ_ONLY_THP_FOR_FS + MADV_COLLAPSE might add a third,
and if it fails, it'd be nice to know which other option to fall back
to, depending on how badly the user wants THP backing. If THP
allocation fails, then likely the anon mremap(2) trick will fail too
(unless some reclaim/compaction is done).

Less immediately concrete, but a userspace agent seeking to optimize
system-wide THP utilization surely wants to know when it's exhausted
its precious THP supply.

So I'd like to see -EAGAIN reserved for THP allocation failure
(-ENOMEM is taken by AS errors, and it'd be nice to be consistent with
other modes here). I think -EBUSY for memcg charging makes sense, and
tells the caller something actionable and useful, so I'd like to see
it differentiated from -ENOMEM.

Would appreciate feedback from folks here before setting these in
stone and preventing the errno from ever being useful to callers.
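
Concretely, roughly the mapping I have in mind (sketch only):

static int madvise_collapse_errno(enum scan_result r)
{
	switch (r) {
	case SCAN_ALLOC_HUGE_PAGE_FAIL:
		/* No THP available - caller may reclaim/compact and retry */
		return -EAGAIN;
	case SCAN_CGROUP_CHARGE_FAIL:
		/* The memcg is at its limit - a different remedy applies */
		return -EBUSY;
	case SCAN_PMD_NULL:
	case SCAN_ADDRESS_RANGE:
	case SCAN_VMA_NULL:
	case SCAN_PTE_NON_PRESENT:
	case SCAN_PAGE_NULL:
		/* Range not (fully) mapped or outside the AS, as before */
		return -ENOMEM;
	default:
		return -EINVAL;
	}
}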

> >
> > > > +       default:
> > > > +               return -EINVAL;
> > > > +       }
> > > > +}
> > > > +
> > > > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > +                    unsigned long start, unsigned long end)
> > > > +{
> > > > +       struct collapse_control cc = {
> > > > +               .enforce_page_heuristics = false,
> > > > +               .enforce_thp_enabled = false,
> > > > +               .last_target_node = NUMA_NO_NODE,
> > > > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > > > +       };
> > > > +       struct mm_struct *mm = vma->vm_mm;
> > > > +       unsigned long hstart, hend, addr;
> > > > +       int thps = 0, last_fail = SCAN_FAIL;
> > > > +       bool mmap_locked = true;
> > > > +
> > > > +       BUG_ON(vma->vm_start > start);
> > > > +       BUG_ON(vma->vm_end < end);
> > > > +
> > > > +       *prev = vma;
> > > > +
> > > > +       /* TODO: Support file/shmem */
> > > > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > +               return -EINVAL;
> > > > +
> > > > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > > > +       hend = end & HPAGE_PMD_MASK;
> > > > +
> > > > +       /*
> > > > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > > > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > > +        * Note that hugepage_vma_check() doesn't enforce that
> > > > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > > +        * must be set (i.e. "never" mode)
> > > > +        */
> > > > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> > >
> > > hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> > > running for a unsuitable vma, hugepage_vma_revalidate() called by
> > > khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> > > of effort. So, it is better to check vma size upfront.
> >
> > This actually does check the vma size, but it's subtle. hstart and
> > hend are clamped to the first/last
> > hugepage-aligned address covered by [start,end], which are themselves
> > contained in vma->vm_start/vma->vm_end, respectively. We then check
> > that addr = hstart < hend ; so if the main loop passes the first
> > check, we know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=
>
> Aha, yes, I overlooked that.
>
> > vma->vm_end. Agreed that we might be doing mmgrab() and
> > lru_add_drain() needlessly though.
>
> Yeah
>
> >
> > > BTW, my series moved the vma size check in hugepage_vma_check(), so if
> > > your series could be based on top of that, you get that for free.
> >
> > I'll try rebasing on top of your series, thank you!
>
> You don't have to do it right now. I don't know what series will be
> merged to mm tree first. Just a heads up.

Thanks! Seems beneficial here though, so I'll do that and add a note
in the cover letter.

> >
> > > > +               return -EINVAL;
> > > > +
> > > > +       mmgrab(mm);
> > > > +       lru_add_drain();
> > > > +
> > > > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > > > +               int result = SCAN_FAIL;
> > > > +               bool retry = true;  /* Allow one retry per hugepage */
> > > > +retry:
> > > > +               if (!mmap_locked) {
> > > > +                       cond_resched();
> > > > +                       mmap_read_lock(mm);
> > "> > +                       mmap_locked = true;
> > > > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
> > >
> > > How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> > > It seems more consistent.
> >
> > Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
> > enum scan_result codes back to callers"
> >
> > > > +                       if (result) {
> > > > +                               last_fail = result;
> > > > +                               goto out_nolock;
> > > > +                       }
> > > > +               }
> > > > +               mmap_assert_locked(mm);
> > > > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > > > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > > > +               if (!mmap_locked)
> > > > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > > +
> > > > +               switch (result) {
> > > > +               case SCAN_SUCCEED:
> > > > +               case SCAN_PMD_MAPPED:
> > > > +                       ++thps;
> > > > +                       break;
> > > > +               /* Whitelisted set of results where continuing OK */
> > > > +               case SCAN_PMD_NULL:
> > > > +               case SCAN_PTE_NON_PRESENT:
> > > > +               case SCAN_PTE_UFFD_WP:
> > > > +               case SCAN_PAGE_RO:
> > > > +               case SCAN_LACK_REFERENCED_PAGE:
> > > > +               case SCAN_PAGE_NULL:
> > > > +               case SCAN_PAGE_COUNT:
> > > > +               case SCAN_PAGE_LOCK:
> > > > +               case SCAN_PAGE_COMPOUND:
> > > > +                       last_fail = result;
> > > > +                       break;
> > > > +               case SCAN_PAGE_LRU:
> > > > +                       if (retry) {
> > > > +                               lru_add_drain_all();
> > > > +                               retry = false;
> > > > +                               goto retry;
> > >
> > > I'm not sure whether the retry logic is necessary or not, do you have
> > > any data about how retry improves the success rate? You could just
> > > replace lru_add_drain() to lru_add_drain_all() and remove the retry
> > > logic IMHO. I'd prefer to keep it simple at the moment personally.
> >
> > Transparently, I've only had success hitting this logic on small vms
> > under selftests.  That said, it does happen, and I can't imagine this
> > hurting, especially on larger systems + tasks using lots of mem.
> > Originally, I didn't plan to do this, but as things shook out and we
> > had SCAN_PAGE_LRU so readily available, it seemed like we got this for
> > free.
>
> "small vms" mean small virtual machines?
>
> When the logic is hit, does lru_add_drain_all() help to improve the
> success rate?

Ya, I've been doing most dev/testing on small, 2 cpu virtual machines.
I've been mmap()ing a multi-hugepage sized region, then faulting it in
- presumably being preempted and rescheduled on another cpu while
iterating over the region and faulting. I ran into the
lru_add_drain_all() vs lru_add_drain() difference during testing/dev since
*occasionally* (admittedly, not very often IIRC) the former helped
tests pass.

That said, I set out to try and repro this semi-reliably, with little
success - I'm almost always finding the pages on the LRU. Still
playing around with this...

> I don't mean this hurts anything. I'm just thinking about whether the
> extra complexity is worth it or not. And calling lru_add_drain_all()
> while holding mmap_lock might have some scalability issues since
> draining lru for all is not cheap.

Good point. At the very least, it seems like we should unlock
mmap_lock before retry.
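
i.e. something along these lines in the SCAN_PAGE_LRU case (sketch only,
on top of the loop above):

		case SCAN_PAGE_LRU:
			if (retry) {
				/* Don't hold mmap_lock across the drain */
				if (mmap_locked) {
					mmap_read_unlock(mm);
					mmap_locked = false;
					/* tell caller we dropped mmap_lock */
					*prev = NULL;
				}
				lru_add_drain_all();
				retry = false;
				/* retry path re-takes the lock + revalidates */
				goto retry;
			}
			fallthrough;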




>
>
> >
> > > > +                       }
> > > > +                       fallthrough;
> > > > +               default:
> > > > +                       last_fail = result;
> > > > +                       /* Other error, exit */
> > > > +                       goto out_maybelock;
> > > > +               }
> > > > +       }
> > > > +
> > > > +out_maybelock:
> > > > +       /* Caller expects us to hold mmap_lock on return */
> > > > +       if (!mmap_locked)
> > > > +               mmap_read_lock(mm);
> > > > +out_nolock:
> > > > +       mmap_assert_locked(mm);
> > > > +       mmdrop(mm);
> > > > +
> > > > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > > > +                       : madvise_collapse_errno(last_fail);
> > > > +}
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 46feb62ce163..eccac2620226 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> > > >         case MADV_FREE:
> > > >         case MADV_POPULATE_READ:
> > > >         case MADV_POPULATE_WRITE:
> > > > +       case MADV_COLLAPSE:
> > > >                 return 0;
> > > >         default:
> > > >                 /* be safe, default to 1. list exceptions explicitly */
> > > > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > >                 if (error)
> > > >                         goto out;
> > > >                 break;
> > > > +       case MADV_COLLAPSE:
> > > > +               return madvise_collapse(vma, prev, start, end);
> > > >         }
> > > >
> > > >         anon_name = anon_vma_name(vma);
> > > > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > >         case MADV_HUGEPAGE:
> > > >         case MADV_NOHUGEPAGE:
> > > > +       case MADV_COLLAPSE:
> > > >  #endif
> > > >         case MADV_DONTDUMP:
> > > >         case MADV_DODUMP:
> > > > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > >   *             transparent huge pages so the existing pages will not be
> > > >   *             coalesced into THP and new pages will not be allocated as THP.
> > > > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > >   *             from being included in its core dump.
> > > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > > --
> > > > 2.36.1.255.ge46751e96f-goog
> > > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-09 17:35         ` Zach O'Keefe
@ 2022-06-09 18:51           ` Yang Shi
  2022-06-10 14:51             ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Yang Shi @ 2022-06-09 18:51 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Thu, Jun 9, 2022 at 10:35 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Jun 7, 2022 at 5:39 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Jun 7, 2022 at 3:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > This idea was introduced by David Rientjes[1].
> > > > >
> > > > > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > > > > synchronous collapse of memory at their own expense.
> > > > >
> > > > > The benefits of this approach are:
> > > > >
> > > > > * CPU is charged to the process that wants to spend the cycles for the
> > > > >   THP
> > > > > * Avoid unpredictable timing of khugepaged collapse
> > > > >
> > > > > An immediate user of this new functionality are malloc() implementations
> > > > > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > > > > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > > > > zapping the pmd.  Later, when the memory is hot, the implementation
> > > > > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > > > > hugepage coverage and dTLB performance.  TCMalloc is such an
> > > > > implementation that could benefit from this[2].
> > > > >
> > > > > Only privately-mapped anon memory is supported for now, but it is
> > > > > expected that file and shmem support will be added later to support the
> > > > > use-case of backing executable text by THPs.  Current support provided
> > > > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > > > > which might impair services from serving at their full rated load after
> > > > > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > > > > immediately realize iTLB performance prevents page sharing and demand
> > > > > paging, both of which increase steady state memory footprint.  With
> > > > > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > > > > and lower RAM footprints.
> > > > >
> > > > > This call is independent of the system-wide THP sysfs settings, but will
> > > > > fail for memory marked VM_NOHUGEPAGE.
> > > > >
> > > > > THP allocation may enter direct reclaim and/or compaction.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > > > > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> > > > >
> > > > > Suggested-by: David Rientjes <rientjes@google.com>
> > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > ---
> > > > >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> > > > >  arch/mips/include/uapi/asm/mman.h      |   2 +
> > > > >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> > > > >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> > > > >  include/linux/huge_mm.h                |  12 +++
> > > > >  include/uapi/asm-generic/mman-common.h |   2 +
> > > > >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> > > > >  mm/madvise.c                           |   5 +
> > > > >  8 files changed, 151 insertions(+)
> > > > >
> > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > > > index 4aa996423b0d..763929e814e9 100644
> > > > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > > > @@ -76,6 +76,8 @@
> > > > >
> > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > >
> > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > +
> > > > >  /* compatibility flags */
> > > > >  #define MAP_FILE       0
> > > > >
> > > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > > > index 1be428663c10..c6e1fc77c996 100644
> > > > > --- a/arch/mips/include/uapi/asm/mman.h
> > > > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > > > @@ -103,6 +103,8 @@
> > > > >
> > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > >
> > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > +
> > > > >  /* compatibility flags */
> > > > >  #define MAP_FILE       0
> > > > >
> > > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > > > index a7ea3204a5fa..22133a6a506e 100644
> > > > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > > > @@ -70,6 +70,8 @@
> > > > >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> > > > >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> > > > >
> > > > > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > > > > +
> > > > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > > > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > > > >
> > > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > > > index 7966a58af472..1ff0c858544f 100644
> > > > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > > > @@ -111,6 +111,8 @@
> > > > >
> > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > >
> > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > +
> > > > >  /* compatibility flags */
> > > > >  #define MAP_FILE       0
> > > > >
> > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > > index 648cb3ce7099..2ca2f3b41fc8 100644
> > > > > --- a/include/linux/huge_mm.h
> > > > > +++ b/include/linux/huge_mm.h
> > > > > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > > > >
> > > > >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > > > >                      int advice);
> > > > > +int madvise_collapse(struct vm_area_struct *vma,
> > > > > +                    struct vm_area_struct **prev,
> > > > > +                    unsigned long start, unsigned long end);
> > > > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > > > >                            unsigned long end, long adjust_next);
> > > > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > > > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > > > >         BUG();
> > > > >         return 0;
> > > > >  }
> > > > > +
> > > > > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > > > > +                                  struct vm_area_struct **prev,
> > > > > +                                  unsigned long start, unsigned long end)
> > > > > +{
> > > > > +       BUG();
> > > > > +       return 0;
> > > >
> > > > I wish -ENOSYS could have been returned, but it seems madvise()
> > > > doesn't support this return value.
> > >
> > > This is somewhat tangential, but I agree that ENOSYS (or some other
> > > errno, but ENOSYS makes most sense to me, after EINVAL, (ENOTSUP?))
> > > should be anointed the dedicated return value for "madvise mode not
> > > supported". Ran into this recently when wanting some form of feature
> > > detection for MADV_COLLAPSE where EINVAL is overloaded  (including
> > > madvise mode not supported). Happy to move this forward if others
> > > agree.
> >
> > I did a quick test by calling MADV_HUGEPAGE on !THP kernel, madvise()
> > actually returns -EINVAL by madvise_behavior_valid(). So
> > madvise_collapse() won't be called at all. So madvise_collapse() is
> > basically used to make !THP compile happy.
>
> Ya, exactly. I was thinking -ENOTSUP could be used in
> place of -ENVAL in the madvise_behavior_valid() path to tell callers
> of madvise(2) if a given madvise mode was supported or not. At the
> moment -EINVAL return could mean a number of different things. Anyways
> - that's a side conversation.
>
> > I think we could just return -EINVAL.
>
> That sounds fine - as you mention it's code that shouldn't be called
> anyways and is just there to satisfy !THP. Was just basing off
> hugepage_madvise().

You could modify hugepage_madvise() to simply return -EINVAL in your
patch too. It is not worth a separate patch IMHO.
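
I.e. for the !THP stub, just (sketch):

static inline int madvise_collapse(struct vm_area_struct *vma,
				   struct vm_area_struct **prev,
				   unsigned long start, unsigned long end)
{
	return -EINVAL;
}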

>
> > >
> > > > > +}
> > > > > +
> > > > >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > > > >                                          unsigned long start,
> > > > >                                          unsigned long end,
> > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > > @@ -77,6 +77,8 @@
> > > > >
> > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > >
> > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > +
> > > > >  /* compatibility flags */
> > > > >  #define MAP_FILE       0
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 4ad04f552347..073d6bb03b37 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> > > > >                 set_recommended_min_free_kbytes();
> > > > >         mutex_unlock(&khugepaged_mutex);
> > > > >  }
> > > > > +
> > > > > +static int madvise_collapse_errno(enum scan_result r)
> > > > > +{
> > > > > +       switch (r) {
> > > > > +       case SCAN_PMD_NULL:
> > > > > +       case SCAN_ADDRESS_RANGE:
> > > > > +       case SCAN_VMA_NULL:
> > > > > +       case SCAN_PTE_NON_PRESENT:
> > > > > +       case SCAN_PAGE_NULL:
> > > > > +               /*
> > > > > +                * Addresses in the specified range are not currently mapped,
> > > > > +                * or are outside the AS of the process.
> > > > > +                */
> > > > > +               return -ENOMEM;
> > > > > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > > > > +       case SCAN_CGROUP_CHARGE_FAIL:
> > > > > +               /* A kernel resource was temporarily unavailable. */
> > > > > +               return -EAGAIN;
> > > >
> > > > I thought this should return -ENOMEM too.
> > >
> > > Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?
> >
> > No, I mean both.
> >
> > >
> > > At least going by the comment above do_madvise(), and in the man
> > > pages, for ENOMEM: "Addresses in the specified range are not currently
> > > mapped, or are outside the address space of the process." doesn't
> > > really apply here (though I don't know if "A kernel resource was
> > > temporarily unavailable" applies any better).
> >
> > Yes, the man page does say so. But IIRC some MADV_ operations do
> > return -ENOMEM for memory allocation failure, for example,
> > MADV_POPULATE_READ/WRITE. Typically the man pages don't cover all
> > cases.
>
> Good point, I missed MADV_POPULATE_READ/WRITE didn't go through the
> -ENOMEM -> -EAGAIN remapping at the bottom of madvise_vma_behavior().
>
> > >
> > > That said, should we differentiate between allocation and charging
> > > failure? At least in the case of a userspace agent using
> > > process_madvise(2) to collapse memory on behalf of others, knowing
> > > "this memcg is at its limit" vs "no THPs available" would be valuable.
> > > Maybe the former should be EBUSY?
> >
> > IMHO we don't have to differentiate allocation and charging.
>
> After some consideration (thanks for starting this discussion and
> prompting me to do so), I do think it's very valuable for callers to
> know when THP allocation fails, and that an errno should be reserved
> for that. The caller needs to know when a generic error, specific to
> the memory being collapsed, occurs vs THP allocation failure to help
> guide next actions: fallback to other strategy, sleep, MADV_DONTNEED /
> free memory elsewhere, etc.
>
> As a concrete example, process init code that tries to back certain
> segments of text by hugepages. Some existing strategies folks use for
> this are CONFIG_READ_ONLY_THP_FOR_FS + khugepaged, and anon mremap(2)
> tricks. CONFIG_READ_ONLY_THP_FOR_FS + MADV_COLLPASE might add a third,
> and if it fails, it'd be nice to know which other option to fall back
> to, depending on how badly the user wants THP backing. If THP
> allocation fails, then likely the anon mremap(2) trick will fail too
> (unless some reclaim/compaction is done).
>
> Less immediately concrete, but a userspace agent seeking to optimize
> system-wide THP utilization surely wants to know when it's exhausted
> its precious THP supply.
>
> So I'd like to see -EAGAIN reserved for THP allocation failure
> (-ENOMEM is taken by AS errors, and it'd be nice to be consistent with
> other modes here). I think -EBUSY for memcg charging makes sense, and
> tells the caller something actionable and useful, so I'd like to see
> it differentiated from -ENOMEM.

OK, it makes some sense to differentiate from -ENOMEM. But I still
don't see much value in differentiating allocation failure vs
charging failure. When charging fails, other tricks are unlikely to
succeed either IMHO unless more aggressive reclaim is done.

And since GFP_TRANSHUGE is used by MADV_COLLAPSE, direct reclaim
has already been tried before returning failure for both allocation
and charging.
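
For reference (quoting include/linux/gfp.h from memory, so please
double-check the exact flags there):

#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE		(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

i.e. __GFP_DIRECT_RECLAIM is already part of GFP_TRANSHUGE.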

>
> Would appreciate feedback from folks here before setting these in
> stone and preventing the errno from ever being useful to callers.
>
> > >
> > > > > +       default:
> > > > > +               return -EINVAL;
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > > +                    unsigned long start, unsigned long end)
> > > > > +{
> > > > > +       struct collapse_control cc = {
> > > > > +               .enforce_page_heuristics = false,
> > > > > +               .enforce_thp_enabled = false,
> > > > > +               .last_target_node = NUMA_NO_NODE,
> > > > > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > > > > +       };
> > > > > +       struct mm_struct *mm = vma->vm_mm;
> > > > > +       unsigned long hstart, hend, addr;
> > > > > +       int thps = 0, last_fail = SCAN_FAIL;
> > > > > +       bool mmap_locked = true;
> > > > > +
> > > > > +       BUG_ON(vma->vm_start > start);
> > > > > +       BUG_ON(vma->vm_end < end);
> > > > > +
> > > > > +       *prev = vma;
> > > > > +
> > > > > +       /* TODO: Support file/shmem */
> > > > > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > > > > +       hend = end & HPAGE_PMD_MASK;
> > > > > +
> > > > > +       /*
> > > > > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > > > > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > > > +        * Note that hugepage_vma_check() doesn't enforce that
> > > > > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > > > +        * must be set (i.e. "never" mode)
> > > > > +        */
> > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> > > >
> > > > hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> > > > running for a unsuitable vma, hugepage_vma_revalidate() called by
> > > > khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> > > > of effort. So, it is better to check vma size upfront.
> > >
> > > This actually does check the vma size, but it's subtle. hstart and
> > > hend are clamped to the first/last
> > > hugepaged-aligned address covered by [start,end], which are themselves
> > > contained in vma->vm_start/vma->vm_end, respectively. We then check
> > > that addr = hstart < hend ; so if the main loop passes the first
> > > check, we know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=
> >
> > Aha, yes, I overlooked that.
> >
> > > vma->vma_end. Agreed that we might be needlessly doing mmgrab() and
> > > lru_add_drain() needlessly though.
> >
> > Yeah
> >
> > >
> > > > BTW, my series moved the vma size check in hugepage_vma_check(), so if
> > > > your series could be based on top of that, you get that for free.
> > >
> > > I'll try rebasing on top of your series, thank you!
> >
> > You don't have to do it right now. I don't know what series will be
> > merged to mm tree first. Just a heads up.
>
> Thanks! Seems beneficial here though, so I'll do that and add a note
> in the cover letter.

Thank you so much. That would make my life easier :-)

>
> > >
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       mmgrab(mm);
> > > > > +       lru_add_drain();
> > > > > +
> > > > > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > > > > +               int result = SCAN_FAIL;
> > > > > +               bool retry = true;  /* Allow one retry per hugepage */
> > > > > +retry:
> > > > > +               if (!mmap_locked) {
> > > > > +                       cond_resched();
> > > > > +                       mmap_read_lock(mm);
> > > "> > +                       mmap_locked = true;
> > > > > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
> > > >
> > > > How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> > > > It seems more consistent.
> > >
> > > Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
> > > enum scan_result codes back to callers"
> > >
> > > > > +                       if (result) {
> > > > > +                               last_fail = result;
> > > > > +                               goto out_nolock;
> > > > > +                       }
> > > > > +               }
> > > > > +               mmap_assert_locked(mm);
> > > > > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > > > > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > > > > +               if (!mmap_locked)
> > > > > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > > > +
> > > > > +               switch (result) {
> > > > > +               case SCAN_SUCCEED:
> > > > > +               case SCAN_PMD_MAPPED:
> > > > > +                       ++thps;
> > > > > +                       break;
> > > > > +               /* Whitelisted set of results where continuing OK */
> > > > > +               case SCAN_PMD_NULL:
> > > > > +               case SCAN_PTE_NON_PRESENT:
> > > > > +               case SCAN_PTE_UFFD_WP:
> > > > > +               case SCAN_PAGE_RO:
> > > > > +               case SCAN_LACK_REFERENCED_PAGE:
> > > > > +               case SCAN_PAGE_NULL:
> > > > > +               case SCAN_PAGE_COUNT:
> > > > > +               case SCAN_PAGE_LOCK:
> > > > > +               case SCAN_PAGE_COMPOUND:
> > > > > +                       last_fail = result;
> > > > > +                       break;
> > > > > +               case SCAN_PAGE_LRU:
> > > > > +                       if (retry) {
> > > > > +                               lru_add_drain_all();
> > > > > +                               retry = false;
> > > > > +                               goto retry;
> > > >
> > > > I'm not sure whether the retry logic is necessary or not, do you have
> > > > any data about how retry improves the success rate? You could just
> > > > replace lru_add_drain() to lru_add_drain_all() and remove the retry
> > > > logic IMHO. I'd prefer to keep it simple at the moment personally.
> > >
> > > Transparently, I've only had success hitting this logic on small vms
> > > under selftests.  That said, it does happen, and I can't imagine this
> > > hurting, especially on larger systems + tasks using lots of mem.
> > > Originally, I didn't plan to do this, but as things shook out and we
> > > had SCAN_PAGE_LRU so readily available, it seemed like we got this for
> > > free.
> >
> > "small vms" mean small virtual machines?
> >
> > When the logic is hit, does lru_add_drain_all() help to improve the
> > success rate?
>
> Ya, I've been doing most dev/testing on small, 2 cpu virtual machines.
> I've been mmap()ing a multi-hugepage sized region, then faulting it in
> - presumably being preempted and rescheduled on another cpu while
> iterating over the region and faulting. I had ran into the
> lru_add_drain_all() vs lru_add_drain() during testing/dev since
> *occasionally* (admittedly, not very often the IIRC the former helped
> tests pass.
>
> That said, I set out to try and repro this semi-reliably, with little
> success - I'm almost always finding the pages on the LRU. Still
> playing around with this..
>
> > I don't mean this hurts anything. I'm just thinking about whether the
> > extra complexity is worth it or not. And calling lru_add_drain_all()
> > with holding mmap_lock might have some scalability issues since
> > draining lru for all is not cheap.
>
> Good point. At the very least, it seems like we should unlock
> mmap_lock before retry.

You could, but it still sounds like overkill to me. All the extra
complexity is just used to optimize for small machines which are
unlikely to run with THP in real life TBH.

>
>
>
>
> >
> >
> > >
> > > > > +                       }
> > > > > +                       fallthrough;
> > > > > +               default:
> > > > > +                       last_fail = result;
> > > > > +                       /* Other error, exit */
> > > > > +                       goto out_maybelock;
> > > > > +               }
> > > > > +       }
> > > > > +
> > > > > +out_maybelock:
> > > > > +       /* Caller expects us to hold mmap_lock on return */
> > > > > +       if (!mmap_locked)
> > > > > +               mmap_read_lock(mm);
> > > > > +out_nolock:
> > > > > +       mmap_assert_locked(mm);
> > > > > +       mmdrop(mm);
> > > > > +
> > > > > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > > > > +                       : madvise_collapse_errno(last_fail);
> > > > > +}
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 46feb62ce163..eccac2620226 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> > > > >         case MADV_FREE:
> > > > >         case MADV_POPULATE_READ:
> > > > >         case MADV_POPULATE_WRITE:
> > > > > +       case MADV_COLLAPSE:
> > > > >                 return 0;
> > > > >         default:
> > > > >                 /* be safe, default to 1. list exceptions explicitly */
> > > > > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > > >                 if (error)
> > > > >                         goto out;
> > > > >                 break;
> > > > > +       case MADV_COLLAPSE:
> > > > > +               return madvise_collapse(vma, prev, start, end);
> > > > >         }
> > > > >
> > > > >         anon_name = anon_vma_name(vma);
> > > > > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> > > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > >         case MADV_HUGEPAGE:
> > > > >         case MADV_NOHUGEPAGE:
> > > > > +       case MADV_COLLAPSE:
> > > > >  #endif
> > > > >         case MADV_DONTDUMP:
> > > > >         case MADV_DODUMP:
> > > > > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > > >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > > >   *             transparent huge pages so the existing pages will not be
> > > > >   *             coalesced into THP and new pages will not be allocated as THP.
> > > > > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > > >   *             from being included in its core dump.
> > > > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > > > --
> > > > > 2.36.1.255.ge46751e96f-goog
> > > > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-06-09 18:51           ` Yang Shi
@ 2022-06-10 14:51             ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-10 14:51 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Thu, Jun 9, 2022 at 11:52 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jun 9, 2022 at 10:35 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Tue, Jun 7, 2022 at 5:39 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Jun 7, 2022 at 3:48 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > On Mon, Jun 6, 2022 at 4:53 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 3, 2022 at 5:40 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > >
> > > > > > This idea was introduced by David Rientjes[1].
> > > > > >
> > > > > > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > > > > > synchronous collapse of memory at their own expense.
> > > > > >
> > > > > > The benefits of this approach are:
> > > > > >
> > > > > > * CPU is charged to the process that wants to spend the cycles for the
> > > > > >   THP
> > > > > > * Avoid unpredictable timing of khugepaged collapse
> > > > > >
> > > > > > An immediate user of this new functionality are malloc() implementations
> > > > > > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > > > > > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > > > > > zapping the pmd.  Later, when the memory is hot, the implementation
> > > > > > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > > > > > hugepage coverage and dTLB performance.  TCMalloc is such an
> > > > > > implementation that could benefit from this[2].
> > > > > >
> > > > > > Only privately-mapped anon memory is supported for now, but it is
> > > > > > expected that file and shmem support will be added later to support the
> > > > > > use-case of backing executable text by THPs.  Current support provided
> > > > > > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > > > > > which might impair services from serving at their full rated load after
> > > > > > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > > > > > immediately realize iTLB performance prevents page sharing and demand
> > > > > > paging, both of which increase steady state memory footprint.  With
> > > > > > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > > > > > and lower RAM footprints.
> > > > > >
> > > > > > This call is independent of the system-wide THP sysfs settings, but will
> > > > > > fail for memory marked VM_NOHUGEPAGE.
> > > > > >
> > > > > > THP allocation may enter direct reclaim and/or compaction.
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > > > > > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> > > > > >
> > > > > > Suggested-by: David Rientjes <rientjes@google.com>
> > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > > ---
> > > > > >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> > > > > >  arch/mips/include/uapi/asm/mman.h      |   2 +
> > > > > >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> > > > > >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> > > > > >  include/linux/huge_mm.h                |  12 +++
> > > > > >  include/uapi/asm-generic/mman-common.h |   2 +
> > > > > >  mm/khugepaged.c                        | 124 +++++++++++++++++++++++++
> > > > > >  mm/madvise.c                           |   5 +
> > > > > >  8 files changed, 151 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > > > > index 4aa996423b0d..763929e814e9 100644
> > > > > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > > > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > > > > @@ -76,6 +76,8 @@
> > > > > >
> > > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > > >
> > > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > > +
> > > > > >  /* compatibility flags */
> > > > > >  #define MAP_FILE       0
> > > > > >
> > > > > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > > > > index 1be428663c10..c6e1fc77c996 100644
> > > > > > --- a/arch/mips/include/uapi/asm/mman.h
> > > > > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > > > > @@ -103,6 +103,8 @@
> > > > > >
> > > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > > >
> > > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > > +
> > > > > >  /* compatibility flags */
> > > > > >  #define MAP_FILE       0
> > > > > >
> > > > > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > > > > index a7ea3204a5fa..22133a6a506e 100644
> > > > > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > > > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > > > > @@ -70,6 +70,8 @@
> > > > > >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> > > > > >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> > > > > >
> > > > > > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > > > > > +
> > > > > >  #define MADV_HWPOISON     100          /* poison a page for testing */
> > > > > >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> > > > > >
> > > > > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > > > > index 7966a58af472..1ff0c858544f 100644
> > > > > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > > > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > > > > @@ -111,6 +111,8 @@
> > > > > >
> > > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > > >
> > > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > > +
> > > > > >  /* compatibility flags */
> > > > > >  #define MAP_FILE       0
> > > > > >
> > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > > > index 648cb3ce7099..2ca2f3b41fc8 100644
> > > > > > --- a/include/linux/huge_mm.h
> > > > > > +++ b/include/linux/huge_mm.h
> > > > > > @@ -240,6 +240,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> > > > > >
> > > > > >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > > > > >                      int advice);
> > > > > > +int madvise_collapse(struct vm_area_struct *vma,
> > > > > > +                    struct vm_area_struct **prev,
> > > > > > +                    unsigned long start, unsigned long end);
> > > > > >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > > > > >                            unsigned long end, long adjust_next);
> > > > > >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > > > > @@ -395,6 +398,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > > > > >         BUG();
> > > > > >         return 0;
> > > > > >  }
> > > > > > +
> > > > > > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > > > > > +                                  struct vm_area_struct **prev,
> > > > > > +                                  unsigned long start, unsigned long end)
> > > > > > +{
> > > > > > +       BUG();
> > > > > > +       return 0;
> > > > >
> > > > > I wish -ENOSYS could have been returned, but it seems madvise()
> > > > > doesn't support this return value.
> > > >
> > > > This is somewhat tangential, but I agree that ENOSYS (or some other
> > > > errno, but ENOSYS makes most sense to me, after EINVAL, (ENOTSUP?))
> > > > should be anointed the dedicated return value for "madvise mode not
> > > > supported". Ran into this recently when wanting some form of feature
> > > > detection for MADV_COLLAPSE where EINVAL is overloaded  (including
> > > > madvise mode not supported). Happy to move this forward if others
> > > > agree.
> > >
> > > I did a quick test by calling MADV_HUGEPAGE on !THP kernel, madvise()
> > > actually returns -EINVAL by madvise_behavior_valid(). So
> > > madvise_collapse() won't be called at all. So madvise_collapse() is
> > > basically used to make !THP compile happy.
> >
> > Ya, exactly. I was thinking -ENOTSUP could be used in
> > place of -ENVAL in the madvise_behavior_valid() path to tell callers
> > of madvise(2) if a given madvise mode was supported or not. At the
> > moment -EINVAL return could mean a number of different things. Anyways
> > - that's a side conversation.
> >
> > > I think we could just return -EINVAL.
> >
> > That sounds fine - as you mention it's code that shouldn't be called
> > anyways and is just there to satisfy !THP. Was just basing off
> > hugepage_madvise().
>
> You could modify huepage_madvise() to simply return -EINVAL in your
> patch too. It is not worth for a separate patch IMHO.

Sure, if you think it's a worthwhile cleanup to remove a BUG(), then I
don't see the harm.

> >
> > > >
> > > > > > +}
> > > > > > +
> > > > > >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > > > > >                                          unsigned long start,
> > > > > >                                          unsigned long end,
> > > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > > > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > > > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > > > @@ -77,6 +77,8 @@
> > > > > >
> > > > > >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> > > > > >
> > > > > > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > > > > > +
> > > > > >  /* compatibility flags */
> > > > > >  #define MAP_FILE       0
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 4ad04f552347..073d6bb03b37 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -2404,3 +2404,127 @@ void khugepaged_min_free_kbytes_update(void)
> > > > > >                 set_recommended_min_free_kbytes();
> > > > > >         mutex_unlock(&khugepaged_mutex);
> > > > > >  }
> > > > > > +
> > > > > > +static int madvise_collapse_errno(enum scan_result r)
> > > > > > +{
> > > > > > +       switch (r) {
> > > > > > +       case SCAN_PMD_NULL:
> > > > > > +       case SCAN_ADDRESS_RANGE:
> > > > > > +       case SCAN_VMA_NULL:
> > > > > > +       case SCAN_PTE_NON_PRESENT:
> > > > > > +       case SCAN_PAGE_NULL:
> > > > > > +               /*
> > > > > > +                * Addresses in the specified range are not currently mapped,
> > > > > > +                * or are outside the AS of the process.
> > > > > > +                */
> > > > > > +               return -ENOMEM;
> > > > > > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > > > > > +       case SCAN_CGROUP_CHARGE_FAIL:
> > > > > > +               /* A kernel resource was temporarily unavailable. */
> > > > > > +               return -EAGAIN;
> > > > >
> > > > > I thought this should return -ENOMEM too.
> > > >
> > > > Do you mean specifically SCAN_CGROUP_CHARGE_FAIL?
> > >
> > > No, I mean both.
> > >
> > > >
> > > > At least going by the comment above do_madvise(), and in the man
> > > > pages, for ENOMEM: "Addresses in the specified range are not currently
> > > > mapped, or are outside the address space of the process." doesn't
> > > > really apply here (though I don't know if "A kernel resource was
> > > > temporarily unavailable" applies any better).
> > >
> > > Yes, the man page does say so. But IIRC some MADV_ operations do
> > > return -ENOMEM for memory allocation failure, for example,
> > > MADV_POPULATE_READ/WRITE. Typically the man pages don't cover all
> > > cases.
> >
> > Good point, I missed MADV_POPULATE_READ/WRITE didn't go through the
> > -ENOMEM -> -EAGAIN remapping at the bottom of madvise_vma_behavior().
> >
> > > >
> > > > That said, should we differentiate between allocation and charging
> > > > failure? At least in the case of a userspace agent using
> > > > process_madvise(2) to collapse memory on behalf of others, knowing
> > > > "this memcg is at its limit" vs "no THPs available" would be valuable.
> > > > Maybe the former should be EBUSY?
> > >
> > > IMHO we don't have to differentiate allocation and charging.
> >
> > After some consideration (thanks for starting this discussion and
> > prompting me to do so), I do think it's very valuable for callers to
> > know when THP allocation fails, and that an errno should be reserved
> > for that. The caller needs to know when a generic error, specific to
> > the memory being collapsed, occurs vs THP allocation failure to help
> > guide next actions: fallback to other strategy, sleep, MADV_DONTNEED /
> > free memory elsewhere, etc.
> >
> > As a concrete example, process init code that tries to back certain
> > segments of text by hugepages. Some existing strategies folks use for
> > this are CONFIG_READ_ONLY_THP_FOR_FS + khugepaged, and anon mremap(2)
> > tricks. CONFIG_READ_ONLY_THP_FOR_FS + MADV_COLLPASE might add a third,
> > and if it fails, it'd be nice to know which other option to fall back
> > to, depending on how badly the user wants THP backing. If THP
> > allocation fails, then likely the anon mremap(2) trick will fail too
> > (unless some reclaim/compaction is done).
> >
> > Less immediately concrete, but a userspace agent seeking to optimize
> > system-wide THP utilization surely wants to know when it's exhausted
> > its precious THP supply.
> >
> > So I'd like to see -EAGAIN reserved for THP allocation failure
> > (-ENOMEM is taken by AS errors, and it'd be nice to be consistent with
> > other modes here). I think -EBUSY for memcg charging makes sense, and
> > tells the caller something actionable and useful, so I'd like to see
> > it differentiated from -ENOMEM.
>
> OK, it makes some sense to differentiate from -ENOMEM. But I still
> don't see too much value to differentiate allocation failure vs
> charging failure. When charging is failed other tricks are unlikely to
> succeed either IMHO unless more aggressive reclaim is done.
>
> But GFP_TRANSHUGE is used by MADV_COLLAPSE, it means direct reclaim
> has been tried before returning failure for both allocation and
> charging.

For fallback measures / actions regarding what to do for a single
process / memcg, I agree. However, at least in the system-wide case
where we might be responsible for process_madvise(MADV_COLLAPSE)'ing
memory from memcg A and B, a memcg charge failure for A shouldn't
impact a decision to collapse memory for B - whereas a THP allocation
failure encountered when attempting collapse of A likely means the
same would happen if we tried to collapse memory for B.
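
To make the system-wide case concrete, such an agent could branch on
the proposed errnos along the lines below (hypothetical sketch -
collapse_range() is a made-up wrapper around
process_madvise(MADV_COLLAPSE), and the errno mapping is just what's
being proposed above, not settled uapi):

#include <sys/uio.h>

/* Made-up wrapper around process_madvise(MADV_COLLAPSE) for one range */
extern int collapse_range(const struct iovec *iov);

/*
 * Collapse a memcg's candidate ranges.  Assumes the proposed mapping:
 * -EAGAIN: THP allocation failed, -EBUSY: memcg charge failed,
 * -ENOMEM: range not mapped.
 */
static int collapse_memcg_ranges(const struct iovec *ranges, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		int err = collapse_range(&ranges[i]);

		if (err == -EAGAIN)
			return err;	/* out of THPs: stop collapsing system-wide */
		if (err == -EBUSY)
			return 0;	/* this memcg is at its limit: move on */
		/* 0, -ENOMEM, ...: only this range is affected, keep going */
	}
	return 0;
}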

> >
> > Would appreciate feedback from folks here before setting these in
> > stone and preventing the errno from ever being useful to callers.
> >
> > > >
> > > > > > +       default:
> > > > > > +               return -EINVAL;
> > > > > > +       }
> > > > > > +}
> > > > > > +
> > > > > > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > > > +                    unsigned long start, unsigned long end)
> > > > > > +{
> > > > > > +       struct collapse_control cc = {
> > > > > > +               .enforce_page_heuristics = false,
> > > > > > +               .enforce_thp_enabled = false,
> > > > > > +               .last_target_node = NUMA_NO_NODE,
> > > > > > +               .gfp = GFP_TRANSHUGE | __GFP_THISNODE,
> > > > > > +       };
> > > > > > +       struct mm_struct *mm = vma->vm_mm;
> > > > > > +       unsigned long hstart, hend, addr;
> > > > > > +       int thps = 0, last_fail = SCAN_FAIL;
> > > > > > +       bool mmap_locked = true;
> > > > > > +
> > > > > > +       BUG_ON(vma->vm_start > start);
> > > > > > +       BUG_ON(vma->vm_end < end);
> > > > > > +
> > > > > > +       *prev = vma;
> > > > > > +
> > > > > > +       /* TODO: Support file/shmem */
> > > > > > +       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > > > +               return -EINVAL;
> > > > > > +
> > > > > > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > > > > > +       hend = end & HPAGE_PMD_MASK;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Set VM_HUGEPAGE so that hugepage_vma_check() can pass even if
> > > > > > +        * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > > > > +        * Note that hugepage_vma_check() doesn't enforce that
> > > > > > +        * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > > > > +        * must be set (i.e. "never" mode)
> > > > > > +        */
> > > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> > > > >
> > > > > hugepage_vma_check() doesn't check vma size, so MADV_COLLAPSE may be
> > > > > running for a unsuitable vma, hugepage_vma_revalidate() called by
> > > > > khugepaged_scan_pmd() may find it out finally, but it is a huge waste
> > > > > of effort. So, it is better to check vma size upfront.
> > > >
> > > > This actually does check the vma size, but it's subtle. hstart and
> > > > hend are clamped to the first/last
> > > > hugepaged-aligned address covered by [start,end], which are themselves
> > > > contained in vma->vm_start/vma->vm_end, respectively. We then check
> > > > that addr = hstart < hend ; so if the main loop passes the first
> > > > check, we know that vma->vm_start <= addr and addr + HPAGE_PMD_SIZE <=
> > >
> > > Aha, yes, I overlooked that.
> > >
> > > > vma->vma_end. Agreed that we might be needlessly doing mmgrab() and
> > > > lru_add_drain() needlessly though.
> > >
> > > Yeah
> > >
> > > >
> > > > > BTW, my series moved the vma size check in hugepage_vma_check(), so if
> > > > > your series could be based on top of that, you get that for free.
> > > >
> > > > I'll try rebasing on top of your series, thank you!
> > >
> > > You don't have to do it right now. I don't know what series will be
> > > merged to mm tree first. Just a heads up.
> >
> > Thanks! Seems beneficial here though, so I'll do that and add a note
> > in the cover letter.
>
> Thank you so much. That would make my life easier :-)

Happy to see the cleanup! Thanks again for that.

> >
> > > >
> > > > > > +               return -EINVAL;
> > > > > > +
> > > > > > +       mmgrab(mm);
> > > > > > +       lru_add_drain();
> > > > > > +
> > > > > > +       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > > > > > +               int result = SCAN_FAIL;
> > > > > > +               bool retry = true;  /* Allow one retry per hugepage */
> > > > > > +retry:
> > > > > > +               if (!mmap_locked) {
> > > > > > +                       cond_resched();
> > > > > > +                       mmap_read_lock(mm);
> > > > "> > +                       mmap_locked = true;
> > > > > > +                       result = hugepage_vma_revalidate(mm, addr, &vma, &cc);
> > > > >
> > > > > How's about making hugepage_vma_revalidate() return SCAN_SUCCEED too?
> > > > > It seems more consistent.
> > > >
> > > > Ya, I didn't like this either.  I'll add this to "mm/khugepaged: pipe
> > > > enum scan_result codes back to callers"
> > > >
> > > > > > +                       if (result) {
> > > > > > +                               last_fail = result;
> > > > > > +                               goto out_nolock;
> > > > > > +                       }
> > > > > > +               }
> > > > > > +               mmap_assert_locked(mm);
> > > > > > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > > > > > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > > > > > +               if (!mmap_locked)
> > > > > > +                       *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > > > > +
> > > > > > +               switch (result) {
> > > > > > +               case SCAN_SUCCEED:
> > > > > > +               case SCAN_PMD_MAPPED:
> > > > > > +                       ++thps;
> > > > > > +                       break;
> > > > > > +               /* Whitelisted set of results where continuing OK */
> > > > > > +               case SCAN_PMD_NULL:
> > > > > > +               case SCAN_PTE_NON_PRESENT:
> > > > > > +               case SCAN_PTE_UFFD_WP:
> > > > > > +               case SCAN_PAGE_RO:
> > > > > > +               case SCAN_LACK_REFERENCED_PAGE:
> > > > > > +               case SCAN_PAGE_NULL:
> > > > > > +               case SCAN_PAGE_COUNT:
> > > > > > +               case SCAN_PAGE_LOCK:
> > > > > > +               case SCAN_PAGE_COMPOUND:
> > > > > > +                       last_fail = result;
> > > > > > +                       break;
> > > > > > +               case SCAN_PAGE_LRU:
> > > > > > +                       if (retry) {
> > > > > > +                               lru_add_drain_all();
> > > > > > +                               retry = false;
> > > > > > +                               goto retry;
> > > > >
> > > > > I'm not sure whether the retry logic is necessary or not, do you have
> > > > > any data about how retry improves the success rate? You could just
> > > > > replace lru_add_drain() to lru_add_drain_all() and remove the retry
> > > > > logic IMHO. I'd prefer to keep it simple at the moment personally.
> > > >
> > > > Transparently, I've only had success hitting this logic on small vms
> > > > under selftests.  That said, it does happen, and I can't imagine this
> > > > hurting, especially on larger systems + tasks using lots of mem.
> > > > Originally, I didn't plan to do this, but as things shook out and we
> > > > had SCAN_PAGE_LRU so readily available, it seemed like we got this for
> > > > free.
> > >
> > > "small vms" mean small virtual machines?
> > >
> > > When the logic is hit, does lru_add_drain_all() help to improve the
> > > success rate?
> >
> > Ya, I've been doing most dev/testing on small, 2 cpu virtual machines.
> > I've been mmap()ing a multi-hugepage sized region, then faulting it in
> > - presumably being preempted and rescheduled on another cpu while
> > iterating over the region and faulting. I had ran into the
> > lru_add_drain_all() vs lru_add_drain() during testing/dev since
> > *occasionally* (admittedly, not very often the IIRC the former helped
> > tests pass.
> >
> > That said, I set out to try and repro this semi-reliably, with little
> > success - I'm almost always finding the pages on the LRU. Still
> > playing around with this..
> >
> > > I don't mean this hurts anything. I'm just thinking about whether the
> > > extra complexity is worth it or not. And calling lru_add_drain_all()
> > > with holding mmap_lock might have some scalability issues since
> > > draining lru for all is not cheap.
> >
> > Good point. At the very least, it seems like we should unlock
> > mmap_lock before retry.
>
> You could, but it still sounds overkilling to me. All the extra
> complexity is just used to optimize for small sized machines which
> unlikely run with THP in real life TBH.

AFAIK, this *tries* to optimize for larger machines where
lru_add_drain_all() is more costly. Likewise, it tries to optimize for
larger processes that might be scheduled on multiple cpus.

This isn't likely to be in a particularly hot path - so I'm fine
reducing complexity. If / when I can gather more data at scale, we can
see if lru_add_drain_all() is too costly.

Also - again - thanks for taking time to review and help out here :)


> >
> >
> >
> >
> > >
> > >
> > > >
> > > > > > +                       }
> > > > > > +                       fallthrough;
> > > > > > +               default:
> > > > > > +                       last_fail = result;
> > > > > > +                       /* Other error, exit */
> > > > > > +                       goto out_maybelock;
> > > > > > +               }
> > > > > > +       }
> > > > > > +
> > > > > > +out_maybelock:
> > > > > > +       /* Caller expects us to hold mmap_lock on return */
> > > > > > +       if (!mmap_locked)
> > > > > > +               mmap_read_lock(mm);
> > > > > > +out_nolock:
> > > > > > +       mmap_assert_locked(mm);
> > > > > > +       mmdrop(mm);
> > > > > > +
> > > > > > +       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> > > > > > +                       : madvise_collapse_errno(last_fail);
> > > > > > +}
> > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > index 46feb62ce163..eccac2620226 100644
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> > > > > >         case MADV_FREE:
> > > > > >         case MADV_POPULATE_READ:
> > > > > >         case MADV_POPULATE_WRITE:
> > > > > > +       case MADV_COLLAPSE:
> > > > > >                 return 0;
> > > > > >         default:
> > > > > >                 /* be safe, default to 1. list exceptions explicitly */
> > > > > > @@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > > > >                 if (error)
> > > > > >                         goto out;
> > > > > >                 break;
> > > > > > +       case MADV_COLLAPSE:
> > > > > > +               return madvise_collapse(vma, prev, start, end);
> > > > > >         }
> > > > > >
> > > > > >         anon_name = anon_vma_name(vma);
> > > > > > @@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
> > > > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > >         case MADV_HUGEPAGE:
> > > > > >         case MADV_NOHUGEPAGE:
> > > > > > +       case MADV_COLLAPSE:
> > > > > >  #endif
> > > > > >         case MADV_DONTDUMP:
> > > > > >         case MADV_DODUMP:
> > > > > > @@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > > > >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > > > >   *             transparent huge pages so the existing pages will not be
> > > > > >   *             coalesced into THP and new pages will not be allocated as THP.
> > > > > > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > > > >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > > > >   *             from being included in its core dump.
> > > > > >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > > > > --
> > > > > > 2.36.1.255.ge46751e96f-goog
> > > > > >


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
  2022-06-06 18:25   ` Yang Shi
@ 2022-06-29 20:49   ` Peter Xu
  2022-06-30  1:15     ` Zach O'Keefe
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-06-29 20:49 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Rongwei Wang, SeongJae Park,
	Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 03, 2022 at 05:39:50PM -0700, Zach O'Keefe wrote:
> -static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> +/* Sleep for the first alloc fail, break the loop for the second fail */
> +static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
>  {
>  	if (IS_ERR(*hpage)) {
>  		if (!*wait)
> -			return false;
> +			return true;
>  
>  		*wait = false;
>  		*hpage = NULL;
>  		khugepaged_alloc_sleep();
> -	} else if (*hpage) {
> -		put_page(*hpage);
> -		*hpage = NULL;
>  	}
> -
> -	return true;
> +	return false;
>  }

One nitpick here:

It's weird to me to sleep in a function called XXX_should_sleep(); we'd
normally expect to sleep only if it returns true.

Meanwhile, would this be a good chance to unwrap this function while
we're at it and remove the "bool *" reference, which doesn't look
pretty?  Something like:

---8<---
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 16be62d493cd..807c10cd0816 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2235,9 +2235,6 @@ static void khugepaged_do_scan(void)
        lru_add_drain_all();
 
        while (progress < pages) {
-               if (!khugepaged_prealloc_page(&hpage, &wait))
-                       break;
-
                cond_resched();
 
                if (unlikely(kthread_should_stop() || try_to_freeze()))
@@ -2253,6 +2250,18 @@ static void khugepaged_do_scan(void)
                else
                        progress = pages;
                spin_unlock(&khugepaged_mm_lock);
+
+               if (IS_ERR(*hpage)) {
+                       /*
+                        * If fail to allocate the first time, try to sleep
+                        * for a while.  When hit again, cancel the scan.
+                        */
+                       if (!wait)
+                               break;
+                       wait = false;
+                       *hpage = NULL;
+                       khugepaged_alloc_sleep();
+               }
        }
---8<---

Would this look slightly better?

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-06-04  0:39 ` [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
  2022-06-06 20:50   ` Yang Shi
@ 2022-06-29 21:58   ` Peter Xu
  2022-06-30 20:14     ` Zach O'Keefe
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-06-29 21:58 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Rongwei Wang, SeongJae Park,
	Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Fri, Jun 03, 2022 at 05:39:53PM -0700, Zach O'Keefe wrote:
> The following code is duplicated in collapse_huge_page() and
> collapse_file():
> 
>         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> 
> 	new_page = khugepaged_alloc_page(hpage, gfp, node);
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>                 goto out;
>         }
> 
>         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
>                 result = SCAN_CGROUP_CHARGE_FAIL;
>                 goto out;
>         }
>         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> 
> Also, "node" is passed as an argument to both collapse_huge_page() and
> collapse_file() and obtained the same way, via
> khugepaged_find_target_node().
> 
> Move all this into a new helper, alloc_charge_hpage(), and remove the
> duplicate code from collapse_huge_page() and collapse_file().  Also,
> simplify khugepaged_alloc_page() by returning a bool indicating
> allocation success instead of a copy of the allocated struct page.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> 
> ---

[note: please remember to drop this "---" when you repost, since I think it
 could drop your sign-off when the patch is applied]

> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  2022-06-29 20:49   ` Peter Xu
@ 2022-06-30  1:15     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-30  1:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Rongwei Wang, SeongJae Park,
	Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jun 29 16:49, Peter Xu wrote:
> On Fri, Jun 03, 2022 at 05:39:50PM -0700, Zach O'Keefe wrote:
> > -static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> > +/* Sleep for the first alloc fail, break the loop for the second fail */
> > +static bool alloc_fail_should_sleep(struct page **hpage, bool *wait)
> >  {
> >  	if (IS_ERR(*hpage)) {
> >  		if (!*wait)
> > -			return false;
> > +			return true;
> >  
> >  		*wait = false;
> >  		*hpage = NULL;
> >  		khugepaged_alloc_sleep();
> > -	} else if (*hpage) {
> > -		put_page(*hpage);
> > -		*hpage = NULL;
> >  	}
> > -
> > -	return true;
> > +	return false;
> >  }
> 
> One nitpick here:
> 
> It's weird to me to sleep inside a function called XXX_should_sleep(); we'd
> normally expect the sleep to happen only if it returns true.
> 
> Meanwhile, wouldn't this also be a good chance to open-code this function and
> get rid of the "bool *" reference, which isn't pretty?  Something like:
> 
> ---8<---
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 16be62d493cd..807c10cd0816 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2235,9 +2235,6 @@ static void khugepaged_do_scan(void)
>         lru_add_drain_all();
>  
>         while (progress < pages) {
> -               if (!khugepaged_prealloc_page(&hpage, &wait))
> -                       break;
> -
>                 cond_resched();
>  
>                 if (unlikely(kthread_should_stop() || try_to_freeze()))
> @@ -2253,6 +2250,18 @@ static void khugepaged_do_scan(void)
>                 else
>                         progress = pages;
>                 spin_unlock(&khugepaged_mm_lock);
> +
> +               if (IS_ERR(*hpage)) {
> +                       /*
> +                        * If fail to allocate the first time, try to sleep
> +                        * for a while.  When hit again, cancel the scan.
> +                        */
> +                       if (!wait)
> +                               break;
> +                       wait = false;
> +                       *hpage = NULL;
> +                       khugepaged_alloc_sleep();
> +               }
>         }
> ---8<---
> 
> Would this look slightly better?

Hey Peter,

Thanks for taking the time to review. I think open coding this looks good. One
small detail: if we move this to the end of the loop, we'll still need to check
that progress < pages before sleeping - otherwise we risk doing both an alloc
sleep and a scan sleep.
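
Something like this is what I'm picturing (untested sketch, just reusing the
names from your diff above):

        /*
         * Only handle the alloc failure when another iteration is coming,
         * so we don't do this alloc sleep and then the regular scan sleep
         * back to back.
         */
        if (progress < pages && IS_ERR(*hpage)) {
                /* Second failure during this scan: cancel it. */
                if (!wait)
                        break;
                wait = false;
                *hpage = NULL;
                khugepaged_alloc_sleep();
        }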

But I'll let Yang make the call since it's his patch - he's just been kind
enough to donate it for this cause :)

Thanks,
Zach

> Thanks,
> 
> -- 
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled
       [not found]     ` <CAAa6QmRXD5KboM8=ZZRPThOmcLEPtxzf0XyjkCeY_vgR7VOPqg@mail.gmail.com>
@ 2022-06-30  2:32       ` Peter Xu
  2022-06-30 14:17         ` Zach O'Keefe
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-06-30  2:32 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jun 29, 2022 at 06:42:25PM -0700, Zach O'Keefe wrote:
> On Jun 29 19:21, Peter Xu wrote:
> > On Fri, Jun 03, 2022 at 05:39:57PM -0700, Zach O'Keefe wrote:
> > > Add enforce_thp_enabled flag to struct collapse_control that allows context
> > > to ignore constraints imposed by /sys/kernel/transparent_hugepage/enabled.
> > >
> > > This flag is set in khugepaged collapse context to preserve existing
> > > khugepaged behavior.
> > >
> > > This flag will be used (unset) when introducing madvise collapse
> > > context since the desired THP semantics of MADV_COLLAPSE aren't coupled
> > > to sysfs THP settings.  Most notably, for the purpose of eventual
> > > madvise_collapse(2) support, this allows userspace to trigger THP collapse
> > > on behalf of another process, without adding support to meddle with
> > > the VMA flags of said process, or change sysfs THP settings.
> > >
> > > For now, limit this flag to /sys/kernel/transparent_hugepage/enabled,
> > > but it can be expanded to include
> > > /sys/kernel/transparent_hugepage/shmem_enabled later.
> > >
> > > Link: https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  mm/khugepaged.c | 34 +++++++++++++++++++++++++++-------
> > >  1 file changed, 27 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index c3589b3e238d..4ad04f552347 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -94,6 +94,11 @@ struct collapse_control {
> > >      */
> > >     bool enforce_page_heuristics;
> > >
> > > +   /* Enforce constraints of
> > > +    * /sys/kernel/mm/transparent_hugepage/enabled
> > > +    */
> > > +   bool enforce_thp_enabled;
> >
> > Small nitpick that we could have merged the two booleans if they always
> > match, but no strong opinions if you think these two are clearer.  Or maybe
> > there's other plan of using them?
> >
> > > +
> > >     /* Num pages scanned per node */
> > >     int node_load[MAX_NUMNODES];
> > >
> > > @@ -893,10 +898,12 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > >   */
> > >
> > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > -           struct vm_area_struct **vmap)
> > > +                              struct vm_area_struct **vmap,
> > > +                              struct collapse_control *cc)
> > >  {
> > >     struct vm_area_struct *vma;
> > >     unsigned long hstart, hend;
> > > +   unsigned long vma_flags;
> > >
> > >     if (unlikely(khugepaged_test_exit(mm)))
> > >             return SCAN_ANY_PROCESS;
> > > @@ -909,7 +916,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >     hend = vma->vm_end & HPAGE_PMD_MASK;
> > >     if (address < hstart || address + HPAGE_PMD_SIZE > hend)
> > >             return SCAN_ADDRESS_RANGE;
> > > -   if (!hugepage_vma_check(vma, vma->vm_flags))
> > > +
> > > +   /*
> > > +    * If !cc->enforce_thp_enabled, set VM_HUGEPAGE so that
> > > +    * hugepage_vma_check() can pass even if
> > > +    * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > +    * Note that hugepage_vma_check() doesn't enforce that
> > > +    * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > +    * must be set (i.e. "never" mode).
> > > +    */
> > > +   vma_flags = cc->enforce_thp_enabled ?  vma->vm_flags
> > > +                   : vma->vm_flags | VM_HUGEPAGE;
> >
> > Another nitpick..
> >
> > We could get a weird vm_flags when VM_NOHUGEPAGE is set.  I don't think
> > it'll go wrong since hugepage_vma_check() checks NOHUGEPAGE first, but IMHO
> > we shouldn't rely on that as it seems error prone (e.g. when accidentally
> > moved things around).
> >
> > So maybe nicer to only apply VM_HUGEPAGE if !VM_NOHUGEPAGE?  Or pass over
> > "enforce_thp_enabled" into hugepage_vma_check() should work too, iiuc.
> > Passing in the boolean has one benefit that we don't really need the
> > complicated comment above since the code should be able to explain itself.
> 
> Hey Peter, thanks again for taking the time to review.
> 
> Answering both of the above at the same time:
> 
> As in this series so far, I've tried to keep context functionally-declarative -
> specifying the intended behavior (e.g. "enforce_page_heuristics") rather than
> adding "if (khugepaged) { .. } else if (madv_collapse) { .. } else if { .. }"
> around the code which, IMO, makes it difficult to follow. Unfortunately, I've
> run into the 2 problems you've stated here:
> 
> 1) *Right now* all the behavior knobs are either off/on at the same time
> 2) For hugepage_vma_check() (now in mm/huge_memory.c and acting as the central
>    authority on THP eligibility), things are complicated enough that I
>    couldn't find a clean way to describe the parameters of the context without
>    explicitly mentioning the caller.
> 
> For (2), instead of adding another arg to specify MADV_COLLAPSE's behavior,
> I think we need to package these contexts into a single set of flags:
> 
> enum thp_ctx_flags {
>         THP_CTX_ANON_FAULT              = 1 << 1,
>         THP_CTX_KHUGEPAGED              = 1 << 2,
>         THP_CTX_SMAPS                   = 1 << 3,
>         THP_CTX_MADVISE_COLLAPSE        = 1 << 4,
> };
> 
> That will avoid hacking vma flags passed to hugepage_vma_check().
> 
> And, if we have these anyways, I might as well do away with some of the
> (semantically meaningful but functionally redundant) flags in
> struct collapse_control and just specify a single .thp_ctx_flags member. I'm
> not entirely happy with it - but that's what I'm planning.
> 
> WDYT?

Firstly, I think I wrongly sent the previous email privately.. :( Let me try
to add the list back..

IMHO we don't need to worry too much about the "if... else if... else",
because it shouldn't be any more complicated than spreading the meanings
across multiple flags - how could it be? :) IMHO it should literally be as
simple as applying:

  s/enforce_{A|B|C|...}/khugepaged_initiated/g

Throughout the patches, then we squash the patches introducing enforce_X.

If you worry it's not clear what "khugepaged_initiated" means, we could add
whatever comment we like above the variable explaining that A/B/C/D are
covered when it is set, and we could postpone the flag split until there's a
real user.

Adding these flags could also add unnecessary bit-and instructions to the
generated code, and if it's only a readability issue, isn't that really what a
comment is for?
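
E.g. something along these lines (completely untested, and the field name is
only for illustration), where the comment carries what the separate enforce_*
booleans spell out today:

struct collapse_control {
        /*
         * Was this collapse initiated by khugepaged?  When set, enforce
         * everything the current enforce_page_heuristics and
         * enforce_thp_enabled booleans guard; MADV_COLLAPSE leaves it
         * clear.
         */
        bool khugepaged_initiated;

        /* Num pages scanned per node */
        int node_load[MAX_NUMNODES];

        /* ... rest of collapse_control as-is ... */
};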

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled
  2022-06-30  2:32       ` Peter Xu
@ 2022-06-30 14:17         ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-30 14:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Rongwei Wang, SeongJae Park,
	Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jun 29 22:32, Peter Xu wrote:
> On Wed, Jun 29, 2022 at 06:42:25PM -0700, Zach O'Keefe wrote:
> > On Jun 29 19:21, Peter Xu wrote:
> > > On Fri, Jun 03, 2022 at 05:39:57PM -0700, Zach O'Keefe wrote:
> > > > Add enforce_thp_enabled flag to struct collapse_control that allows context
> > > > to ignore constraints imposed by /sys/kernel/transparent_hugepage/enabled.
> > > >
> > > > This flag is set in khugepaged collapse context to preserve existing
> > > > khugepaged behavior.
> > > >
> > > > This flag will be used (unset) when introducing madvise collapse
> > > > context since the desired THP semantics of MADV_COLLAPSE aren't coupled
> > > > to sysfs THP settings.  Most notably, for the purpose of eventual
> > > > madvise_collapse(2) support, this allows userspace to trigger THP collapse
> > > > on behalf of another process, without adding support to meddle with
> > > > the VMA flags of said process, or change sysfs THP settings.
> > > >
> > > > For now, limit this flag to /sys/kernel/transparent_hugepage/enabled,
> > > > but it can be expanded to include
> > > > /sys/kernel/transparent_hugepage/shmem_enabled later.
> > > >
> > > > Link: https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  mm/khugepaged.c | 34 +++++++++++++++++++++++++++-------
> > > >  1 file changed, 27 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index c3589b3e238d..4ad04f552347 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -94,6 +94,11 @@ struct collapse_control {
> > > >      */
> > > >     bool enforce_page_heuristics;
> > > >
> > > > +   /* Enforce constraints of
> > > > +    * /sys/kernel/mm/transparent_hugepage/enabled
> > > > +    */
> > > > +   bool enforce_thp_enabled;
> > >
> > > Small nitpick that we could have merged the two booleans if they always
> > > match, but no strong opinions if you think these two are clearer.  Or maybe
> > > there's other plan of using them?
> > >
> > > > +
> > > >     /* Num pages scanned per node */
> > > >     int node_load[MAX_NUMNODES];
> > > >
> > > > @@ -893,10 +898,12 @@ static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > >   */
> > > >
> > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > -           struct vm_area_struct **vmap)
> > > > +                              struct vm_area_struct **vmap,
> > > > +                              struct collapse_control *cc)
> > > >  {
> > > >     struct vm_area_struct *vma;
> > > >     unsigned long hstart, hend;
> > > > +   unsigned long vma_flags;
> > > >
> > > >     if (unlikely(khugepaged_test_exit(mm)))
> > > >             return SCAN_ANY_PROCESS;
> > > > @@ -909,7 +916,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > >     hend = vma->vm_end & HPAGE_PMD_MASK;
> > > >     if (address < hstart || address + HPAGE_PMD_SIZE > hend)
> > > >             return SCAN_ADDRESS_RANGE;
> > > > -   if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > +
> > > > +   /*
> > > > +    * If !cc->enforce_thp_enabled, set VM_HUGEPAGE so that
> > > > +    * hugepage_vma_check() can pass even if
> > > > +    * TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG is set (i.e. "madvise" mode).
> > > > +    * Note that hugepage_vma_check() doesn't enforce that
> > > > +    * TRANSPARENT_HUGEPAGE_FLAG or TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG
> > > > +    * must be set (i.e. "never" mode).
> > > > +    */
> > > > +   vma_flags = cc->enforce_thp_enabled ?  vma->vm_flags
> > > > +                   : vma->vm_flags | VM_HUGEPAGE;
> > >
> > > Another nitpick..
> > >
> > > We could get a weird vm_flags when VM_NOHUGEPAGE is set.  I don't think
> > > it'll go wrong since hugepage_vma_check() checks NOHUGEPAGE first, but IMHO
> > > we shouldn't rely on that as it seems error prone (e.g. when accidentally
> > > moved things around).
> > >
> > > So maybe nicer to only apply VM_HUGEPAGE if !VM_NOHUGEPAGE?  Or pass over
> > > "enforce_thp_enabled" into hugepage_vma_check() should work too, iiuc.
> > > Passing in the boolean has one benefit that we don't really need the
> > > complicated comment above since the code should be able to explain itself.
> > 
> > Hey Peter, thanks again for taking the time to review.
> > 
> > Answering both of the above at the same time:
> > 
> > As in this series so far, I've tried to keep context functionally-declarative -
> > specifying the intended behavior (e.g. "enforce_page_heuristics") rather than
> > adding "if (khugepaged) { .. } else if (madv_collapse) { .. } else if { .. }"
> > around the code which, IMO, makes it difficult to follow. Unfortunately, I've
> > run into the 2 problems you've stated here:
> > 
> > 1) *Right now* all the behavior knobs are either off/on at the same time
> > 2) For hugepage_vma_check() (now in mm/huge_memory.c and acting as the central
> >    authority on THP eligibility), things are complicated enough that I
> >    couldn't find a clean way to describe the parameters of the context without
> >    explicitly mentioning the caller.
> > 
> > For (2), instead of adding another arg to specify MADV_COLLAPSE's behavior,
> > I think we need to package these contexts into a single set of flags:
> > 
> > enum thp_ctx_flags {
> >         THP_CTX_ANON_FAULT              = 1 << 1,
> >         THP_CTX_KHUGEPAGED              = 1 << 2,
> >         THP_CTX_SMAPS                   = 1 << 3,
> >         THP_CTX_MADVISE_COLLAPSE        = 1 << 4,
> > };
> > 
> > That will avoid hacking vma flags passed to hugepage_vma_check().
> > 
> > And, if we have these anyways, I might as well do away with some of the
> > (semantically meaningful but functionally redundant) flags in
> > struct collapse_control and just specify a single .thp_ctx_flags member. I'm
> > not entirely happy with it - but that's what I'm planning.
> > 
> > WDYT?
> 
> Firstly, I think I wrongly sent the previous email privately.. :( Let me try
> to add the list back..
> 
> IMHO we don't need to worry too much about the "if... else if... else",
> because it shouldn't be any more complicated than spreading the meanings
> across multiple flags - how could it be? :) IMHO it should literally be as
> simple as applying:
> 
>   s/enforce_{A|B|C|...}/khugepaged_initiated/g
> 
> Throughout the patches, then we squash the patches introducing enforce_X.

Right, the code today will be virtually identical. The attempt to describe
contexts in terms of behaviors is based on an unfounded assumption that
successive contexts could reuse said behaviors - but there are currently no
plans for other collapsing contexts.

> If you worry it's not clear what "khugepaged_initiated" means, we could add
> whatever comment we like above the variable explaining that A/B/C/D are
> covered when it is set, and we could postpone the flag split until there's a
> real user.
> 
> Adding these flags could also add unnecessary bit-and instructions to the
> generated code, and if it's only a readability issue, isn't that really what
> a comment is for?

Ya maybe I'm overthinking it and will just do the most straightforward thing for
now.

As always, thanks for taking the time to review / offer suggestions!

Best,
Zach

> Thanks,
> 
> -- 
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-06-29 21:58   ` Peter Xu
@ 2022-06-30 20:14     ` Zach O'Keefe
  0 siblings, 0 replies; 63+ messages in thread
From: Zach O'Keefe @ 2022-06-30 20:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Rongwei Wang, SeongJae Park,
	Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Jun 29 17:58, Peter Xu wrote:
> On Fri, Jun 03, 2022 at 05:39:53PM -0700, Zach O'Keefe wrote:
> > The following code is duplicated in collapse_huge_page() and
> > collapse_file():
> > 
> >         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > 
> > 	new_page = khugepaged_alloc_page(hpage, gfp, node);
> >         if (!new_page) {
> >                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> >                 goto out;
> >         }
> > 
> >         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
> >                 result = SCAN_CGROUP_CHARGE_FAIL;
> >                 goto out;
> >         }
> >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > 
> > Also, "node" is passed as an argument to both collapse_huge_page() and
> > collapse_file() and obtained the same way, via
> > khugepaged_find_target_node().
> > 
> > Move all this into a new helper, alloc_charge_hpage(), and remove the
> > duplicate code from collapse_huge_page() and collapse_file().  Also,
> > simplify khugepaged_alloc_page() by returning a bool indicating
> > allocation success instead of a copy of the allocated struct page.
> > 
> > Suggested-by: Peter Xu <peterx@redhat.com>
> > 
> > ---
> 
> [note: please remember to drop this "---" when you repost, since I think it
>  could drop your sign-off when the patch is applied]
>

Thanks for catching this, Peter! Fixed locally!

Best,
Zach

> > 
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Thanks,
> 
> -- 
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2022-06-30 20:14 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-04  0:39 [PATCH v6 00/15] mm: userspace hugepage collapse Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 01/15] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA Zach O'Keefe
2022-06-06 18:25   ` Yang Shi
2022-06-29 20:49   ` Peter Xu
2022-06-30  1:15     ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 02/15] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-06-06 20:45   ` Yang Shi
2022-06-07 16:01     ` Zach O'Keefe
2022-06-07 19:32       ` Zach O'Keefe
2022-06-07 21:27         ` Yang Shi
2022-06-08  0:27           ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 03/15] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-06-06  2:41   ` kernel test robot
2022-06-06 16:40     ` Zach O'Keefe
2022-06-06 16:40       ` Zach O'Keefe
2022-06-06 20:20       ` Yang Shi
2022-06-06 20:20         ` Yang Shi
2022-06-06 21:22         ` Yang Shi
2022-06-06 21:22           ` Yang Shi
2022-06-06 22:23       ` Andrew Morton
2022-06-06 22:23         ` Andrew Morton
2022-06-06 23:53         ` Yang Shi
2022-06-06 23:53           ` Yang Shi
2022-06-08  0:42           ` Zach O'Keefe
2022-06-08  0:42             ` Zach O'Keefe
2022-06-08  1:00             ` Yang Shi
2022-06-08  1:00               ` Yang Shi
2022-06-08  1:06               ` Zach O'Keefe
2022-06-08  1:06                 ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 04/15] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
2022-06-06 20:50   ` Yang Shi
2022-06-29 21:58   ` Peter Xu
2022-06-30 20:14     ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 05/15] mm/khugepaged: make allocation semantics context-specific Zach O'Keefe
2022-06-06 20:58   ` Yang Shi
2022-06-07 19:56     ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 06/15] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
2022-06-06 22:39   ` Yang Shi
2022-06-07  0:17     ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 07/15] mm/khugepaged: add flag to ignore khugepaged heuristics Zach O'Keefe
2022-06-06 22:51   ` Yang Shi
2022-06-04  0:39 ` [PATCH v6 08/15] mm/khugepaged: add flag to ignore THP sysfs enabled Zach O'Keefe
2022-06-06 23:02   ` Yang Shi
     [not found]   ` <YrzehlUoo2iMMLC2@xz-m1.local>
     [not found]     ` <CAAa6QmRXD5KboM8=ZZRPThOmcLEPtxzf0XyjkCeY_vgR7VOPqg@mail.gmail.com>
2022-06-30  2:32       ` Peter Xu
2022-06-30 14:17         ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 09/15] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-06-06 23:53   ` Yang Shi
2022-06-07 22:48     ` Zach O'Keefe
2022-06-08  0:39       ` Yang Shi
2022-06-09 17:35         ` Zach O'Keefe
2022-06-09 18:51           ` Yang Shi
2022-06-10 14:51             ` Zach O'Keefe
2022-06-04  0:39 ` [PATCH v6 10/15] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
2022-06-06 23:56   ` Yang Shi
2022-06-07  0:31     ` Zach O'Keefe
2022-06-04  0:40 ` [PATCH v6 11/15] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-06-07 19:14   ` Yang Shi
2022-06-04  0:40 ` [PATCH v6 12/15] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-06-04  0:40 ` [PATCH v6 13/15] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-06-04  0:40 ` [PATCH v6 14/15] selftests/vm: add selftest to verify recollapse of THPs Zach O'Keefe
2022-06-04  0:40 ` [PATCH v6 15/15] tools headers uapi: add MADV_COLLAPSE madvise mode to tools Zach O'Keefe
2022-06-06 23:58   ` Yang Shi
2022-06-07  0:24     ` Zach O'Keefe
