* [PATCH v5 00/12] mm: userspace hugepage collapse
@ 2022-05-04 21:44 Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
                   ` (12 more replies)
  0 siblings, 13 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	The system-wide THP sysfs settings as well as the VMA flags of the
	memory range being collapsed are both respected for the purposes
	of determining THP eligibility.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	of each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
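
	Similarly, a self-contained sketch of an in-process collapse (again,
	not part of the series; assumes 2MiB PMD-sized THPs and the same
	MADV_COLLAPSE fallback as above):

		#include <stdio.h>
		#include <stdlib.h>
		#include <string.h>
		#include <sys/mman.h>

		#ifndef MADV_COLLAPSE
		#define MADV_COLLAPSE 25		/* assumed asm-generic value */
		#endif

		#define HPAGE_SIZE (2UL << 20)		/* assumed PMD size */

		int main(void)
		{
			void *p;

			/* Hugepage-aligned, hugepage-sized anonymous buffer. */
			if (posix_memalign(&p, HPAGE_SIZE, HPAGE_SIZE))
				return 1;
			memset(p, 1, HPAGE_SIZE);	/* fault in native pages */

			/*
			 * 0 on "success": every covered hugepage-sized/aligned
			 * region was collapsed or was already pmd-mapped.
			 */
			if (madvise(p, HPAGE_SIZE, MADV_COLLAPSE))
				perror("madvise(MADV_COLLAPSE)");
			return 0;
		}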

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which might impair services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, both of which increase steady-state
	memory footprint.  With MADV_COLLAPSE, we get the best of both
	worlds: peak upfront performance and a lower RAM footprint.  Note
	that subsequent support for file-backed memory is required here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later, when
	the memory is hot, the implementation could madvise(MADV_COLLAPSE)
	to re-back the memory with THPs to regain hugepage coverage and dTLB
	performance.  TCMalloc is such an implementation that could benefit
	from this[3].  A prior study of Google-internal workloads during
	evaluation of Temeraire, a hugepage-aware enhancement to TCMalloc,
	showed that nearly 20% of all CPU cycles were spent in dTLB stalls,
	and that increasing hugepage coverage by even a small amount can help
	with that[4].
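
	A sketch of the lifecycle described above (helper names are
	hypothetical; MADV_COLLAPSE availability is assumed as in the
	Interface section):

		#include <stddef.h>
		#include <sys/mman.h>

		#ifndef MADV_COLLAPSE
		#define MADV_COLLAPSE 25		/* assumed asm-generic value */
		#endif

		/*
		 * Cold: return a native-sized run inside a hugepage-sized
		 * chunk to the kernel.  This zaps the pmd, losing hugepage
		 * coverage for the whole chunk.
		 */
		static void subrelease(char *chunk, size_t off, size_t len)
		{
			madvise(chunk + off, len, MADV_DONTNEED);
		}

		/*
		 * Hot again: re-back the whole chunk with a THP to recover
		 * hugepage coverage and dTLB reach (hpage_size is the PMD
		 * size, e.g. 2MiB on x86_64).
		 */
		static void rehugify(char *chunk, size_t hpage_size)
		{
			madvise(chunk, hpage_size, MADV_COLLAPSE);
		}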

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[5].

Sequence of Patches
--------------------------------

Patches 1-7 perform refactoring of collapse logic within khugepaged.c,
introduce the notion of a collapse context, and add needed hooks to
specialize per-collapse context functionality.

Patches 8-10 introduce MADV_COLLAPSE, do some renaming, and add
process_madvise(2) support.

Patches 11-13 add selftests to test the new functionality.

Applies against next-20220502.

Changelog
--------------------------------

v4 -> v5:
* Fix kernel test robot <lkp@intel.com> errors
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fix khugepaged_alloc_page() UMA definition
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Add "fallthrough" pseudo keyword to fix -Wimplicit-fallthrough

v3 -> v4:
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Dropped pmd_none() check from find_pmd_or_thp_or_none()
  -> Moved SCAN_PMD_MAPPED after SCAN_PMD_NULL
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: add struct collapse_control'
  -> Updated commit description and some code comments
  -> Removed extra brackets added in khugepaged_find_target_node()
* Added 'mm/khugepaged: dedup hugepage allocation and charging code'
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Has been majorly reworked to replace ->gfp() and ->alloc_hpage()
     struct collapse_control hooks with a ->alloc_charge_hpage() hook
     which makes node-allocation, gfp flags, node scheduling, hpage
     allocation, and accounting/charging context-specific.
  -> Dropped <lkp@intel.com> from sign-offs
* Added 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Replaces 'mm/khugepaged: add struct collapse_result'
* Dropped 'mm/khugepaged: add struct collapse_result'
* 'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/khugepaged: add flag to ignore page young/referenced requirement'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Moved struct collapse_control* argument to end of alloc_hpage()
  -> Some refactoring to rebase on top changes to struct
     collapse_control hook changes and other previous commits.
  -> Reworded commit description
  -> Dropped <lkp@intel.com> from sign-offs
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Renamed from 'mm/khugepaged: remove khugepaged prefix from shared
     collapse functions'
  -> Instead of dropping "khugepaged_" prefix, replace with
     "hpage_collapse_"
  -> Dropped <lkp@intel.com> from sign-offs
* Rebased onto next-20220502

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     revalidation.
  -> Reworded patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse
  functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from the RFC, and most patches are
  either entirely new or substantially revised.
* Collapse eligibility criteria has changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  now match that of khugepaged for the same VMA, instead of the gfp flags used
  at-fault for calling process for the VMA.
* Collapse semantics have changed: The collapse semantics for multiple VMAs
  spanning a single MADV_COLLAPSE call are now independent, whereas before the
  idea was to allow direct reclaim/compaction if any spanned VMA permitted so.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of hugepages
  in a batched-fashion with the aim of doing multiple page table updates inside
  a single mmap_lock write.  This has been changed, and the implementation now
  collapses each hugepage-aligned/sized region iteratively.  This was motivated
  by an experiment which showed that, when multiple threads were concurrently
  faulting during a MADV_COLLAPSE operation, mean and tail latency to acquire
  mmap_lock in read for threads in the fault path was improved by using a batch
  size of 1 (batch sizes of 1, 8, 16, 32 were tested)[6].
* Added: If a collapse operation fails because a page isn't found on the LRU, do
  a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/
[3] https://github.com/google/tcmalloc/tree/master/tcmalloc
[4] https://research.google/pubs/pub50370/
[5] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[6] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/

Zach O'Keefe (13):
  mm/khugepaged: separate hugepage preallocation and cleanup
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: make hugepage allocation context-specific
  mm/khugepaged: add struct collapse_result
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: remove khugepaged prefix from shared collapse functions
  mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  mm/khugepaged: add flag to ignore page young/referenced requirement
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add test to verify recollapse of THPs

 include/linux/huge_mm.h                 |  12 +
 include/trace/events/huge_memory.h      |   5 +-
 include/uapi/asm-generic/mman-common.h  |   2 +
 mm/internal.h                           |   1 +
 mm/khugepaged.c                         | 598 ++++++++++++++++--------
 mm/madvise.c                            |  11 +-
 mm/rmap.c                               |  15 +-
 tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
 8 files changed, 702 insertions(+), 359 deletions(-)

-- 
2.35.1.1178.g4f1659d476-goog




* [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-18 18:41   ` Peter Xu
  2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE which necessarily needs to
know what hugepage-aligned/sized regions are already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/internal.h                      |  1 +
 mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
 mm/rmap.c                          | 15 +++++++++++++--
 4 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..55392bf30a03 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -11,6 +11,7 @@
 	EM( SCAN_FAIL,			"failed")			\
 	EM( SCAN_SUCCEED,		"succeeded")			\
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
+	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
diff --git a/mm/internal.h b/mm/internal.h
index 0667abd57634..51ae9f71a2a3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -172,6 +172,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eb444fd45568..2c2ed6b4d96c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -28,6 +28,7 @@ enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
 	SCAN_PMD_NULL,
+	SCAN_PMD_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -977,6 +978,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd_raw(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1228,11 +1252,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 94d6b24a1ac2..6980b4011bf8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
-- 
2.36.0.464.gb9c8b46e94-goog




* [PATCH v5 02/13] mm/khugepaged: add struct collapse_control
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:02   ` David Rientjes
  2022-05-18 20:03   ` Peter Xu
  2022-05-04 21:44 ` [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize hugepage collapse by introducing struct collapse_control.
This structure describes the properties of the requested collapse, as
well as serving as a local scratch pad to use during the collapse
itself.

Start by moving the global per-node khugepaged statistics into this
new structure, and stack-allocate one for the khugepaged collapse
context.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 87 ++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2c2ed6b4d96c..d3cb670921cd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() */
+	int last_target_node;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -786,9 +794,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -800,11 +806,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -819,28 +825,27 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++)
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 
@@ -878,7 +883,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1235,10 +1240,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct page **hpage)
+static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1256,7 +1260,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1322,16 +1326,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1382,7 +1386,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
 				referenced, unmapped);
@@ -2033,8 +2037,9 @@ static void collapse_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2045,7 +2050,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2070,11 +2075,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -2107,7 +2112,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2115,8 +2120,9 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	/* TODO: tracepoints */
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				 pgoff_t start, struct page **hpage,
+				 struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2127,7 +2133,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2203,12 +2210,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage,
+						     cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2264,7 +2272,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2288,7 +2296,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2327,12 +2335,15 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+	};
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&cc);
 		khugepaged_wait_work();
 	}
 
-- 
2.36.0.464.gb9c8b46e94-goog




* [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:02   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following code is duplicated in collapse_huge_page() and
collapse_file():

         /* Only allocate from the target node */
        gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

        new_page = khugepaged_alloc_page(hpage, gfp, node);
        if (!new_page) {
                result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                goto out;
        }

        if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                result = SCAN_CGROUP_CHARGE_FAIL;
                goto out;
        }
        count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

Also, "node" is passed as an argument to both collapse_huge_page() and
collapse_file() and obtained the same way, via
khugepaged_find_target_node().

Move all this into a new helper, alloc_charge_hpage(), and remove the
duplicate code from collapse_huge_page() and collapse_file().  Also,
simplify khugepaged_alloc_page() by returning a bool indicating
allocation success instead of a copy of the (possibly) allocated
struct page.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>

---

This patch depends on 'mm/khugepaged: sched to numa node when collapse
huge page', currently being discussed upstream[1], and anticipates that
this functionality would be equally applicable to file-backed collapse.
It also wraps this code in a CONFIG_NUMA #ifdef.

[1] https://lore.kernel.org/linux-mm/20220317065024.2635069-1-maobibo@loongson.cn/

 mm/khugepaged.c | 99 +++++++++++++++++++++++--------------------------
 1 file changed, 46 insertions(+), 53 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3cb670921cd..c94bc43dff3e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -866,8 +866,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	VM_BUG_ON_PAGE(*hpage, *hpage);
 
@@ -875,12 +874,12 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
-		return NULL;
+		return false;
 	}
 
 	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	return true;
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -942,12 +941,11 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 {
 	VM_BUG_ON(!*hpage);
 
-	return  *hpage;
+	return true;
 }
 #endif
 
@@ -1069,10 +1067,34 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
+			      struct collapse_control *cc)
+{
+#ifdef CONFIG_NUMA
+	const struct cpumask *cpumask;
+#endif
+	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	int node = khugepaged_find_target_node(cc);
+
+#ifdef CONFIG_NUMA
+	/* sched to specified node before huge page memory copy */
+	if (task_node(current) != node) {
+		cpumask = cpumask_of_node(node);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(current, cpumask);
+	}
+#endif
+	if (!khugepaged_alloc_page(hpage, gfp, node))
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
+		return SCAN_CGROUP_CHARGE_FAIL;
+	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
+	return SCAN_SUCCEED;
+}
+
+static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			       struct page **hpage, int referenced,
+			       int unmapped, struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1083,14 +1105,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated = 0, result = 0;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
-	gfp_t gfp;
-	const struct cpumask *cpumask;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
@@ -1099,23 +1116,11 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	mmap_read_unlock(mm);
 
-	/* sched to specified node before huage page memory copy */
-	if (task_node(current) != node) {
-		cpumask = cpumask_of_node(node);
-		if (!cpumask_empty(cpumask))
-			set_cpus_allowed_ptr(current, cpumask);
-	}
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out_nolock;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out_nolock;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+	new_page = *hpage;
 
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
@@ -1386,10 +1391,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		collapse_huge_page(mm, address, hpage, referenced, unmapped,
+				   cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1655,7 +1659,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @file: file that collapse on
  * @start: collapse start address
  * @hpage: new allocated huge page for collapse
- * @node: appointed node the new huge page allocate from
+ * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1672,12 +1676,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start,
-		struct page **hpage, int node)
+static void collapse_file(struct mm_struct *mm, struct file *file,
+			  pgoff_t start, struct page **hpage,
+			  struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
-	gfp_t gfp;
 	struct page *new_page;
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
@@ -1689,20 +1692,11 @@ static void collapse_file(struct mm_struct *mm,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+	result = alloc_charge_hpage(hpage, mm, cc);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+	new_page = *hpage;
 
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
@@ -2112,8 +2106,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
-			collapse_file(mm, file, start, hpage, node);
+			collapse_file(mm, file, start, hpage, cc);
 		}
 	}
 
-- 
2.36.0.464.gb9c8b46e94-goog




* [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:02   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a hook to struct collapse_control that allows contexts to define
their own allocation semantics and charging logic.  For example,
khugepaged has specific NUMA and UMA implementations as well as gfp
flags tied to /sys/kernel/mm/transparent_hugepage/khugepaged/defrag.

Additionally, move the [pre]allocated hugepage pointer into
struct collapse_control.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 90 ++++++++++++++++++++++++-------------------------
 1 file changed, 44 insertions(+), 46 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c94bc43dff3e..6095fcb3f07c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,6 +92,10 @@ struct collapse_control {
 
 	/* Last target selected in khugepaged_find_target_node() */
 	int last_target_node;
+
+	struct page *hpage;
+	int (*alloc_charge_hpage)(struct mm_struct *mm,
+				  struct collapse_control *cc);
 };
 
 /**
@@ -866,18 +870,19 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(gfp_t gfp, int node,
+				  struct collapse_control *cc)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
 
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!*hpage)) {
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		*hpage = ERR_PTR(-ENOMEM);
+		cc->hpage = ERR_PTR(-ENOMEM);
 		return false;
 	}
 
-	prep_transhuge_page(*hpage);
+	prep_transhuge_page(cc->hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return true;
 }
@@ -941,9 +946,10 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool khugepaged_alloc_page(gfp_t gfp, int node,
+				  struct collapse_control *cc)
 {
-	VM_BUG_ON(!*hpage);
+	VM_BUG_ON(!cc->hpage);
 
 	return true;
 }
@@ -1067,8 +1073,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
-			      struct collapse_control *cc)
+static int alloc_charge_hpage(struct mm_struct *mm, struct collapse_control *cc)
 {
 #ifdef CONFIG_NUMA
 	const struct cpumask *cpumask;
@@ -1084,17 +1089,17 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			set_cpus_allowed_ptr(current, cpumask);
 	}
 #endif
-	if (!khugepaged_alloc_page(hpage, gfp, node))
+	if (!khugepaged_alloc_page(gfp, node, cc))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
-	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
+	if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm, gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
-	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
+	count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
 	return SCAN_SUCCEED;
 }
 
 static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			       struct page **hpage, int referenced,
-			       int unmapped, struct collapse_control *cc)
+			       int referenced, int unmapped,
+			       struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1116,11 +1121,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = cc->alloc_charge_hpage(mm, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
-	new_page = *hpage;
+	new_page = cc->hpage;
 
 	mmap_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
@@ -1232,21 +1237,21 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 
-	*hpage = NULL;
+	cc->hpage = NULL;
 
 	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, struct page **hpage,
+			       unsigned long address,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1392,8 +1397,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, referenced, unmapped,
-				   cc);
+		collapse_huge_page(mm, address, referenced, unmapped, cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1658,7 +1662,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @mm: process address space where collapse happens
  * @file: file that collapse on
  * @start: collapse start address
- * @hpage: new allocated huge page for collapse
  * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
@@ -1677,8 +1680,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + unlock and free huge page;
  */
 static void collapse_file(struct mm_struct *mm, struct file *file,
-			  pgoff_t start, struct page **hpage,
-			  struct collapse_control *cc)
+			  pgoff_t start, struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *new_page;
@@ -1692,11 +1694,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_hpage(hpage, mm, cc);
+	result = cc->alloc_charge_hpage(mm, cc);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	new_page = *hpage;
+	new_page = cc->hpage;
 
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
@@ -1979,7 +1981,7 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
 		retract_page_tables(mapping, start);
-		*hpage = NULL;
+		cc->hpage = NULL;
 
 		khugepaged_pages_collapsed++;
 	} else {
@@ -2026,14 +2028,13 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	/* TODO: tracepoints */
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+				 pgoff_t start, struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2106,7 +2107,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, hpage, cc);
+			collapse_file(mm, file, start, cc);
 		}
 	}
 
@@ -2114,8 +2115,7 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct page **hpage,
-				 struct collapse_control *cc)
+				 pgoff_t start, struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2126,7 +2126,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2203,13 +2202,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage,
-						     cc);
+				khugepaged_scan_file(mm, file, pgoff, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address,
-						hpage, cc);
+						khugepaged_scan.address, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2267,15 +2264,15 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 	bool wait = true;
 
+	cc->hpage = NULL;
 	lru_add_drain_all();
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
+		if (!khugepaged_prealloc_page(&cc->hpage, &wait))
 			break;
 
 		cond_resched();
@@ -2289,14 +2286,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage, cc);
+							    cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
 	}
 
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		put_page(cc->hpage);
 }
 
 static bool khugepaged_should_wakeup(void)
@@ -2330,6 +2327,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.last_target_node = NUMA_NO_NODE,
+		.alloc_charge_hpage = &alloc_charge_hpage,
 	};
 
 	set_freezable();
-- 
2.36.0.464.gb9c8b46e94-goog




* [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:02   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Pipe enum scan_result codes back through return values of functions
downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
inform callers if the operation was successful, and if not, why.

Since khugepaged_scan_pmd()'s return value already has a specific
meaning (whether mmap_lock was unlocked or not), add a bool* argument
to khugepaged_scan_pmd() to retrieve this information.

Change khugepaged to take action based on the return values of
khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
deep within the collapsing functions themselves.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 72 ++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6095fcb3f07c..1314caed65b0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -732,13 +732,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 						    referenced, writable, result);
-		return 1;
+		return SCAN_SUCCEED;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 					    referenced, writable, result);
-	return 0;
+	return result;
 }
 
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
@@ -1097,9 +1097,9 @@ static int alloc_charge_hpage(struct mm_struct *mm, struct collapse_control *cc)
 	return SCAN_SUCCEED;
 }
 
-static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			       int referenced, int unmapped,
-			       struct collapse_control *cc)
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			      int referenced, int unmapped,
+			      struct collapse_control *cc)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1107,7 +1107,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pgtable_t pgtable;
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
+	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 
@@ -1187,11 +1187,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+	result =  __collapse_huge_page_isolate(vma, address, pte,
+					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
-	if (unlikely(!isolated)) {
+	if (unlikely(result != SCAN_SUCCEED)) {
 		pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
@@ -1239,24 +1239,23 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	cc->hpage = NULL;
 
-	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
 	if (!IS_ERR_OR_NULL(cc->hpage))
 		mem_cgroup_uncharge(page_folio(cc->hpage));
-	trace_mm_collapse_huge_page(mm, isolated, result);
-	return;
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	return result;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address,
+			       unsigned long address, bool *mmap_locked,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
+	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
@@ -1391,18 +1390,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
-		ret = 1;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+	if (result == SCAN_SUCCEED) {
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, referenced, unmapped, cc);
+		*mmap_locked = false;
+		result = collapse_huge_page(mm, address, referenced,
+					    unmapped, cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
 				     none_or_zero, result, unmapped);
-	return ret;
+	return result;
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1679,8 +1679,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm, struct file *file,
-			  pgoff_t start, struct collapse_control *cc)
+static int collapse_file(struct mm_struct *mm, struct file *file,
+			 pgoff_t start, struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *new_page;
@@ -1982,8 +1982,6 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 		 */
 		retract_page_tables(mapping, start);
 		cc->hpage = NULL;
-
-		khugepaged_pages_collapsed++;
 	} else {
 		struct page *page;
 
@@ -2031,10 +2029,11 @@ static void collapse_file(struct mm_struct *mm, struct file *file,
 	if (!IS_ERR_OR_NULL(cc->hpage))
 		mem_cgroup_uncharge(page_folio(cc->hpage));
 	/* TODO: tracepoints */
+	return result;
 }
 
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
+				pgoff_t start, struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2107,15 +2106,16 @@ static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, cc);
+			result = collapse_file(mm, file, start, cc);
 		}
 	}
 
 	/* TODO: tracepoints */
+	return result;
 }
 #else
-static void khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				 pgoff_t start, struct collapse_control *cc)
+static int khugepaged_scan_file(struct mm_struct *mm, struct file *file, pgoff_t start,
+				struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2187,7 +2187,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			goto skip;
 
 		while (khugepaged_scan.address < hend) {
-			int ret;
+			int result;
+			bool mmap_locked = true;
+
 			cond_resched();
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
@@ -2201,17 +2203,21 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, cc);
+				mmap_locked = false;
+				result = khugepaged_scan_file(mm, file, pgoff,
+							      cc);
 				fput(file);
 			} else {
-				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address, cc);
+				result = khugepaged_scan_pmd(mm, vma,
+							     khugepaged_scan.address,
+							     &mmap_locked, cc);
 			}
+			if (result == SCAN_SUCCEED)
+				++khugepaged_pages_collapsed;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-			if (ret)
+			if (!mmap_locked)
 				/* we released mmap_lock so break loop */
 				goto breakouterloop_mmap_lock;
 			if (progress >= pages)
-- 
2.36.0.464.gb9c8b46e94-goog




* [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:03   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_pte_scan_limits flag to struct collapse_control that
allows a context to ignore the sysfs-controlled knobs
khugepaged_max_ptes_[none|swap|shared], and set this flag in the
khugepaged collapse context to preserve existing khugepaged behavior.

This flag will be cleared when the madvise collapse context is
introduced later in this series: there, the user presumably has reason
to believe the collapse will be beneficial, and khugepaged heuristics
shouldn't tell the user they are wrong.
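
For reference, a minimal sketch of how the two contexts are expected to
initialize this flag (the madvise side only arrives later in this
series; the other fields are shown for context only):

	/* khugepaged context: keep honoring the sysfs limits */
	struct collapse_control khugepaged_cc = {
		.enforce_pte_scan_limits = true,
		.last_target_node = NUMA_NO_NODE,
	};

	/* madvise(MADV_COLLAPSE) context: user asked, ignore the limits */
	struct collapse_control madvise_cc = {
		.enforce_pte_scan_limits = false,
		.last_target_node = NUMA_NO_NODE,
	};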

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1314caed65b0..ca730aec0e3e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -87,6 +87,9 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	/* Respect khugepaged_max_ptes_[none|swap|shared] */
+	bool enforce_pte_scan_limits;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -614,6 +617,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -627,7 +631,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -647,8 +652,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits && page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -1187,7 +1192,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	result =  __collapse_huge_page_isolate(vma, address, pte,
+	result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 					       &compound_pagelist);
 	spin_unlock(pte_ptl);
 
@@ -1275,7 +1280,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->enforce_pte_scan_limits) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1294,7 +1300,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1324,8 +1331,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_unmap;
 		}
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits &&
+		    page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -2051,7 +2059,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->enforce_pte_scan_limits &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2102,7 +2111,8 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->enforce_pte_scan_limits) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
@@ -2332,6 +2342,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
+		.enforce_pte_scan_limits = true,
 		.last_target_node = NUMA_NO_NODE,
 		.alloc_charge_hpage = &alloc_charge_hpage,
 	};
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:03   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_young flag to struct collapse_control that allows a
context to ignore the requirement that some pages in the region being
collapsed be young or referenced.  Set this flag in the khugepaged
collapse context to preserve existing khugepaged behavior.

This flag will be cleared when the madvise collapse context is
introduced later in this series: there, the user presumably has reason
to believe the collapse will be beneficial, and khugepaged heuristics
shouldn't tell the user they are wrong.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ca730aec0e3e..b14807b7002e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -90,6 +90,9 @@ struct collapse_control {
 	/* Respect khugepaged_max_ptes_[none|swap|shared] */
 	bool enforce_pte_scan_limits;
 
+	/* Require memory to be young */
+	bool enforce_young;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -720,9 +723,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			list_add_tail(&page->lru, compound_pagelist);
 next:
 		/* There should be enough young pte to collapse the page */
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 
 		if (pte_write(pteval))
@@ -731,7 +735,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	if (unlikely(!writable)) {
 		result = SCAN_PAGE_RO;
-	} else if (unlikely(!referenced)) {
+	} else if (unlikely(cc->enforce_young && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1387,14 +1391,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 	}
 	if (!writable) {
 		result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+	} else if (cc->enforce_young && (!referenced || (unmapped && referenced
+							 < HPAGE_PMD_NR / 2))) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -2343,6 +2349,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.enforce_pte_scan_limits = true,
+		.enforce_young = true,
 		.last_target_node = NUMA_NO_NODE,
 		.alloc_charge_hpage = &alloc_charge_hpage,
 	};
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-05 18:50   ` Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED,
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory with THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is such an
implementation that could benefit from this[2].

Only privately-mapped anon memory is supported for now, but it is
expected that file and shmem support will be added later to support the
use-case of backing executable text by THPs.  The current support
provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system, which might keep services from serving at their full rated load
after (re)starting.  Tricks like mremap(2)'ing text onto anonymous
memory to immediately realize iTLB performance prevent page sharing and
demand paging, both of which increase steady-state memory footprint.
With MADV_COLLAPSE, we get the best of both worlds: peak upfront
performance and lower RAM footprints.

This call respects THP eligibility as determined by the system-wide
/sys/kernel/mm/transparent_hugepage/enabled sysfs settings and the VMA
flags for the memory range being collapsed.

THP allocation may enter direct reclaim and/or compaction.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc
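
For illustration, a userspace caller might use the new mode roughly as
below.  This is a minimal sketch, not part of the patch: it assumes a 2M
PMD size, assumes mmap() honours the hugepage-aligned hint address, and
defines MADV_COLLAPSE locally the same way the selftests later in this
series do:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* generic value used by this series */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;			/* one PMD-sized region */
		void *hint = (void *)(1UL << 30);	/* hugepage-aligned hint */
		char *p;

		p = mmap(hint, len, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		memset(p, 1, len);	/* fault in native pages */

		/* Synchronously collapse the range, at our own expense. */
		if (madvise(p, len, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");

		munmap(p, len);
		return 0;
	}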

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 ++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 167 +++++++++++++++++++++++--
 mm/madvise.c                           |   5 +
 8 files changed, 182 insertions(+), 12 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a26bd10e083..4a2ea1b5437c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -222,6 +222,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -378,6 +381,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b14807b7002e..165c646ddb8f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -837,6 +837,22 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
+static bool alloc_hpage(gfp_t gfp, int node, struct collapse_control *cc)
+{
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
+
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
+		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		cc->hpage = ERR_PTR(-ENOMEM);
+		return false;
+	}
+
+	prep_transhuge_page(cc->hpage);
+	count_vm_event(THP_COLLAPSE_ALLOC);
+	return true;
+}
+
 #ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(struct collapse_control *cc)
 {
@@ -882,18 +898,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static bool khugepaged_alloc_page(gfp_t gfp, int node,
 				  struct collapse_control *cc)
 {
-	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
-
-	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!cc->hpage)) {
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		cc->hpage = ERR_PTR(-ENOMEM);
-		return false;
-	}
-
-	prep_transhuge_page(cc->hpage);
-	count_vm_event(THP_COLLAPSE_ALLOC);
-	return true;
+	return alloc_hpage(gfp, node, cc);
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -2457,3 +2462,141 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static void madvise_collapse_cleanup_page(struct page **hpage)
+{
+	if (!IS_ERR(*hpage) && *hpage)
+		put_page(*hpage);
+	*hpage = NULL;
+}
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int madvise_alloc_charge_hpage(struct mm_struct *mm,
+				      struct collapse_control *cc)
+{
+	if (!alloc_hpage(GFP_TRANSHUGE, khugepaged_find_target_node(cc), cc))
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+	if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm,
+				       GFP_TRANSHUGE)))
+		return SCAN_CGROUP_CHARGE_FAIL;
+	count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
+	return SCAN_SUCCEED;
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.enforce_pte_scan_limits = false,
+		.enforce_young = false,
+		.last_target_node = NUMA_NO_NODE,
+		.hpage = NULL,
+		.alloc_charge_hpage = &madvise_alloc_charge_hpage,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, nr_hpages = 0, result = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+
+	if (hstart >= hend || !transparent_hugepage_active(vma))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; ; ) {
+		mmap_assert_locked(mm);
+		cond_resched();
+		result = SCAN_FAIL;
+
+		if (unlikely(khugepaged_test_exit(mm))) {
+			result = SCAN_ANY_PROCESS;
+			break;
+		}
+
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* tell madvise we dropped mmap_lock */
+
+		switch (result) {
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+			fallthrough;
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			break;
+		case SCAN_PAGE_LRU:
+			lru_add_drain_all();
+			goto retry;
+		default:
+			/* Other error, exit */
+			goto break_loop;
+		}
+		addr += HPAGE_PMD_SIZE;
+		if (addr >= hend)
+			break;
+retry:
+		if (!mmap_locked) {
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma);
+			if (result)
+				goto out;
+		}
+		madvise_collapse_cleanup_page(&cc.hpage);
+	}
+
+break_loop:
+	/* madvise_walk_vmas() expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out:
+	mmap_assert_locked(mm);
+	madvise_collapse_cleanup_page(&cc.hpage);
+	mmdrop(mm);
+
+	return thps == nr_hpages ? 0 : madvise_collapse_errno(result);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 5f4537511532..638517952bd2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1054,6 +1055,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1147,6 +1150,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1336,6 +1340,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-12 20:03   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

The following functions/tracepoints are shared between khugepaged and
madvise collapse contexts.  Replace the "khugepaged_" prefix with a
generic "hpage_collapse_" prefix in such cases:

huge_memory:mm_khugepaged_scan_pmd -> huge_memory:mm_hpage_collapse_scan_pmd
khugepaged_test_exit() -> hpage_collapse_test_exit()
khugepaged_scan_abort() -> hpage_collapse_scan_abort()
khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
khugepaged_find_target_node() -> hpage_collapse_find_target_node()

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  2 +-
 mm/khugepaged.c                    | 72 ++++++++++++++++--------------
 2 files changed, 39 insertions(+), 35 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 55392bf30a03..fb6c73632ff3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -48,7 +48,7 @@ SCAN_STATUS
 #define EM(a, b)	{a, b},
 #define EMe(a, b)	{a, b}
 
-TRACE_EVENT(mm_khugepaged_scan_pmd,
+TRACE_EVENT(mm_hpage_collapse_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
 		 int referenced, int none_or_zero, int status, int unmapped),
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 165c646ddb8f..44e31e072124 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -96,7 +96,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() */
+	/* Last target selected in hpage_collapse_find_target_node() */
 	int last_target_node;
 
 	struct page *hpage;
@@ -453,7 +453,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -502,7 +502,7 @@ int __khugepaged_enter(struct mm_struct *mm)
 		return -ENOMEM;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -566,11 +566,10 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_lock.
+		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we return all
+		 * pagetables will be destroyed) until khugepaged has finished
+		 * working on the pagetables under the mmap_lock.
 		 */
 		mmap_write_lock(mm);
 		mmap_write_unlock(mm);
@@ -807,7 +806,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
+static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -854,7 +853,7 @@ static bool alloc_hpage(gfp_t gfp, int node, struct collapse_control *cc)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -901,7 +900,7 @@ static bool khugepaged_alloc_page(gfp_t gfp, int node,
 	return alloc_hpage(gfp, node, cc);
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -982,7 +981,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long hstart, hend;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -1026,7 +1025,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held,
  * but with mmap_lock held to protect against vma changes.
@@ -1093,7 +1092,7 @@ static int alloc_charge_hpage(struct mm_struct *mm, struct collapse_control *cc)
 	const struct cpumask *cpumask;
 #endif
 	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-	int node = khugepaged_find_target_node(cc);
+	int node = hpage_collapse_find_target_node(cc);
 
 #ifdef CONFIG_NUMA
 	/* sched to specified node before huge page memory copy */
@@ -1263,9 +1262,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, bool *mmap_locked,
-			       struct collapse_control *cc)
+static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long address, bool *mmap_locked,
+				   struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1357,7 +1357,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1419,8 +1419,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 					    unmapped, cc);
 	}
 out:
-	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, result, unmapped);
+	trace_mm_hpage_collapse_scan_pmd(mm, page, writable, referenced,
+					 none_or_zero, result, unmapped);
 	return result;
 }
 
@@ -1430,7 +1430,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (hpage_collapse_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1601,7 +1601,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(hpage_collapse_test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1664,7 +1664,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
+			if (!hpage_collapse_test_exit(mm) &&
+			    !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -2089,7 +2090,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2178,7 +2179,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	vma = NULL;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
-	if (likely(!khugepaged_test_exit(mm)))
+	if (likely(!hpage_collapse_test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
 
 	progress++;
@@ -2186,7 +2187,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(hpage_collapse_test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2212,7 +2213,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(hpage_collapse_test_exit(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2229,9 +2230,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 							      cc);
 				fput(file);
 			} else {
-				result = khugepaged_scan_pmd(mm, vma,
-							     khugepaged_scan.address,
-							     &mmap_locked, cc);
+				result = hpage_collapse_scan_pmd(mm, vma,
+								 khugepaged_scan.address,
+								 &mmap_locked,
+								 cc);
 			}
 			if (result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
@@ -2255,7 +2257,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2495,7 +2497,8 @@ static int madvise_collapse_errno(enum scan_result r)
 static int madvise_alloc_charge_hpage(struct mm_struct *mm,
 				      struct collapse_control *cc)
 {
-	if (!alloc_hpage(GFP_TRANSHUGE, khugepaged_find_target_node(cc), cc))
+	if (!alloc_hpage(GFP_TRANSHUGE, hpage_collapse_find_target_node(cc),
+			 cc))
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm,
 				       GFP_TRANSHUGE)))
@@ -2542,13 +2545,14 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		cond_resched();
 		result = SCAN_FAIL;
 
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(hpage_collapse_test_exit(mm))) {
 			result = SCAN_ANY_PROCESS;
 			break;
 		}
 
 		memset(cc.node_load, 0, sizeof(cc.node_load));
-		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
+						 &cc);
 		if (!mmap_locked)
 			*prev = NULL;  /* tell madvise we dropped mmap_lock */
 
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-11  0:49   ` Rongwei Wang
  2022-05-12 20:03   ` David Rientjes
  2022-05-04 21:44 ` [PATCH v5 11/13] selftests/vm: modularize collapse selftests Zach O'Keefe
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting collapse of its own memory.
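
For illustration only (not part of this patch), a caller might collapse
a range in a target process roughly as below.  The sketch assumes
__NR_process_madvise is provided by the installed kernel headers and
that the caller already holds a pidfd for the target; the helper name is
made up:

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* generic value used by this series */
	#endif

	/*
	 * Collapse [addr, addr + len) in the process referred to by pidfd.
	 * Needs CAP_SYS_ADMIN unless the target is the caller itself.
	 */
	static long collapse_range(int pidfd, void *addr, size_t len)
	{
		struct iovec iov = {
			.iov_base = addr,
			.iov_len = len,
		};

		return syscall(__NR_process_madvise, pidfd, &iov, 1UL,
			       MADV_COLLAPSE, 0U);
	}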

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 638517952bd2..08c11217025a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 11/13] selftests/vm: modularize collapse selftests
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 12/13] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 13/13] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
  12 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize the collapse action of the khugepaged collapse selftests by
introducing a struct collapse_context, which specifies how to collapse a
given memory range and the expected semantics of the collapse.  This can
be reused later to test other collapse contexts.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 257 +++++++++++-------------
 1 file changed, 116 insertions(+), 141 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 155120b67a16..c59d832fee96 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -23,6 +23,12 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct collapse_context {
+	const char *name;
+	void (*collapse)(const char *msg, char *p, bool expect);
+	bool enforce_pte_scan_limits;
+};
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -528,53 +534,39 @@ static void alloc_at_fault(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(void)
+static void collapse_full(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, hpage_pmd_size);
-	if (wait_for_scan("Collapse fully populated PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse fully populated PTE table", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_empty(void)
+static void collapse_empty(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
-	if (wait_for_scan("Do not collapse empty PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Do not collapse empty PTE table", p, false);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(void)
+static void collapse_single_pte_entry(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, page_size);
-	if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE entry present", p,
+			  true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(void)
+static void collapse_max_ptes_none(struct collapse_context *context)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = default_settings;
@@ -586,28 +578,23 @@ static void collapse_max_ptes_none(void)
 	p = alloc_mapping();
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	if (wait_for_scan("Do not collapse with max_ptes_none exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_none exceeded", p,
+			  !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-	if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		context->collapse("Collapse with max_ptes_none PTEs empty", p,
+				  true);
+		validate_memory(p, 0,
+				(hpage_pmd_nr - max_ptes_none) * page_size);
+	}
 
 	munmap(p, hpage_pmd_size);
 	write_settings(&default_settings);
 }
 
-static void collapse_swapin_single_pte(void)
+static void collapse_swapin_single_pte(struct collapse_context *context)
 {
 	void *p;
 	p = alloc_mapping();
@@ -625,18 +612,14 @@ static void collapse_swapin_single_pte(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Collapse with swapping in single PTE entry", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse with swapping in single PTE entry",
+			  p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(void)
+static void collapse_max_ptes_swap(struct collapse_context *context)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
@@ -656,39 +639,34 @@ static void collapse_max_ptes_swap(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Do not collapse with max_ptes_swap exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_swap exceeded",
+			      p, !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
-	printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
-	if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
-		perror("madvise(MADV_PAGEOUT)");
-		exit(EXIT_FAILURE);
-	}
-	if (check_swap(p, max_ptes_swap * page_size)) {
-		success("OK");
-	} else {
-		fail("Fail");
-		goto out;
-	}
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, hpage_pmd_size);
+		printf("Swapout %d of %d pages...", max_ptes_swap,
+		       hpage_pmd_nr);
+		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+			perror("madvise(MADV_PAGEOUT)");
+			exit(EXIT_FAILURE);
+		}
+		if (check_swap(p, max_ptes_swap * page_size)) {
+			success("OK");
+		} else {
+			fail("Fail");
+			goto out;
+		}
 
-	if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, hpage_pmd_size);
+		context->collapse("Collapse with max_ptes_swap pages swapped out",
+				  p, true);
+		validate_memory(p, 0, hpage_pmd_size);
+	}
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(void)
+static void collapse_single_pte_entry_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -710,17 +688,13 @@ static void collapse_single_pte_entry_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table with single PTE mapping compound page", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE mapping compound page",
+			  p, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(void)
+static void collapse_full_of_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -742,17 +716,12 @@ static void collapse_full_of_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of compound pages", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(void)
+static void collapse_compound_extreme(struct collapse_context *context)
 {
 	void *p;
 	int i;
@@ -798,18 +767,14 @@ static void collapse_compound_extreme(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of different compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of different compound pages",
+			  p, true);
 
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork(void)
+static void collapse_fork(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -835,13 +800,8 @@ static void collapse_fork(void)
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
-
-		if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table with single page shared with parent process",
+				  p, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -860,7 +820,7 @@ static void collapse_fork(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(void)
+static void collapse_fork_compound(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -896,14 +856,10 @@ static void collapse_fork_compound(void)
 		fill_memory(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
-		if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table full of compound pages in child",
+				  p, true);
 		write_num("khugepaged/max_ptes_shared",
-				default_settings.khugepaged.max_ptes_shared);
+			  default_settings.khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -922,7 +878,7 @@ static void collapse_fork_compound(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared()
+static void collapse_max_ptes_shared(struct collapse_context *context)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
@@ -957,28 +913,22 @@ static void collapse_max_ptes_shared()
 		else
 			fail("Fail");
 
-		if (wait_for_scan("Do not collapse with max_ptes_shared exceeded", p))
-			fail("Timeout");
-		else if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
-		if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-
-		if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Maybe collapse with max_ptes_shared exceeded",
+				  p, !context->enforce_pte_scan_limits);
+
+		if (context->enforce_pte_scan_limits) {
+			printf("Trigger CoW on page %d of %d...",
+			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+				    page_size);
+			if (!check_huge(p))
+				success("OK");
+			else
+				fail("Fail");
+
+			context->collapse("Collapse with max_ptes_shared PTEs shared",
+					  p, true);
+		}
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -997,8 +947,27 @@ static void collapse_max_ptes_shared()
 	munmap(p, hpage_pmd_size);
 }
 
+static void khugepaged_collapse(const char *msg, char *p, bool expect)
+{
+	if (wait_for_scan(msg, p))
+		fail("Timeout");
+	else if (check_huge(p) == expect)
+		success("OK");
+	else
+		fail("Fail");
+}
+
 int main(void)
 {
+	struct collapse_context contexts[] = {
+		{
+			.name = "khugepaged",
+			.collapse = &khugepaged_collapse,
+			.enforce_pte_scan_limits = true,
+		},
+	};
+	int i;
+
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
@@ -1014,18 +983,24 @@ int main(void)
 	adjust_settings();
 
 	alloc_at_fault();
-	collapse_full();
-	collapse_empty();
-	collapse_single_pte_entry();
-	collapse_max_ptes_none();
-	collapse_swapin_single_pte();
-	collapse_max_ptes_swap();
-	collapse_single_pte_entry_compound();
-	collapse_full_of_compound();
-	collapse_compound_extreme();
-	collapse_fork();
-	collapse_fork_compound();
-	collapse_max_ptes_shared();
+
+	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
+		struct collapse_context *c = &contexts[i];
+
+		printf("\n*** Testing context: %s ***\n", c->name);
+		collapse_full(c);
+		collapse_empty(c);
+		collapse_single_pte_entry(c);
+		collapse_max_ptes_none(c);
+		collapse_swapin_single_pte(c);
+		collapse_max_ptes_swap(c);
+		collapse_single_pte_entry_compound(c);
+		collapse_full_of_compound(c);
+		collapse_compound_extreme(c);
+		collapse_fork(c);
+		collapse_fork_compound(c);
+		collapse_max_ptes_shared(c);
+	}
 
 	restore_settings(0);
 }
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 12/13] selftests/vm: add MADV_COLLAPSE collapse context to selftests
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 11/13] selftests/vm: modularize collapse selftests Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  2022-05-04 21:44 ` [PATCH v5 13/13] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
  12 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add MADV_COLLAPSE selftests.  Extend struct collapse_context to support
context initialization/cleanup.  This is used by the madvise collapse
context to "disable" and "enable" khugepaged, since it would otherwise
interfere with the tests.

The mechanism used to "disable" khugepaged is a hack: it sets
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to a
large value and feeds khugepaged enough suitable VMAs/pages to keep
khugepaged sleeping for the duration of the madvise collapse tests.
Since khugepaged is woken when this file is written, enough VMAs must be
queued to put khugepaged back to sleep when the tests write to this file
in write_settings().
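
The number of potential wakeups is counted statically with the
__COUNTER__ preprocessor extension.  A standalone illustration of that
trick follows (the NOTE_WAKEUP() macro is hypothetical and not part of
the selftest):

	#include <stdio.h>

	/* Each expansion of the macro bumps __COUNTER__ by one. */
	#define NOTE_WAKEUP() do { (void)__COUNTER__; } while (0)

	int main(void)
	{
		NOTE_WAKEUP();			/* expands __COUNTER__ -> 0 */
		NOTE_WAKEUP();			/* expands __COUNTER__ -> 1 */
		printf("%d\n", __COUNTER__);	/* prints 2: prior call sites */
		return 0;
	}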

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 133 ++++++++++++++++++++++--
 1 file changed, 125 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index c59d832fee96..e0ccc9443f78 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -14,17 +14,23 @@
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
 
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
 static unsigned long page_size;
 static int hpage_pmd_nr;
+static int num_khugepaged_wakeups;
 
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
 struct collapse_context {
 	const char *name;
+	bool (*init_context)(void);
+	bool (*cleanup_context)(void);
 	void (*collapse)(const char *msg, char *p, bool expect);
 	bool enforce_pte_scan_limits;
 };
@@ -264,6 +270,17 @@ static void write_num(const char *name, unsigned long num)
 	}
 }
 
+/*
+ * Use this macro instead of write_settings inside tests, and should
+ * be called at most once per callsite.
+ *
+ * Hack to statically count the number of times khugepaged is woken up due to
+ * writes to
+ * /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs,
+ * and is stored in __COUNTER__.
+ */
+#define WRITE_SETTINGS(s) do { __COUNTER__; write_settings(s); } while (0)
+
 static void write_settings(struct settings *settings)
 {
 	struct khugepaged_settings *khugepaged = &settings->khugepaged;
@@ -332,7 +349,7 @@ static void adjust_settings(void)
 {
 
 	printf("Adjust settings...");
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 	success("OK");
 }
 
@@ -440,20 +457,25 @@ static bool check_swap(void *addr, unsigned long size)
 	return swap;
 }
 
-static void *alloc_mapping(void)
+static void *alloc_mapping_at(void *at, size_t size)
 {
 	void *p;
 
-	p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
-	if (p != BASE_ADDR) {
-		printf("Failed to allocate VMA at %p\n", BASE_ADDR);
+	p = mmap(at, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE,
+		 -1, 0);
+	if (p != at) {
+		printf("Failed to allocate VMA at %p\n", at);
 		exit(EXIT_FAILURE);
 	}
 
 	return p;
 }
 
+static void *alloc_mapping(void)
+{
+	return alloc_mapping_at(BASE_ADDR, hpage_pmd_size);
+}
+
 static void fill_memory(int *p, unsigned long start, unsigned long end)
 {
 	int i;
@@ -573,7 +595,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	void *p;
 
 	settings.khugepaged.max_ptes_none = max_ptes_none;
-	write_settings(&settings);
+	WRITE_SETTINGS(&settings);
 
 	p = alloc_mapping();
 
@@ -591,7 +613,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	}
 
 	munmap(p, hpage_pmd_size);
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 }
 
 static void collapse_swapin_single_pte(struct collapse_context *context)
@@ -947,6 +969,87 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse(const char *msg, char *p, bool expect)
+{
+	int ret;
+
+	printf("%s...", msg);
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (((bool)ret) == expect)
+		fail("Fail: Bad return value");
+	else if (check_huge(p) != expect)
+		fail("Fail: check_huge()");
+	else
+		success("OK");
+}
+
+static struct khugepaged_disable_state {
+	void *p;
+	size_t map_size;
+} khugepaged_disable_state;
+
+static bool disable_khugepaged(void)
+{
+	/*
+	 * Hack to "disable" khugepaged by setting
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs to some large
+	 * value, then feeding it enough suitable VMAs to scan and subsequently
+	 * sleep.
+	 *
+	 * khugepaged is woken up on writes to
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs, so care must
+	 * be taken to not inadvertently wake khugepaged in these tests.
+	 *
+	 * Feed khugepaged 1 hugepage-sized VMA to scan and sleep on, then
+	 * N more for each time khugepaged would be woken up.
+	 */
+	size_t map_size = (num_khugepaged_wakeups + 1) * hpage_pmd_size;
+	void *p;
+	bool ret = true;
+	int full_scans;
+	int timeout = 6;  /* 3 seconds */
+
+	default_settings.khugepaged.scan_sleep_millisecs = 1000 * 60 * 10;
+	default_settings.khugepaged.pages_to_scan = 1;
+	write_settings(&default_settings);
+
+	p = alloc_mapping_at(((char *)BASE_ADDR) + (1UL << 30), map_size);
+	fill_memory(p, 0, map_size);
+
+	full_scans = read_num("khugepaged/full_scans") + 2;
+
+	printf("disabling khugepaged...");
+	while (timeout--) {
+		if (read_num("khugepaged/full_scans") >= full_scans) {
+			fail("Fail");
+			ret = false;
+			break;
+		}
+		printf(".");
+		usleep(TICK);
+	}
+	success("OK");
+	khugepaged_disable_state.p = p;
+	khugepaged_disable_state.map_size = map_size;
+	return ret;
+}
+
+static bool enable_khugepaged(void)
+{
+	printf("enabling khugepaged...");
+	munmap(khugepaged_disable_state.p, khugepaged_disable_state.map_size);
+	write_settings(&saved_settings);
+	success("OK");
+	return true;
+}
+
 static void khugepaged_collapse(const char *msg, char *p, bool expect)
 {
 	if (wait_for_scan(msg, p))
@@ -962,9 +1065,18 @@ int main(void)
 	struct collapse_context contexts[] = {
 		{
 			.name = "khugepaged",
+			.init_context = NULL,
+			.cleanup_context = NULL,
 			.collapse = &khugepaged_collapse,
 			.enforce_pte_scan_limits = true,
 		},
+		{
+			.name = "madvise",
+			.init_context = &disable_khugepaged,
+			.cleanup_context = &enable_khugepaged,
+			.collapse = &madvise_collapse,
+			.enforce_pte_scan_limits = false,
+		},
 	};
 	int i;
 
@@ -973,6 +1085,7 @@ int main(void)
 	page_size = getpagesize();
 	hpage_pmd_size = read_num("hpage_pmd_size");
 	hpage_pmd_nr = hpage_pmd_size / page_size;
+	num_khugepaged_wakeups = __COUNTER__;
 
 	default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
 	default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
@@ -988,6 +1101,8 @@ int main(void)
 		struct collapse_context *c = &contexts[i];
 
 		printf("\n*** Testing context: %s ***\n", c->name);
+		if (c->init_context && !c->init_context())
+			continue;
 		collapse_full(c);
 		collapse_empty(c);
 		collapse_single_pte_entry(c);
@@ -1000,6 +1115,8 @@ int main(void)
 		collapse_fork(c);
 		collapse_fork_compound(c);
 		collapse_max_ptes_shared(c);
+		if (c->cleanup_context && !c->cleanup_context())
+			break;
 	}
 
 	restore_settings(0);
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 13/13] selftests/vm: add test to verify recollapse of THPs
  2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-05-04 21:44 ` [PATCH v5 12/13] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
@ 2022-05-04 21:44 ` Zach O'Keefe
  12 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-04 21:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a selftest, specific to the madvise collapse context, that tests
that MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region
is already pmd-mapped.
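
A rough sketch of the expected behavior from userspace (error handling
omitted; the MADV_HUGEPAGE is there for the THP=madvise case):

	/* p is hugepage-aligned/sized and fully populated */
	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
	madvise(p, hpage_pmd_size, MADV_COLLAPSE);	/* collapses, returns 0 */
	madvise(p, hpage_pmd_size, MADV_COLLAPSE);	/* already pmd-mapped, still 0 */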

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 32 +++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index e0ccc9443f78..c36d04218083 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -969,6 +969,32 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse_existing_thps(void)
+{
+	void *p;
+	int err;
+
+	p = alloc_mapping();
+	fill_memory(p, 0, hpage_pmd_size);
+
+	printf("Collapse fully populated PTE table...");
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (err == 0 && check_huge(p)) {
+		success("OK");
+		printf("Re-collapse PMD-mapped hugepage");
+		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+		if (err == 0 && check_huge(p))
+			success("OK");
+		else
+			fail("Fail");
+	} else {
+		fail("Fail");
+	}
+	validate_memory(p, 0, hpage_pmd_size);
+	munmap(p, hpage_pmd_size);
+}
+
 static void madvise_collapse(const char *msg, char *p, bool expect)
 {
 	int ret;
@@ -1097,6 +1123,7 @@ int main(void)
 
 	alloc_at_fault();
 
+	/* Shared tests */
 	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
 		struct collapse_context *c = &contexts[i];
 
@@ -1119,5 +1146,10 @@ int main(void)
 			break;
 	}
 
+	/* madvise-specific tests */
+	disable_khugepaged();
+	madvise_collapse_existing_thps();
+	enable_khugepaged();
+
 	restore_settings(0);
 }
-- 
2.36.0.464.gb9c8b46e94-goog



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-05-04 21:44 ` [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-05-05 18:50   ` Zach O'Keefe
  2022-05-05 18:58     ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-05 18:50 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, May 4, 2022 at 2:45 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> This idea was introduced by David Rientjes[1].
>
> Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> synchronous collapse of memory at their own expense.
>
> The benefits of this approach are:
>
> * CPU is charged to the process that wants to spend the cycles for the
>   THP
> * Avoid unpredictable timing of khugepaged collapse
>
> Immediate users of this new functionality are malloc() implementations
> that manage memory in hugepage-sized chunks, but sometimes subrelease
> memory back to the system in native-sized chunks via MADV_DONTNEED;
> zapping the pmd.  Later, when the memory is hot, the implementation
> could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> hugepage coverage and dTLB performance.  TCMalloc is such an
> implementation that could benefit from this[2].
>
> Only privately-mapped anon memory is supported for now, but it is
> expected that file and shmem support will be added later to support the
> use-case of backing executable text by THPs.  Current support provided
> by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> which might impair services from serving at their full rated load after
> (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> immediately realize iTLB performance prevents page sharing and demand
> paging, both of which increase steady state memory footprint.  With
> MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> and lower RAM footprints.
>
> This call respects THP eligibility as determined by the system-wide
> /sys/kernel/mm/transparent_hugepage/enabled sysfs settings and the VMA
> flags for the memory range being collapsed.
>
> THP allocation may enter direct reclaim and/or compaction.
>
> [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
>
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h     |   2 +
>  arch/mips/include/uapi/asm/mman.h      |   2 +
>  arch/parisc/include/uapi/asm/mman.h    |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   2 +
>  include/linux/huge_mm.h                |  12 ++
>  include/uapi/asm-generic/mman-common.h |   2 +
>  mm/khugepaged.c                        | 167 +++++++++++++++++++++++--
>  mm/madvise.c                           |   5 +
>  8 files changed, 182 insertions(+), 12 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 4aa996423b0d..763929e814e9 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -76,6 +76,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 1be428663c10..c6e1fc77c996 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -103,6 +103,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index a7ea3204a5fa..22133a6a506e 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -70,6 +70,8 @@
>  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
>
> +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> +
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 7966a58af472..1ff0c858544f 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -111,6 +111,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9a26bd10e083..4a2ea1b5437c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -222,6 +222,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                      int advice);
> +int madvise_collapse(struct vm_area_struct *vma,
> +                    struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                            unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -378,6 +381,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>         BUG();
>         return 0;
>  }
> +
> +static inline int madvise_collapse(struct vm_area_struct *vma,
> +                                  struct vm_area_struct **prev,
> +                                  unsigned long start, unsigned long end)
> +{
> +       BUG();
> +       return 0;
> +}
> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                          unsigned long start,
>                                          unsigned long end,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b14807b7002e..165c646ddb8f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -837,6 +837,22 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>         return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
>  }
>
> +static bool alloc_hpage(gfp_t gfp, int node, struct collapse_control *cc)
> +{
> +       VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
> +
> +       cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> +       if (unlikely(!cc->hpage)) {
> +               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> +               cc->hpage = ERR_PTR(-ENOMEM);
> +               return false;
> +       }
> +
> +       prep_transhuge_page(cc->hpage);
> +       count_vm_event(THP_COLLAPSE_ALLOC);
> +       return true;
> +}
> +
>  #ifdef CONFIG_NUMA
>  static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
> @@ -882,18 +898,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>  static bool khugepaged_alloc_page(gfp_t gfp, int node,
>                                   struct collapse_control *cc)
>  {
> -       VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
> -
> -       cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> -       if (unlikely(!cc->hpage)) {
> -               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> -               cc->hpage = ERR_PTR(-ENOMEM);
> -               return false;
> -       }
> -
> -       prep_transhuge_page(cc->hpage);
> -       count_vm_event(THP_COLLAPSE_ALLOC);
> -       return true;
> +       return alloc_hpage(gfp, node, cc);
>  }
>  #else
>  static int khugepaged_find_target_node(struct collapse_control *cc)
> @@ -2457,3 +2462,141 @@ void khugepaged_min_free_kbytes_update(void)
>                 set_recommended_min_free_kbytes();
>         mutex_unlock(&khugepaged_mutex);
>  }
> +
> +static void madvise_collapse_cleanup_page(struct page **hpage)
> +{
> +       if (!IS_ERR(*hpage) && *hpage)
> +               put_page(*hpage);
> +       *hpage = NULL;
> +}
> +
> +static int madvise_collapse_errno(enum scan_result r)
> +{
> +       switch (r) {
> +       case SCAN_PMD_NULL:
> +       case SCAN_ADDRESS_RANGE:
> +       case SCAN_VMA_NULL:
> +       case SCAN_PTE_NON_PRESENT:
> +       case SCAN_PAGE_NULL:
> +               /*
> +                * Addresses in the specified range are not currently mapped,
> +                * or are outside the AS of the process.
> +                */
> +               return -ENOMEM;
> +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> +       case SCAN_CGROUP_CHARGE_FAIL:
> +               /* A kernel resource was temporarily unavailable. */
> +               return -EAGAIN;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static int madvise_alloc_charge_hpage(struct mm_struct *mm,
> +                                     struct collapse_control *cc)
> +{
> +       if (!alloc_hpage(GFP_TRANSHUGE, khugepaged_find_target_node(cc), cc))
> +               return SCAN_ALLOC_HUGE_PAGE_FAIL;
> +       if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm,
> +                                      GFP_TRANSHUGE)))
> +               return SCAN_CGROUP_CHARGE_FAIL;
> +       count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
> +       return SCAN_SUCCEED;
> +}
> +
> +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end)
> +{
> +       struct collapse_control cc = {
> +               .enforce_pte_scan_limits = false,
> +               .enforce_young = false,
> +               .last_target_node = NUMA_NO_NODE,
> +               .hpage = NULL,
> +               .alloc_charge_hpage = &madvise_alloc_charge_hpage,
> +       };
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long hstart, hend, addr;
> +       int thps = 0, nr_hpages = 0, result = SCAN_FAIL;
> +       bool mmap_locked = true;
> +
> +       BUG_ON(vma->vm_start > start);
> +       BUG_ON(vma->vm_end < end);
> +
> +       *prev = vma;
> +
> +       if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
> +               return -EINVAL;
> +
> +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> +       hend = end & HPAGE_PMD_MASK;
> +       nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
> +
> +       if (hstart >= hend || !transparent_hugepage_active(vma))
> +               return -EINVAL;
> +
> +       mmgrab(mm);
> +       lru_add_drain();
> +
> +       for (addr = hstart; ; ) {
> +               mmap_assert_locked(mm);
> +               cond_resched();
> +               result = SCAN_FAIL;
> +

Made a last minute change to khugepaged_scan_pmd() which no longer
cleared mmap_locked, so it needs to be reset here. Sorry about that -
testing was only collapsing 1 huge page so it wasn't caught.

> +               if (unlikely(khugepaged_test_exit(mm))) {
> +                       result = SCAN_ANY_PROCESS;
> +                       break;
> +               }
> +
> +               memset(cc.node_load, 0, sizeof(cc.node_load));
> +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> +               if (!mmap_locked)
> +                       *prev = NULL;  /* tell madvise we dropped mmap_lock */
> +
> +               switch (result) {
> +               /* Whitelisted set of results where continuing OK */
> +               case SCAN_SUCCEED:
> +               case SCAN_PMD_MAPPED:
> +                       ++thps;
> +                       fallthrough;
> +               case SCAN_PMD_NULL:
> +               case SCAN_PTE_NON_PRESENT:
> +               case SCAN_PTE_UFFD_WP:
> +               case SCAN_PAGE_RO:
> +               case SCAN_LACK_REFERENCED_PAGE:
> +               case SCAN_PAGE_NULL:
> +               case SCAN_PAGE_COUNT:
> +               case SCAN_PAGE_LOCK:
> +               case SCAN_PAGE_COMPOUND:
> +                       break;
> +               case SCAN_PAGE_LRU:
> +                       lru_add_drain_all();
> +                       goto retry;
> +               default:
> +                       /* Other error, exit */
> +                       goto break_loop;
> +               }
> +               addr += HPAGE_PMD_SIZE;
> +               if (addr >= hend)
> +                       break;
> +retry:
> +               if (!mmap_locked) {
> +                       mmap_read_lock(mm);
> +                       mmap_locked = true;
> +                       result = hugepage_vma_revalidate(mm, addr, &vma);
> +                       if (result)
> +                               goto out;
> +               }
> +               madvise_collapse_cleanup_page(&cc.hpage);
> +       }
> +
> +break_loop:
> +       /* madvise_walk_vmas() expects us to hold mmap_lock on return */
> +       if (!mmap_locked)
> +               mmap_read_lock(mm);
> +out:
> +       mmap_assert_locked(mm);
> +       madvise_collapse_cleanup_page(&cc.hpage);
> +       mmdrop(mm);
> +
> +       return thps == nr_hpages ? 0 : madvise_collapse_errno(result);
> +}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5f4537511532..638517952bd2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_FREE:
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
> +       case MADV_COLLAPSE:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -1054,6 +1055,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_COLLAPSE:
> +               return madvise_collapse(vma, prev, start, end);
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1147,6 +1150,7 @@ madvise_behavior_valid(int behavior)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
> +       case MADV_COLLAPSE:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1336,6 +1340,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.36.0.464.gb9c8b46e94-goog
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-05-05 18:50   ` Zach O'Keefe
@ 2022-05-05 18:58     ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-05 18:58 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Thu, May 5, 2022 at 11:50 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Wed, May 4, 2022 at 2:45 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > This idea was introduced by David Rientjes[1].
> >
> > Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
> > synchronous collapse of memory at their own expense.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the
> >   THP
> > * Avoid unpredictable timing of khugepaged collapse
> >
> > Immediate users of this new functionality are malloc() implementations
> > that manage memory in hugepage-sized chunks, but sometimes subrelease
> > memory back to the system in native-sized chunks via MADV_DONTNEED;
> > zapping the pmd.  Later, when the memory is hot, the implementation
> > could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
> > hugepage coverage and dTLB performance.  TCMalloc is such an
> > implementation that could benefit from this[2].
> >
> > Only privately-mapped anon memory is supported for now, but it is
> > expected that file and shmem support will be added later to support the
> > use-case of backing executable text by THPs.  Current support provided
> > by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
> > which might impair services from serving at their full rated load after
> > (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
> > immediately realize iTLB performance prevents page sharing and demand
> > paging, both of which increase steady state memory footprint.  With
> > MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
> > and lower RAM footprints.
> >
> > This call respects THP eligibility as determined by the system-wide
> > /sys/kernel/mm/transparent_hugepage/enabled sysfs settings and the VMA
> > flags for the memory range being collapsed.
> >
> > THP allocation may enter direct reclaim and/or compaction.
> >
> > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> > [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
> >
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |   2 +
> >  arch/mips/include/uapi/asm/mman.h      |   2 +
> >  arch/parisc/include/uapi/asm/mman.h    |   2 +
> >  arch/xtensa/include/uapi/asm/mman.h    |   2 +
> >  include/linux/huge_mm.h                |  12 ++
> >  include/uapi/asm-generic/mman-common.h |   2 +
> >  mm/khugepaged.c                        | 167 +++++++++++++++++++++++--
> >  mm/madvise.c                           |   5 +
> >  8 files changed, 182 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 4aa996423b0d..763929e814e9 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -76,6 +76,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index 1be428663c10..c6e1fc77c996 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -103,6 +103,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index a7ea3204a5fa..22133a6a506e 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -70,6 +70,8 @@
> >  #define MADV_WIPEONFORK 71             /* Zero memory on fork, child only */
> >  #define MADV_KEEPONFORK 72             /* Undo MADV_WIPEONFORK */
> >
> > +#define MADV_COLLAPSE  73              /* Synchronous hugepage collapse */
> > +
> >  #define MADV_HWPOISON     100          /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 7966a58af472..1ff0c858544f 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -111,6 +111,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 9a26bd10e083..4a2ea1b5437c 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -222,6 +222,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >
> >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +                    struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -378,6 +381,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >         BUG();
> >         return 0;
> >  }
> > +
> > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > +                                  struct vm_area_struct **prev,
> > +                                  unsigned long start, unsigned long end)
> > +{
> > +       BUG();
> > +       return 0;
> > +}
> > +
> >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >                                          unsigned long start,
> >                                          unsigned long end,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b14807b7002e..165c646ddb8f 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -837,6 +837,22 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> >         return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
> >  }
> >
> > +static bool alloc_hpage(gfp_t gfp, int node, struct collapse_control *cc)
> > +{
> > +       VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
> > +
> > +       cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> > +       if (unlikely(!cc->hpage)) {
> > +               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> > +               cc->hpage = ERR_PTR(-ENOMEM);
> > +               return false;
> > +       }
> > +
> > +       prep_transhuge_page(cc->hpage);
> > +       count_vm_event(THP_COLLAPSE_ALLOC);
> > +       return true;
> > +}
> > +
> >  #ifdef CONFIG_NUMA
> >  static int khugepaged_find_target_node(struct collapse_control *cc)
> >  {
> > @@ -882,18 +898,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> >  static bool khugepaged_alloc_page(gfp_t gfp, int node,
> >                                   struct collapse_control *cc)
> >  {
> > -       VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
> > -
> > -       cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> > -       if (unlikely(!cc->hpage)) {
> > -               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> > -               cc->hpage = ERR_PTR(-ENOMEM);
> > -               return false;
> > -       }
> > -
> > -       prep_transhuge_page(cc->hpage);
> > -       count_vm_event(THP_COLLAPSE_ALLOC);
> > -       return true;
> > +       return alloc_hpage(gfp, node, cc);
> >  }
> >  #else
> >  static int khugepaged_find_target_node(struct collapse_control *cc)
> > @@ -2457,3 +2462,141 @@ void khugepaged_min_free_kbytes_update(void)
> >                 set_recommended_min_free_kbytes();
> >         mutex_unlock(&khugepaged_mutex);
> >  }
> > +
> > +static void madvise_collapse_cleanup_page(struct page **hpage)
> > +{
> > +       if (!IS_ERR(*hpage) && *hpage)
> > +               put_page(*hpage);
> > +       *hpage = NULL;
> > +}
> > +
> > +static int madvise_collapse_errno(enum scan_result r)
> > +{
> > +       switch (r) {
> > +       case SCAN_PMD_NULL:
> > +       case SCAN_ADDRESS_RANGE:
> > +       case SCAN_VMA_NULL:
> > +       case SCAN_PTE_NON_PRESENT:
> > +       case SCAN_PAGE_NULL:
> > +               /*
> > +                * Addresses in the specified range are not currently mapped,
> > +                * or are outside the AS of the process.
> > +                */
> > +               return -ENOMEM;
> > +       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +       case SCAN_CGROUP_CHARGE_FAIL:
> > +               /* A kernel resource was temporarily unavailable. */
> > +               return -EAGAIN;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +static int madvise_alloc_charge_hpage(struct mm_struct *mm,
> > +                                     struct collapse_control *cc)
> > +{
> > +       if (!alloc_hpage(GFP_TRANSHUGE, khugepaged_find_target_node(cc), cc))
> > +               return SCAN_ALLOC_HUGE_PAGE_FAIL;
> > +       if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm,
> > +                                      GFP_TRANSHUGE)))
> > +               return SCAN_CGROUP_CHARGE_FAIL;
> > +       count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
> > +       return SCAN_SUCCEED;
> > +}
> > +
> > +int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end)
> > +{
> > +       struct collapse_control cc = {
> > +               .enforce_pte_scan_limits = false,
> > +               .enforce_young = false,
> > +               .last_target_node = NUMA_NO_NODE,
> > +               .hpage = NULL,
> > +               .alloc_charge_hpage = &madvise_alloc_charge_hpage,
> > +       };
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long hstart, hend, addr;
> > +       int thps = 0, nr_hpages = 0, result = SCAN_FAIL;
> > +       bool mmap_locked = true;
> > +
> > +       BUG_ON(vma->vm_start > start);
> > +       BUG_ON(vma->vm_end < end);
> > +
> > +       *prev = vma;
> > +
> > +       if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
> > +               return -EINVAL;
> > +
> > +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> > +       hend = end & HPAGE_PMD_MASK;
> > +       nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
> > +
> > +       if (hstart >= hend || !transparent_hugepage_active(vma))
> > +               return -EINVAL;
> > +
> > +       mmgrab(mm);
> > +       lru_add_drain();
> > +
> > +       for (addr = hstart; ; ) {
> > +               mmap_assert_locked(mm);
> > +               cond_resched();
> > +               result = SCAN_FAIL;
> > +
>
> Made a last minute change to khugepaged_scan_pmd() which no longer
> cleared mmap_locked, so it needs to be reset here. Sorry about that -
> testing was only collapsing 1 huge page so it wasn't caught.
>

Please disregard - I reset it at the end of the loop when reacquiring
the lock; I did something else silly in a merge. Sorry for the noise.

> > +               if (unlikely(khugepaged_test_exit(mm))) {
> > +                       result = SCAN_ANY_PROCESS;
> > +                       break;
> > +               }
> > +
> > +               memset(cc.node_load, 0, sizeof(cc.node_load));
> > +               result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
> > +               if (!mmap_locked)
> > +                       *prev = NULL;  /* tell madvise we dropped mmap_lock */
> > +
> > +               switch (result) {
> > +               /* Whitelisted set of results where continuing OK */
> > +               case SCAN_SUCCEED:
> > +               case SCAN_PMD_MAPPED:
> > +                       ++thps;
> > +                       fallthrough;
> > +               case SCAN_PMD_NULL:
> > +               case SCAN_PTE_NON_PRESENT:
> > +               case SCAN_PTE_UFFD_WP:
> > +               case SCAN_PAGE_RO:
> > +               case SCAN_LACK_REFERENCED_PAGE:
> > +               case SCAN_PAGE_NULL:
> > +               case SCAN_PAGE_COUNT:
> > +               case SCAN_PAGE_LOCK:
> > +               case SCAN_PAGE_COMPOUND:
> > +                       break;
> > +               case SCAN_PAGE_LRU:
> > +                       lru_add_drain_all();
> > +                       goto retry;
> > +               default:
> > +                       /* Other error, exit */
> > +                       goto break_loop;
> > +               }
> > +               addr += HPAGE_PMD_SIZE;
> > +               if (addr >= hend)
> > +                       break;
> > +retry:
> > +               if (!mmap_locked) {
> > +                       mmap_read_lock(mm);
> > +                       mmap_locked = true;
> > +                       result = hugepage_vma_revalidate(mm, addr, &vma);
> > +                       if (result)
> > +                               goto out;
> > +               }
> > +               madvise_collapse_cleanup_page(&cc.hpage);
> > +       }
> > +
> > +break_loop:
> > +       /* madvise_walk_vmas() expects us to hold mmap_lock on return */
> > +       if (!mmap_locked)
> > +               mmap_read_lock(mm);
> > +out:
> > +       mmap_assert_locked(mm);
> > +       madvise_collapse_cleanup_page(&cc.hpage);
> > +       mmdrop(mm);
> > +
> > +       return thps == nr_hpages ? 0 : madvise_collapse_errno(result);
> > +}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 5f4537511532..638517952bd2 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_FREE:
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> > +       case MADV_COLLAPSE:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1054,6 +1055,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_COLLAPSE:
> > +               return madvise_collapse(vma, prev, start, end);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1147,6 +1150,7 @@ madvise_behavior_valid(int behavior)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> > +       case MADV_COLLAPSE:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1336,6 +1340,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > --
> > 2.36.0.464.gb9c8b46e94-goog
> >


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-05-11  0:49   ` Rongwei Wang
  2022-05-11 15:34     ` Zach O'Keefe
  2022-05-12 20:03   ` David Rientjes
  1 sibling, 1 reply; 44+ messages in thread
From: Rongwei Wang @ 2022-05-11  0:49 UTC (permalink / raw)
  To: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Michal Hocko, Pasha Tatashin, Peter Xu,
	SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	calling

Hi, Zach

Thanks for your great patchset!
Recently, we have also been trying to collapse THPs this way, to address
issues like performance degradation from using too many hugepages in our
scenarios.

We have one question about process_madvise(MADV_COLLAPSE) from testing
this patchset: it seems that process_madvise(MADV_COLLAPSE) relies on
madvise(MADV_HUGEPAGE)?  If the VMA isn't marked with 'hg',
process_madvise(MADV_COLLAPSE) fails to collapse.  Please let me know if
I have missed something.

If so, how about introducing process_madvise(MADV_HUGEPAGE) or
process_madvise(MADV_NOHUGEPAGE)?  The former would let another process
mark the target VMA with 'hg', so the collapse could be completed
entirely with outside help; the latter could let certain special VMAs
avoid being collapsed when 'THP=always' is set.
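
Roughly what we have in mind from the caller's side (a sketch only:
MADV_HUGEPAGE support in process_madvise(2) is just our suggestion, the
MADV_COLLAPSE value is taken from this series, and we assume headers new
enough to provide __NR_process_madvise):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* per this series */
#endif

/* Collapse [addr, addr + len) in the process referred to by pidfd. */
static int collapse_remote(int pidfd, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };

	/* Proposed: mark the remote VMA eligible ('hg') first. */
	if (syscall(__NR_process_madvise, pidfd, &iov, 1,
		    MADV_HUGEPAGE, 0) < 0)
		return -1;

	/* Then request the synchronous collapse (needs CAP_SYS_ADMIN). */
	if (syscall(__NR_process_madvise, pidfd, &iov, 1,
		    MADV_COLLAPSE, 0) < 0)
		return -1;

	return 0;
}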

Best regards,
-wrw

On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of its own memory.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>   mm/madvise.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 638517952bd2..08c11217025a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>   }
>   
>   static bool
> -process_madvise_behavior_valid(int behavior)
> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>   {
>   	switch (behavior) {
>   	case MADV_COLD:
>   	case MADV_PAGEOUT:
>   	case MADV_WILLNEED:
>   		return true;
> +	case MADV_COLLAPSE:
> +		return task == current || capable(CAP_SYS_ADMIN);
>   	default:
>   		return false;
>   	}
> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>   		goto free_iov;
>   	}
>   
> -	if (!process_madvise_behavior_valid(behavior)) {
> +	if (!process_madvise_behavior_valid(behavior, task)) {
>   		ret = -EINVAL;
>   		goto release_task;
>   	}


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-11  0:49   ` Rongwei Wang
@ 2022-05-11 15:34     ` Zach O'Keefe
  2022-05-12 15:53       ` Rongwei Wang
  2022-05-12 20:03       ` David Rientjes
  0 siblings, 2 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-11 15:34 UTC (permalink / raw)
  To: Rongwei Wang
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, calling

Hey Rongwei,

Thanks for taking the time to review!

On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
<rongwei.wang@linux.alibaba.com> wrote:
>
> Hi, Zach
>
> Thanks for your great patchset!
> Recently, We also try to collapse THP in this way, likes performance
> degradation due to using too much hugepages in our scenes.
>
> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> something, please let me know.
>

I tried to have MADV_COLLAPSE follow the same THP eligibility
semantics as khugepaged and at-fault: either THP=always, or
THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
out.
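
I.e., with /sys/kernel/mm/transparent_hugepage/enabled set to "madvise",
the expectation in this series is roughly (sketch):

	madvise(addr, len, MADV_HUGEPAGE);		/* make the VMA eligible */
	ret = madvise(addr, len, MADV_COLLAPSE);	/* then collapse */

and MADV_COLLAPSE on its own against an unmarked VMA fails with EINVAL.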

If I understand you correctly, the usefulness of
process_madvise(MADV_COLLAPSE) is limited in the case where
THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse on
behalf of another process, since they don't have a way to mark the
target memory as eligible (which requires VM_HUGEPAGE).

If so, I think that's a valid point, and your suggestion below of a
supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
the sake of exploring all options, I'll mention that there was also a
previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
VM_HUGEPAGE[1].

Since supporting MADV_[NO]HUGEPAGE for process_madvise(2) possibly has
applications beyond a subsequent MADV_COLLAPSE, and since I don't
expect process_madvise(MADV_COLLAPSE) to be in a hot path, I'd vote in
favor of your suggestion and include
process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.

Thanks again for your review and your suggestion!
Zach

[1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/

> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> vma with 'hg', and the collapse process can be finished completely with
> the help of other processes. the latter could let some special vma avoid
> collapsing when setting 'THP=always'.
>
> Best regards,
> -wrw
>
> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of its own memory.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >   mm/madvise.c | 6 ++++--
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 638517952bd2..08c11217025a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> >   }
> >
> >   static bool
> > -process_madvise_behavior_valid(int behavior)
> > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> >   {
> >       switch (behavior) {
> >       case MADV_COLD:
> >       case MADV_PAGEOUT:
> >       case MADV_WILLNEED:
> >               return true;
> > +     case MADV_COLLAPSE:
> > +             return task == current || capable(CAP_SYS_ADMIN);
> >       default:
> >               return false;
> >       }
> > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >               goto free_iov;
> >       }
> >
> > -     if (!process_madvise_behavior_valid(behavior)) {
> > +     if (!process_madvise_behavior_valid(behavior, task)) {
> >               ret = -EINVAL;
> >               goto release_task;
> >       }


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-11 15:34     ` Zach O'Keefe
@ 2022-05-12 15:53       ` Rongwei Wang
  2022-05-12 20:03       ` David Rientjes
  1 sibling, 0 replies; 44+ messages in thread
From: Rongwei Wang @ 2022-05-12 15:53 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, calling



On 5/11/22 11:34 PM, Zach O'Keefe wrote:
> Hey Rongwei,
> 
> Thanks for taking the time to review!
> 
> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
>>
>> Hi, Zach
>>
>> Thanks for your great patchset!
>> Recently, We also try to collapse THP in this way, likes performance
>> degradation due to using too much hugepages in our scenes.
>>
>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
>> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on
>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
>> something, please let me know.
>>
> 
> I tried to have MADV_COLLAPSE follow the same THP eligibility
> semantics as khugepaged and at-fault: either THP=always, or
> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> out.
> 
> If I understand you correctly, the usefulness of
> process_madvise(MADV_COLLAPSE) is limited in the case where
> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse on
> behalf of another process, since they don't have a way to mark the
> target memory as eligible (which requires VM_HUGEPAGE).
Hi Zach

Yes, that's what I meant. It seems that MADV_[NO]HUGEPAGE support for
process_madvise(2) would only need a little code.
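
E.g., on top of patch 10, presumably little more than something like the
following sketch (the new cases are our proposal, not part of this series):

static bool
process_madvise_behavior_valid(int behavior, struct task_struct *task)
{
	switch (behavior) {
	case MADV_COLD:
	case MADV_PAGEOUT:
	case MADV_WILLNEED:
	case MADV_HUGEPAGE:	/* proposed */
	case MADV_NOHUGEPAGE:	/* proposed */
		return true;
	case MADV_COLLAPSE:
		return task == current || capable(CAP_SYS_ADMIN);
	default:
		return false;
	}
}

plus whatever permission checks the new modes turn out to need.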

Anyway, looking forward to your next patchset.

Thanks
-wrw

> 
> If so, I think that's a valid point, and your suggestion below of
> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> the sake of exploring all options, I'll mention that there was also a
> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> VM_HUGEPAGE[1].
> 
> Since it's possible supporting MADV_[NO]HUGEPAGE for
> process_madvise(2) has applications outside a subsequent
> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> be in a hot path, I'd vote in favor of your suggestion and include
> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> 
> Thanks again for your review and your suggestion!
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> 
>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
>> vma with 'hg', and the collapse process can be finished completely with
>> the help of other processes. the latter could let some special vma avoid
>> collapsing when setting 'THP=always'.
>>
>> Best regards,
>> -wrw
>>
>> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
>>> CAP_SYS_ADMIN or is requesting collapse of its own memory.
>>>
>>> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>>> ---
>>>    mm/madvise.c | 6 ++++--
>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 638517952bd2..08c11217025a 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>>>    }
>>>
>>>    static bool
>>> -process_madvise_behavior_valid(int behavior)
>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>>>    {
>>>        switch (behavior) {
>>>        case MADV_COLD:
>>>        case MADV_PAGEOUT:
>>>        case MADV_WILLNEED:
>>>                return true;
>>> +     case MADV_COLLAPSE:
>>> +             return task == current || capable(CAP_SYS_ADMIN);
>>>        default:
>>>                return false;
>>>        }
>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>>                goto free_iov;
>>>        }
>>>
>>> -     if (!process_madvise_behavior_valid(behavior)) {
>>> +     if (!process_madvise_behavior_valid(behavior, task)) {
>>>                ret = -EINVAL;
>>>                goto release_task;
>>>        }


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 02/13] mm/khugepaged: add struct collapse_control
  2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-05-12 20:02   ` David Rientjes
  2022-05-18 20:03   ` Peter Xu
  1 sibling, 0 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> Modularize hugepage collapse by introducing struct collapse_control.
> This structure serves to describe the properties of the requested
> collapse, as well as serve as a local scratch pad to use during the
> collapse itself.
> 
> Start by moving global per-node khugepaged statistics into this
> new structure, and stack allocate one for khugepaged collapse
> context.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-05-04 21:44 ` [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
@ 2022-05-12 20:02   ` David Rientjes
  2022-05-13 18:26     ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> @@ -1069,10 +1067,34 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  	return true;
>  }
>  
> -static void collapse_huge_page(struct mm_struct *mm,
> -				   unsigned long address,
> -				   struct page **hpage,
> -				   int node, int referenced, int unmapped)
> +static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> +			      struct collapse_control *cc)
> +{
> +#ifdef CONFIG_NUMA
> +	const struct cpumask *cpumask;
> +#endif
> +	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +	int node = khugepaged_find_target_node(cc);
> +
> +#ifdef CONFIG_NUMA
> +	/* sched to specified node before huge page memory copy */
> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}
> +#endif
> +	if (!khugepaged_alloc_page(hpage, gfp, node))
> +		return SCAN_ALLOC_HUGE_PAGE_FAIL;
> +	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> +		return SCAN_CGROUP_CHARGE_FAIL;
> +	count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
> +	return SCAN_SUCCEED;
> +}

Lots of ifdefs here, I wonder if we can define a helper function that is a 
no-op when CONFIG_NUMA is disabled that we can call into here instead.
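
Something like the below, perhaps (just a sketch, helper name made up
here), so that the caller can use it unconditionally:

#ifdef CONFIG_NUMA
/* sched to specified node before huge page memory copy */
static void collapse_sched_to_node(int node)
{
	const struct cpumask *cpumask = cpumask_of_node(node);

	if (task_node(current) != node && !cpumask_empty(cpumask))
		set_cpus_allowed_ptr(current, cpumask);
}
#else
static inline void collapse_sched_to_node(int node) { }
#endif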

Otherwise this looks like a nice unification.  After this is cleaned up 
feel free to add

	Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-04 21:44 ` [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
@ 2022-05-12 20:02   ` David Rientjes
  2022-05-13 23:04     ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c94bc43dff3e..6095fcb3f07c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -92,6 +92,10 @@ struct collapse_control {
>  
>  	/* Last target selected in khugepaged_find_target_node() */
>  	int last_target_node;
> +
> +	struct page *hpage;
> +	int (*alloc_charge_hpage)(struct mm_struct *mm,
> +				  struct collapse_control *cc);
>  };
>  
>  /**

Embedding this function pointer into collapse_control seems like it would
need some pretty strong rationale.  Not to say that it should be a 
non-starter, but I think the changelog needs to clearly indicate why this 
is better/cleaner than embedding the needed info for a single allocation 
and charge function to use.  If the callbacks would truly be so different 
that unifying them would be more complex, I think this makes sense.
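
For example, if a single flag in collapse_control were enough to
describe the difference, something like the below (sketch only; the
is_khugepaged field is invented here) might be simpler to follow than a
callback:

static int alloc_charge_hpage(struct mm_struct *mm,
			      struct collapse_control *cc)
{
	gfp_t gfp = cc->is_khugepaged ?
			alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE :
			GFP_TRANSHUGE;
	int node = khugepaged_find_target_node(cc);

	if (!alloc_hpage(gfp, node, cc))
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm, gfp)))
		return SCAN_CGROUP_CHARGE_FAIL;
	count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
	return SCAN_SUCCEED;
}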


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers
  2022-05-04 21:44 ` [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
@ 2022-05-12 20:02   ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> Pipe enum scan_result codes back through return values of functions
> downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to
> inform callers if the operation was successful, and if not, why.
> 
> Since khugepaged_scan_pmd()'s return value already has a specific
> meaning (whether mmap_lock was unlocked or not), add a bool* argument
> to khugepaged_scan_pmd() to retrieve this information.
> 
> Change khugepaged to take action based on the return values of
> khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting
> deep within the collapsing functions themselves.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-05-04 21:44 ` [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
@ 2022-05-12 20:03   ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> Add enforce_pte_scan_limits flag to struct collapse_control that allows
> context to ignore the sysfs-controlled knobs
> khugepaged_max_ptes_[none|swap|shared] and set this flag in khugepaged
> collapse context to preserve existing khugepaged behavior.
> 
> This flag will be used (unset) when introducing madvise collapse
> context since here, the user presumably has reason to believe the
> collapse will be beneficial and khugepaged heuristics shouldn't tell
> the user they are wrong.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
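
In effect, each existing limit check just becomes conditional on the flag;
roughly (sketch only, not the actual hunk):

	if (pte_none(pteval)) {
		if (++none_or_zero > khugepaged_max_ptes_none &&
		    cc->enforce_pte_scan_limits) {
			result = SCAN_EXCEED_NONE_PTE;
			goto out;
		}
	}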

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement
  2022-05-04 21:44 ` [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
@ 2022-05-12 20:03   ` David Rientjes
  2022-05-13 18:17     ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> Add enforce_young flag to struct collapse_control that allows context to
> ignore requirement that some pages in region being collapsed be young or
> referenced.  Set this flag in khugepaged collapse context to preserve
> existing khugepaged behavior.
> 
> This flag will be used (unset) when introducing madvise collapse
> context since here, the user presumably has reason to believe the
> collapse will be beneficial and khugepaged heuristics shouldn't tell
> the user they are wrong.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ca730aec0e3e..b14807b7002e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -90,6 +90,9 @@ struct collapse_control {
>  	/* Respect khugepaged_max_ptes_[none|swap|shared] */
>  	bool enforce_pte_scan_limits;
>  
> +	/* Require memory to be young */
> +	bool enforce_young;
> +
>  	/* Num pages scanned per node */
>  	int node_load[MAX_NUMNODES];
>  

Small nit: is it possible to unify these into a single bool khugepaged 
flag?

Either way:

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions
  2022-05-04 21:44 ` [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
@ 2022-05-12 20:03   ` David Rientjes
  0 siblings, 0 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> The following functions/tracepoints are shared between khugepaged and
> madvise collapse contexts.  Replace the "khugepaged_" prefix with
> generic "hpage_collapse_" prefix in such cases:
> 
> huge_memory:mm_khugepaged_scan_pmd -> huge_memory:mm_hpage_collapse_scan_pmd
> khugepaged_test_exit() -> hpage_collapse_test_exit()
> khugepaged_scan_abort() -> hpage_collapse_scan_abort()
> khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
> khugepaged_find_target_node() -> hpage_collapse_find_target_node()
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
  2022-05-11  0:49   ` Rongwei Wang
@ 2022-05-12 20:03   ` David Rientjes
  1 sibling, 0 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, 4 May 2022, Zach O'Keefe wrote:

> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> CAP_SYS_ADMIN or is requesting collapse of its own memory.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-11 15:34     ` Zach O'Keefe
  2022-05-12 15:53       ` Rongwei Wang
@ 2022-05-12 20:03       ` David Rientjes
  2022-05-13 21:06         ` Zach O'Keefe
  2022-05-16  3:56         ` Rongwei Wang
  1 sibling, 2 replies; 44+ messages in thread
From: David Rientjes @ 2022-05-12 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Rongwei Wang, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, calling

On Wed, 11 May 2022, Zach O'Keefe wrote:

> Hey Rongwei,
> 
> Thanks for taking the time to review!
> 
> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> <rongwei.wang@linux.alibaba.com> wrote:
> >
> > Hi, Zach
> >
> > Thanks for your great patchset!
> > Recently, we have also been trying to collapse THP in this way, as we see
> > issues like performance degradation due to using too many hugepages in
> > our scenarios.
> >

Rongwei, could you elaborate on this?  I can understand undesired overhead 
for allocation of a hugepage at the time of fault if there is not enough 
benefit derived from the hugepages over the long term (like for a database 
workload), is this the performance degradation you're referring to?

Otherwise I'm unfamiliar with performance degradation from using too many 
hugepages after they have been allocated :)  Maybe RSS grows too much and 
we run into memory pressure?

It would be good to know more if you can share details here.

> > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > this patchset. It seems that process_madvise(MADV_COLLAPSE) relies on
> > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > something, please let me know.
> >
> 
> I tried to have MADV_COLLAPSE follow the same THP eligibility
> semantics as khugepaged and at-fault: either THP=always, or
> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> out.
> 
> If I understand you correctly, the usefulness of
> process_madvise(MADV_COLLAPSE) is limited in the case where
> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse on
> behalf of another process since they don't have a way to mark the
> target memory as eligible (which requires VM_HUGEPAGE).
> 
> If so, I think that's a valid point, and your suggestion below of
> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> the sake of exploring all options, I'll mention that there was also a
> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> VM_HUGEPAGE[1].
> 

If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary 
to have to do MADV_HUGEPAGE first, regardless of system-wide settings 
that it may not control?

Same point for a root user doing this on behalf of another user.  It could 
either do MADV_HUGEPAGE and change the behavior that the user perhaps 
requested behind its back (not desired) or it could temporarily set the 
system-wide setting to allow THP always before doing the MADV_COLLAPSE 
(it's root).

We can simply allow MADV_COLLAPSE to always collapse in either context 
regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?

> Since it's possible that supporting MADV_[NO]HUGEPAGE for
> process_madvise(2) has applications outside a subsequent
> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> be in a hot path, I'd vote in favor of your suggestion and include
> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> 
> Thanks again for your review and your suggestion!
> Zach
> 
> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> 
> > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > vma with 'hg', and the collapse process can be finished completely with
> > the help of other processes. the latter could let some special vma avoid
> > collapsing when setting 'THP=always'.
> >
> > Best regards,
> > -wrw
> >
> > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > CAP_SYS_ADMIN or is requesting collapse of its own memory.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >   mm/madvise.c | 6 ++++--
> > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 638517952bd2..08c11217025a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > >   }
> > >
> > >   static bool
> > > -process_madvise_behavior_valid(int behavior)
> > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > >   {
> > >       switch (behavior) {
> > >       case MADV_COLD:
> > >       case MADV_PAGEOUT:
> > >       case MADV_WILLNEED:
> > >               return true;
> > > +     case MADV_COLLAPSE:
> > > +             return task == current || capable(CAP_SYS_ADMIN);
> > >       default:
> > >               return false;
> > >       }
> > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > >               goto free_iov;
> > >       }
> > >
> > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > >               ret = -EINVAL;
> > >               goto release_task;
> > >       }
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement
  2022-05-12 20:03   ` David Rientjes
@ 2022-05-13 18:17     ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-13 18:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Thanks for the review, David!

On Thu, May 12, 2022 at 1:03 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 4 May 2022, Zach O'Keefe wrote:
>
> > Add enforce_young flag to struct collapse_control that allows context to
> > ignore requirement that some pages in region being collapsed be young or
> > referenced.  Set this flag in khugepaged collapse context to preserve
> > existing khugepaged behavior.
> >
> > This flag will be used (unset) when introducing madvise collapse
> > context since here, the user presumably has reason to believe the
> > collapse will be beneficial and khugepaged heuristics shouldn't tell
> > the user they are wrong.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 23 +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ca730aec0e3e..b14807b7002e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -90,6 +90,9 @@ struct collapse_control {
> >       /* Respect khugepaged_max_ptes_[none|swap|shared] */
> >       bool enforce_pte_scan_limits;
> >
> > +     /* Require memory to be young */
> > +     bool enforce_young;
> > +
> >       /* Num pages scanned per node */
> >       int node_load[MAX_NUMNODES];
> >
>
> Small nit: is it possible to unify these into a single bool khugepaged
> flag?
>

I think this is fine with an "enforce_page_heuristics" flag or
something. Of course, functionally unifying the flags is a noop, but I was
trying to avoid "if (khugepaged) .. else .." and, where possible,
describe the variable properties of the collapse in struct
collapse_control itself. If there are no objections, I'll make that
change for v6, roughly as sketched below.
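
Concretely, a minimal sketch of what that unified flag could look like (the
name is still up for debate, and this is not the actual diff):

	struct collapse_control {
		/*
		 * Respect khugepaged_max_ptes_[none|swap|shared] and require
		 * some pages in the region to be young/referenced.  Set in
		 * the khugepaged context, clear for madvise collapse.
		 */
		bool enforce_page_heuristics;

		/* Num pages scanned per node */
		int node_load[MAX_NUMNODES];
		...
	};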

> Either way:
>
> Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging
  2022-05-12 20:02   ` David Rientjes
@ 2022-05-13 18:26     ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-13 18:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 4 May 2022, Zach O'Keefe wrote:
>
> > @@ -1069,10 +1067,34 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >       return true;
> >  }
> >
> > -static void collapse_huge_page(struct mm_struct *mm,
> > -                                unsigned long address,
> > -                                struct page **hpage,
> > -                                int node, int referenced, int unmapped)
> > +static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> > +                           struct collapse_control *cc)
> > +{
> > +#ifdef CONFIG_NUMA
> > +     const struct cpumask *cpumask;
> > +#endif
> > +     gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > +     int node = khugepaged_find_target_node(cc);
> > +
> > +#ifdef CONFIG_NUMA
> > +     /* sched to specified node before huge page memory copy */
> > +     if (task_node(current) != node) {
> > +             cpumask = cpumask_of_node(node);
> > +             if (!cpumask_empty(cpumask))
> > +                     set_cpus_allowed_ptr(current, cpumask);
> > +     }
> > +#endif
> > +     if (!khugepaged_alloc_page(hpage, gfp, node))
> > +             return SCAN_ALLOC_HUGE_PAGE_FAIL;
> > +     if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> > +             return SCAN_CGROUP_CHARGE_FAIL;
> > +     count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
> > +     return SCAN_SUCCEED;
> > +}
>
> Lots of ifdefs here, I wonder if we can define a helper function that is a
> no-op when CONFIG_NUMA is disabled that we can call into here instead.
>

Thanks for the review, David!

So, it seems like the patch that forced some of the #ifdefs,
'mm/khugepaged: sched to numa node when collapse huge page', has been
dropped until some of the kinks have been worked out[1]. When I rebase
on linux-next again for v6, this will be cleaned up.

[1] https://lore.kernel.org/linux-mm/20220317065024.2635069-1-maobibo@loongson.cn/
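
If something like it does survive the rebase, a small helper along the lines
David suggests could hide the ifdefs -- rough sketch only, helper name made up:

	#ifdef CONFIG_NUMA
	/* sched to the target node before the huge page memory copy */
	static void collapse_sched_to_node(int node)
	{
		const struct cpumask *cpumask = cpumask_of_node(node);

		if (task_node(current) != node && !cpumask_empty(cpumask))
			set_cpus_allowed_ptr(current, cpumask);
	}
	#else
	static inline void collapse_sched_to_node(int node) { }
	#endif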

Thanks!
Zach

> Otherwise this looks like a nice unification.  After this is cleaned up
> feel free to add
>
>         Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-12 20:03       ` David Rientjes
@ 2022-05-13 21:06         ` Zach O'Keefe
  2022-05-16  3:56         ` Rongwei Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-13 21:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rongwei Wang, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, calling

On Thu, May 12, 2022 at 1:03 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 11 May 2022, Zach O'Keefe wrote:
>
> > Hey Rongwei,
> >
> > Thanks for taking the time to review!
> >
> > On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
> > <rongwei.wang@linux.alibaba.com> wrote:
> > >
> > > Hi, Zach
> > >
> > > Thanks for your great patchset!
> > > Recently, we have also been trying to collapse THP in this way, as we see
> > > issues like performance degradation due to using too many hugepages in
> > > our scenarios.
> > >
>
> Rongwei, could you elaborate on this?  I can understand undesired overhead
> for allocation of a hugepage at the time of fault if there is not enough
> benefit derived by the hugepages over the long term (like for a database
> workload), is this the performance degradation you're referring to?
>
> Otherwise I'm unfamiliar with performance degradation for using too much
> hugepages after they have been allocated :)  Maybe RSS grows too much and
> we run into memory pressure?
>
> It would be good to know more if you can share details here.
>
> > > And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
> > > this patchset. It seems that process_madvise(MADV_COLLAPSE) relies on
> > > madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
> > > process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
> > > something, please let me know.
> > >
> >
> > I tried to have MADV_COLLAPSE follow the same THP eligibility
> > semantics as khugepaged and at-fault: either THP=always, or
> > THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
> > out.
> >
> > If I understand you correctly, the usefulness of
> > process_madvise(MADV_COLLAPSE) is limited in the case where
> > THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
> > behalf of another process since they don't have a way to mark the
> > target memory as eligible (which requires VM_HUGEPAGE).
> >
> > If so, I think that's a valid point, and your suggestion below of a
> > supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
> > the sake of exploring all options, I'll mention that there was also a
> > previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
> > VM_HUGEPAGE[1].
> >
>
> If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
> to need to do MADV_HUGEPAGE before that regardless of system-wide settings
> that it may not control?
>

If THP=always, this isn't necessary - but yes, if THP=madvise then, as
proposed, we'd need MADV_HUGEPAGE to opt-in to THPs, just like
at-fault and khugepaged.

If MADV_COLLAPSE also sets VM_HUGEPAGE, then in THP=madvise mode, we
also opt-in that range to khugepaged scanning. Most likely this is
what we want anyways (if the collapsed range is split after
MADV_COLLAPSE, khugepaged could come along and fix things) but it does
couple the two more. We can always MADV_NOHUGEPAGE the range after
MADV_COLLAPSE if this was ever a concern, however.

> Same point for a root user doing this on behalf of another user.  It could
> either do MADV_HUGEPAGE and change the behavior that the user perhaps
> requested behind its back (not desired) or it could temporarily set the
> system-wide setting to allow THP always before doing the MADV_COLLAPSE
> (it's root).
>
> We can simply allow MADV_COLLAPSE to always collapse in either context
> regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?
>

I think we need to respect VM_NOHUGEPAGE as it can be set from
non-madvise code, as David H mentioned for kvm on s390 in the PATCH
RFC[1]. It's not clear to me how this would break, however. If
MADV_HUGEPAGE'ing the areas marked by
arch/s390/mm/gmap.c:thp_split_mm() would also break kvm (assuming
subsequent collapse by khugepaged), then MADV_COLLAPSE ignoring
VM_NOHUGEPAGE doesn't really increase the failure space much. Else, I
think we should really be respecting VM_NOHUGEPAGE.

[1] https://lore.kernel.org/linux-mm/30571216-5a6a-7a11-3b2c-77d914025f6d@redhat.com/
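
To make the option being discussed concrete, the eligibility check for
MADV_COLLAPSE under David's proposal might reduce to something like the
following -- purely a sketch, with the helper name invented here:

	/* Honor an explicit opt-out, ignore everything else. */
	static bool madv_collapse_allowed(struct vm_area_struct *vma)
	{
		/* e.g. s390 kvm marks ranges VM_NOHUGEPAGE via gmap */
		if (vma->vm_flags & VM_NOHUGEPAGE)
			return false;
		/* Collapse regardless of VM_HUGEPAGE or the sysfs THP setting. */
		return true;
	}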


> > Since it's possible supporting MADV_[NO]HUGEPAGE for
> > process_madvise(2) has applications outside a subsequent
> > MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
> > be in a hot path, I'd vote in favor of your suggestion and include
> > process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
> >
> > Thanks again for your review and your suggestion!
> > Zach
> >
> > [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
> >
> > > If so, how about introducing process_madvise(MADV_HUGEPAGE) or
> > > process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
> > > vma with 'hg', and the collapse process can be finished completely with
> > > the help of other processes. the latter could let some special vma avoid
> > > collapsing when setting 'THP=always'.
> > >
> > > Best regards,
> > > -wrw
> > >
> > > On 5/5/22 5:44 AM, Zach O'Keefe wrote:
> > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > > > CAP_SYS_ADMIN or is requesting collapse of its own memory.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >   mm/madvise.c | 6 ++++--
> > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 638517952bd2..08c11217025a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
> > > >   }
> > > >
> > > >   static bool
> > > > -process_madvise_behavior_valid(int behavior)
> > > > +process_madvise_behavior_valid(int behavior, struct task_struct *task)
> > > >   {
> > > >       switch (behavior) {
> > > >       case MADV_COLD:
> > > >       case MADV_PAGEOUT:
> > > >       case MADV_WILLNEED:
> > > >               return true;
> > > > +     case MADV_COLLAPSE:
> > > > +             return task == current || capable(CAP_SYS_ADMIN);
> > > >       default:
> > > >               return false;
> > > >       }
> > > > @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > >               goto free_iov;
> > > >       }
> > > >
> > > > -     if (!process_madvise_behavior_valid(behavior)) {
> > > > +     if (!process_madvise_behavior_valid(behavior, task)) {
> > > >               ret = -EINVAL;
> > > >               goto release_task;
> > > >       }
> >


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-12 20:02   ` David Rientjes
@ 2022-05-13 23:04     ` Zach O'Keefe
  2022-05-13 23:17       ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-13 23:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Thanks for the review, David!

On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 4 May 2022, Zach O'Keefe wrote:
>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c94bc43dff3e..6095fcb3f07c 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -92,6 +92,10 @@ struct collapse_control {
> >
> >       /* Last target selected in khugepaged_find_target_node() */
> >       int last_target_node;
> > +
> > +     struct page *hpage;
> > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > +                               struct collapse_control *cc);
> >  };
> >
> >  /**
>
> Embedding this function pointer into collapse_contol seems like it would
> need some pretty strong rationale.  Not to say that it should be a
> non-starter, but I think the changelog needs to clearly indicate why this
> is better/cleaner than embedding the needed info for a single allocation
> and charge function to use.  If the callbacks would truly be so different
> that unifying them would be more complex, I think this makes sense.

Mostly, this boils down to khugepaged having a different allocation
pattern for NUMA vs UMA; the former scans the pages first to determine
the right node, the latter preallocates before scanning. khugepaged
has the luxury on UMA systems of just holding onto a hugepage
indefinitely for the next collapse target.

For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
depend on NUMA or UMA configs. To avoid "if (khugepaged) ... else"
special-casing, defining this as a context-defined operation seemed
appropriate.

Combining allocation and charging was mostly a code
cleanliness decision resulting from not wanting to embed a ->gfp()
hook (gfp flags are used both by allocation and memcg charging).
Alternatively, a .gfp member could exist - it would just need to be
refreshed periodically in the khugepaged codepath.

That all said - let me take another crack at seeing if I can make this
work without the need for a function pointer here.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-13 23:04     ` Zach O'Keefe
@ 2022-05-13 23:17       ` Yang Shi
  2022-05-13 23:55         ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2022-05-13 23:17 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Thanks for the review, David!
>
> On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> >
> > On Wed, 4 May 2022, Zach O'Keefe wrote:
> >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index c94bc43dff3e..6095fcb3f07c 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -92,6 +92,10 @@ struct collapse_control {
> > >
> > >       /* Last target selected in khugepaged_find_target_node() */
> > >       int last_target_node;
> > > +
> > > +     struct page *hpage;
> > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > +                               struct collapse_control *cc);
> > >  };
> > >
> > >  /**
> >
> > Embedding this function pointer into collapse_contol seems like it would
> > need some pretty strong rationale.  Not to say that it should be a
> > non-starter, but I think the changelog needs to clearly indicate why this
> > is better/cleaner than embedding the needed info for a single allocation
> > and charge function to use.  If the callbacks would truly be so different
> > that unifying them would be more complex, I think this makes sense.
>
> Mostly, this boils down to khugepaged having different a allocation
> pattern for NUMA/UMA ; the former scans the pages first to determine
> the right node, the latter preallocates before scanning. khugepaged
> has the luxury on UMA systems of just holding onto a hugepage
> indefinitely for the next collapse target.
>
> For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> else" casing, defining this as a context-defined operation seemed
> appropriate.
>
> Collapsing both alloc and charging together was mostly a code
> cleanliness decision resulting from not wanting to embed a ->gfp()
> hook (gfp flags are used both by allocation and memcg charging).
> Alternatively, a .gfp member could exist - it would just need to be
> refreshed periodically in the khugepaged codepath.
>
> That all said - let me take another crack at seeing if I can make this
> work without the need for a function pointer here.

I had a patch that removed UMA allocation, please refer to
https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t

It didn't make it upstream due to some requests for further cleanup, but
unfortunately I haven't had time to look into it yet.

If this patch were merged, would that make your life easier?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-13 23:17       ` Yang Shi
@ 2022-05-13 23:55         ` Zach O'Keefe
  2022-05-17 17:18           ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-13 23:55 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Fri, May 13, 2022 at 4:17 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Thanks for the review, David!
> >
> > On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> > >
> > > On Wed, 4 May 2022, Zach O'Keefe wrote:
> > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index c94bc43dff3e..6095fcb3f07c 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -92,6 +92,10 @@ struct collapse_control {
> > > >
> > > >       /* Last target selected in khugepaged_find_target_node() */
> > > >       int last_target_node;
> > > > +
> > > > +     struct page *hpage;
> > > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > > +                               struct collapse_control *cc);
> > > >  };
> > > >
> > > >  /**
> > >
> > > Embedding this function pointer into collapse_contol seems like it would
> > > need some pretty strong rationale.  Not to say that it should be a
> > > non-starter, but I think the changelog needs to clearly indicate why this
> > > is better/cleaner than embedding the needed info for a single allocation
> > > and charge function to use.  If the callbacks would truly be so different
> > > that unifying them would be more complex, I think this makes sense.
> >
> > Mostly, this boils down to khugepaged having different a allocation
> > pattern for NUMA/UMA ; the former scans the pages first to determine
> > the right node, the latter preallocates before scanning. khugepaged
> > has the luxury on UMA systems of just holding onto a hugepage
> > indefinitely for the next collapse target.
> >
> > For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> > depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> > else" casing, defining this as a context-defined operation seemed
> > appropriate.
> >
> > Collapsing both alloc and charging together was mostly a code
> > cleanliness decision resulting from not wanting to embed a ->gfp()
> > hook (gfp flags are used both by allocation and memcg charging).
> > Alternatively, a .gfp member could exist - it would just need to be
> > refreshed periodically in the khugepaged codepath.
> >
> > That all said - let me take another crack at seeing if I can make this
> > work without the need for a function pointer here.
>
> I had a patch that removed UMA allocation, please refer to
> https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t
>
> It was not made upstream due to some requests for further cleanup, but
> unfortunately I haven't got time to look into it yet.
>
> If this patch were merged, would that make your life easier?

Hey Yang,

First, sorry for missing that patch in the first place. I actually
have some patches queued up that do a similar cleanup of
khugepaged_prealloc_page(), as was mentioned there, but decided not to
include them here.

Second, removing the NUMA/UMA story does make this patch easier, I
think (esp since the sched change was dropped for now). This is
something I wanted while writing this series, but without the larger
context referenced in your patch (most users don't build NUMA=n even
on single node systems, and the pcp hugepage lists optimization) I
couldn't justify it myself.

Best,
Zach


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-05-12 20:03       ` David Rientjes
  2022-05-13 21:06         ` Zach O'Keefe
@ 2022-05-16  3:56         ` Rongwei Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Rongwei Wang @ 2022-05-16  3:56 UTC (permalink / raw)
  To: David Rientjes, Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, Matthew Wilcox, Michal Hocko,
	Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, calling



On 5/13/22 4:03 AM, David Rientjes wrote:
> On Wed, 11 May 2022, Zach O'Keefe wrote:
> 
>> Hey Rongwei,
>>
>> Thanks for taking the time to review!
>>
>> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang
>> <rongwei.wang@linux.alibaba.com> wrote:
>>>
>>> Hi, Zach
>>>
>>> Thanks for your great patchset!
>>> Recently, we have also been trying to collapse THP in this way, as we see
>>> issues like performance degradation due to using too many hugepages in
>>> our scenarios.
>>>
> 
> Rongwei, could you elaborate on this?  I can understand undesired overhead
> for allocation of a hugepage at the time of fault if there is not enough
> benefit derived by the hugepages over the long term (like for a database
> workload), is this the performance degradation you're referring to?
> 
Hi David

Sorry for the late reply.

Actually, I'm not sure that the degradation is caused by using too many 
hugepages. Plus, I also referred to Google's doc[1], and the TLB pressure 
mentioned there. Maybe process_madvise(MADV_COLLAPSE) can help us to 
solve these issues.

[1] 
https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/CSCADS%202018%20perf_events%20status%20update.pdf

-wrw

> Otherwise I'm unfamiliar with performance degradation for using too much
> hugepages after they have been allocated :)  Maybe RSS grows too much and
> we run into memory pressure?
> 
> It would be good to know more if you can share details here.
> 
>>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test
>>> this patchset. It seems that process_madvise(MADV_COLLAPSE) relies on
>>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg',
>>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss
>>> something, please let me know.
>>>
>>
>> I tried to have MADV_COLLAPSE follow the same THP eligibility
>> semantics as khugepaged and at-fault: either THP=always, or
>> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point
>> out.
>>
>> If I understand you correctly, the usefulness of
>> process_madvise(MADV_COLLAPSE) is limited in the case where
>> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of
>> behalf of another process since they don't have a way to mark the
>> target memory as eligible (which requires VM_HUGEPAGE).
>>
>> If so, I think that's a valid point, and your suggestion below of a
>> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For
>> the sake of exploring all options, I'll mention that there was also a
>> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set
>> VM_HUGEPAGE[1].
>>
> 
> If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary
> to need to do MADV_HUGEPAGE before that regardless of system-wide settings
> that it may not control?
> 
> Same point for a root user doing this on behalf of another user.  It could
> either do MADV_HUGEPAGE and change the behavior that the user perhaps
> requested behind its back (not desired) or it could temporarily set the
> system-wide setting to allow THP always before doing the MADV_COLLAPSE
> (it's root).
> 
> We can simply allow MADV_COLLAPSE to always collapse in either context
> regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings?
> 
>> Since it's possible supporting MADV_[NO]HUGEPAGE for
>> process_madvise(2) has applications outside a subsequent
>> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to
>> be in a hot path, I'd vote in favor of your suggestion and include
>> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object.
>>
>> Thanks again for your review and your suggestion!
>> Zach
>>
>> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/
>>
>>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or
>>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target
>>> vma with 'hg', and the collapse process can be finished completely with
>>> the help of other processes. the latter could let some special vma avoid
>>> collapsing when setting 'THP=always'.
>>>
>>> Best regards,
>>> -wrw
>>>
>>> On 5/5/22 5:44 AM, Zach O'Keefe wrote:
>>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
>>>> CAP_SYS_ADMIN or is requesting collapse of its own memory.
>>>>
>>>> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>>>> ---
>>>>    mm/madvise.c | 6 ++++--
>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 638517952bd2..08c11217025a 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
>>>>    }
>>>>
>>>>    static bool
>>>> -process_madvise_behavior_valid(int behavior)
>>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task)
>>>>    {
>>>>        switch (behavior) {
>>>>        case MADV_COLD:
>>>>        case MADV_PAGEOUT:
>>>>        case MADV_WILLNEED:
>>>>                return true;
>>>> +     case MADV_COLLAPSE:
>>>> +             return task == current || capable(CAP_SYS_ADMIN);
>>>>        default:
>>>>                return false;
>>>>        }
>>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>>>                goto free_iov;
>>>>        }
>>>>
>>>> -     if (!process_madvise_behavior_valid(behavior)) {
>>>> +     if (!process_madvise_behavior_valid(behavior, task)) {
>>>>                ret = -EINVAL;
>>>>                goto release_task;
>>>>        }
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-13 23:55         ` Zach O'Keefe
@ 2022-05-17 17:18           ` Yang Shi
  2022-05-17 22:35             ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2022-05-17 17:18 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Fri, May 13, 2022 at 4:56 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Fri, May 13, 2022 at 4:17 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > Thanks for the review, David!
> > >
> > > On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> > > >
> > > > On Wed, 4 May 2022, Zach O'Keefe wrote:
> > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index c94bc43dff3e..6095fcb3f07c 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -92,6 +92,10 @@ struct collapse_control {
> > > > >
> > > > >       /* Last target selected in khugepaged_find_target_node() */
> > > > >       int last_target_node;
> > > > > +
> > > > > +     struct page *hpage;
> > > > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > > > +                               struct collapse_control *cc);
> > > > >  };
> > > > >
> > > > >  /**
> > > >
> > > > Embedding this function pointer into collapse_contol seems like it would
> > > > need some pretty strong rationale.  Not to say that it should be a
> > > > non-starter, but I think the changelog needs to clearly indicate why this
> > > > is better/cleaner than embedding the needed info for a single allocation
> > > > and charge function to use.  If the callbacks would truly be so different
> > > > that unifying them would be more complex, I think this makes sense.
> > >
> > > Mostly, this boils down to khugepaged having different a allocation
> > > pattern for NUMA/UMA ; the former scans the pages first to determine
> > > the right node, the latter preallocates before scanning. khugepaged
> > > has the luxury on UMA systems of just holding onto a hugepage
> > > indefinitely for the next collapse target.
> > >
> > > For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> > > depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> > > else" casing, defining this as a context-defined operation seemed
> > > appropriate.
> > >
> > > Collapsing both alloc and charging together was mostly a code
> > > cleanliness decision resulting from not wanting to embed a ->gfp()
> > > hook (gfp flags are used both by allocation and memcg charging).
> > > Alternatively, a .gfp member could exist - it would just need to be
> > > refreshed periodically in the khugepaged codepath.
> > >
> > > That all said - let me take another crack at seeing if I can make this
> > > work without the need for a function pointer here.
> >
> > I had a patch that removed UMA allocation, please refer to
> > https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t
> >
> > It was not made upstream due to some requests for further cleanup, but
> > unfortunately I haven't got time to look into it yet.
> >
> > If this page were merged, would that make your life easier?
>
> Hey Yang,
>
> First, sorry for missing that patch in the first place. I actually
> have some patches queued up that do a similar cleanup of
> khugepaged_prealloc_page() that was mentioned, but decided to not
> include them here.
>
> Second, removing the NUMA/UMA story does make this patch easier, I
> think (esp since the sched change was dropped for now). This is
> something I wanted while writing this series, but without the larger
> context referenced in your patch (most users don't build NUMA=n even
> on single node systems, and the pcp hugepage lists optimization)
> couldn't justify myself.

Thanks, it would be better to add this patch into your series as a
prerequisite so that you could make the MADV_COLLAPSE change easier.

I don't think I will be able to find time to rework the patch and
address all the review comments any time soon. If you'd like to take
it over, please do.

>
> Best,
> Zach


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-17 17:18           ` Yang Shi
@ 2022-05-17 22:35             ` Zach O'Keefe
  2022-05-25 17:58               ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-17 22:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, May 17, 2022 at 10:51 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, May 13, 2022 at 4:56 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Fri, May 13, 2022 at 4:17 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > Thanks for the review, David!
> > > >
> > > > On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> > > > >
> > > > > On Wed, 4 May 2022, Zach O'Keefe wrote:
> > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index c94bc43dff3e..6095fcb3f07c 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -92,6 +92,10 @@ struct collapse_control {
> > > > > >
> > > > > >       /* Last target selected in khugepaged_find_target_node() */
> > > > > >       int last_target_node;
> > > > > > +
> > > > > > +     struct page *hpage;
> > > > > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > > > > +                               struct collapse_control *cc);
> > > > > >  };
> > > > > >
> > > > > >  /**
> > > > >
> > > > > Embedding this function pointer into collapse_contol seems like it would
> > > > > need some pretty strong rationale.  Not to say that it should be a
> > > > > non-starter, but I think the changelog needs to clearly indicate why this
> > > > > is better/cleaner than embedding the needed info for a single allocation
> > > > > and charge function to use.  If the callbacks would truly be so different
> > > > > that unifying them would be more complex, I think this makes sense.
> > > >
> > > > Mostly, this boils down to khugepaged having different a allocation
> > > > pattern for NUMA/UMA ; the former scans the pages first to determine
> > > > the right node, the latter preallocates before scanning. khugepaged
> > > > has the luxury on UMA systems of just holding onto a hugepage
> > > > indefinitely for the next collapse target.
> > > >
> > > > For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> > > > depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> > > > else" casing, defining this as a context-defined operation seemed
> > > > appropriate.
> > > >
> > > > Collapsing both alloc and charging together was mostly a code
> > > > cleanliness decision resulting from not wanting to embed a ->gfp()
> > > > hook (gfp flags are used both by allocation and memcg charging).
> > > > Alternatively, a .gfp member could exist - it would just need to be
> > > > refreshed periodically in the khugepaged codepath.
> > > >
> > > > That all said - let me take another crack at seeing if I can make this
> > > > work without the need for a function pointer here.
> > >
> > > I had a patch that removed UMA allocation, please refer to
> > > https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t
> > >
> > > It was not made upstream due to some requests for further cleanup, but
> > > unfortunately I haven't got time to look into it yet.
> > >
> > > If this page were merged, would that make your life easier?
> >
> > Hey Yang,
> >
> > First, sorry for missing that patch in the first place. I actually
> > have some patches queued up that do a similar cleanup of
> > khugepaged_prealloc_page() that was mentioned, but decided to not
> > include them here.
> >
> > Second, removing the NUMA/UMA story does make this patch easier, I
> > think (esp since the sched change was dropped for now). This is
> > something I wanted while writing this series, but without the larger
> > context referenced in your patch (most users don't build NUMA=n even
> > on single node systems, and the pcp hugepage lists optimization)
> > couldn't justify myself.
>
> Thanks, it would be better to add this patch into your series as a
> prerequisite so that you could make the MADV_COLLAPSE easier.
>
> I don't think I will be able to find time to rework the patch and
> solve all the review comments at any time soon. If you'd like to take
> it, please do it.
>

Sounds good, Yang. I'll take a crack at it for v6.

Thanks for taking the time,
Zach

> >
> > Best,
> > Zach


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-05-18 18:41   ` Peter Xu
  2022-05-19 21:06     ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2022-05-18 18:41 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +				   unsigned long address,
> +				   pmd_t **pmd)
> +{
> +	pmd_t pmde;
> +
> +	*pmd = mm_find_pmd_raw(mm, address);
> +	if (!*pmd)
> +		return SCAN_PMD_NULL;
> +
> +	pmde = pmd_read_atomic(*pmd);

It seems correct to use the atomic fetcher here.  Though irrelevant
to this patchset... does it also mean that we're missing that in mm_find_pmd()?
I mean a separate fix like this one:

---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index 69416072b1a6..61309718640f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
         * without holding anon_vma lock for write.  So when looking for a
         * genuine pmde (in which to find pte), test present and !THP together.
         */
-       pmde = *pmd;
+       pmde = pmd_read_atomic(pmd);
        barrier();
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                pmd = NULL;
---8<---

Otherwise it seems it's also prone to PAE race conditions when reading the
pmd out, but I could be missing something.

> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +	barrier();
> +#endif
> +	if (!pmd_present(pmde))
> +		return SCAN_PMD_NULL;
> +	if (pmd_trans_huge(pmde))
> +		return SCAN_PMD_MAPPED;

Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
that; frankly, I don't really know what the major cause of a bad pmd is
(either software bugs or corrupted memory), but just checking with you,
because potentially a bad pmd could be read as SCAN_SUCCEED and slip through.

> +	return SCAN_SUCCEED;
> +}
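
For concreteness, the extra check I have in mind would slot in roughly like
this (sketch only; whether it returns SCAN_PMD_NULL or a dedicated code is up
to you):

	if (!pmd_present(pmde))
		return SCAN_PMD_NULL;
	if (pmd_trans_huge(pmde))
		return SCAN_PMD_MAPPED;
	if (unlikely(pmd_bad(pmde)))
		return SCAN_PMD_NULL;	/* or e.g. a new SCAN_PMD_BAD */
	return SCAN_SUCCEED;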

The rest looks good to me, thanks.

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 02/13] mm/khugepaged: add struct collapse_control
  2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
  2022-05-12 20:02   ` David Rientjes
@ 2022-05-18 20:03   ` Peter Xu
  2022-05-18 20:11     ` Zach O'Keefe
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Xu @ 2022-05-18 20:03 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, May 04, 2022 at 02:44:26PM -0700, Zach O'Keefe wrote:
> Modularize hugepage collapse by introducing struct collapse_control.
> This structure serves to describe the properties of the requested
> collapse, as well as serve as a local scratch pad to use during the
> collapse itself.
> 
> Start by moving global per-node khugepaged statistics into this
> new structure, and stack allocate one for khugepaged collapse
> context.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

There are already some UMA changes that seem to be unneeded if you're
going with Yang's UMA patch as a prerequisite.  And I also see that there are
some other changes to settle and rebase (e.g. -mm temporarily removed the
khugepaged patch to move the thread around in the next patch), so I figured
maybe I'll just stop here and read your next version.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 02/13] mm/khugepaged: add struct collapse_control
  2022-05-18 20:03   ` Peter Xu
@ 2022-05-18 20:11     ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-18 20:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, May 18, 2022 at 1:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 02:44:26PM -0700, Zach O'Keefe wrote:
> > Modularize hugepage collapse by introducing struct collapse_control.
> > This structure serves to describe the properties of the requested
> > collapse, as well as serve as a local scratch pad to use during the
> > collapse itself.
> >
> > Start by moving global per-node khugepaged statistics into this
> > new structure, and stack allocate one for khugepaged collapse
> > context.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> There are already some UMA changes that seem to be unneeded if you're
> going with Yang's UMA patch as a prerequisite.  And I also see that there
> are some other changes to settle and rebase (e.g. -mm temporarily removed
> the khugepaged patch to move the thread around in the next patch), so I
> figured I'll just stop here and read your next version.
>

Yep that makes sense to me - I should be getting v6 out
today/tomorrow. Thanks for taking the time!

Best,
Zach

> Thanks,
>
> --
> Peter Xu
>



* Re: [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-05-18 18:41   ` Peter Xu
@ 2022-05-19 21:06     ` Zach O'Keefe
  2022-05-20  1:12       ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-19 21:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

Thanks again for the review, Peter.

On Wed, May 18, 2022 at 11:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +                                unsigned long address,
> > +                                pmd_t **pmd)
> > +{
> > +     pmd_t pmde;
> > +
> > +     *pmd = mm_find_pmd_raw(mm, address);
> > +     if (!*pmd)
> > +             return SCAN_PMD_NULL;
> > +
> > +     pmde = pmd_read_atomic(*pmd);
>
> It seems to be correct on using the atomic fetcher here.  Though irrelevant
> to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
> meant a separate fix like this one:
>
> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 69416072b1a6..61309718640f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>          * without holding anon_vma lock for write.  So when looking for a
>          * genuine pmde (in which to find pte), test present and !THP together.
>          */
> -       pmde = *pmd;
> +       pmde = pmd_read_atomic(pmd);
>         barrier();
>         if (!pmd_present(pmde) || pmd_trans_huge(pmde))
>                 pmd = NULL;
> ---8<---
>
> As otherwise it seems it's also prone to PAE race conditions when reading
> pmd out, but I could be missing something.
>

This is a good question. I took some time to look into this, but it's
very complicated and unfortunately I couldn't reach a conclusion in
the time I allotted to myself. My working (unverified) assumption is
that mm_find_pmd() is called in places where it doesn't care if the
pmd isn't read atomically. If so, does that also mean MADV_COLLAPSE is
safe? I'm not sure. These i386 PAE + THP racing issues were most
recently discussed when considering if READ_ONCE() should be used
instead of pmd_read_atomic() [1].

[1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@google.com/

> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +     barrier();
> > +#endif
> > +     if (!pmd_present(pmde))
> > +             return SCAN_PMD_NULL;
> > +     if (pmd_trans_huge(pmde))
> > +             return SCAN_PMD_MAPPED;
>
> Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
> that, frankly I don't really know what's the major cause of a bad pmd
> (either software bugs or corrupted mem), but just to check with you,
> because potentially a bad pmd can be read as SCAN_SUCCEED and go through.
>

Likewise, I'm not sure what the cause of "bad pmds" is.

Do you mean to check pmd_bad() instead of pmd_trans_huge()? I.e. b/c a
pmd-mapped thp counts as "bad" (at least on x86 since PSE set) or do
you mean to additionally check pmd_bad() after the pmd_trans_huge()
check?

If it's the former, I'd say we can't claim !pmd_bad() == memory
already backed by thps / our job here is done.

If it's the latter, I don't see it hurting much (but I can't argue
intelligently about why it's needed) and can include the check in v6.

Thanks again,
Zach

> > +     return SCAN_SUCCEED;
> > +}
>
> The rest looks good to me, thanks.
>
> --
> Peter Xu
>



* Re: [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-05-19 21:06     ` Zach O'Keefe
@ 2022-05-20  1:12       ` Peter Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Xu @ 2022-05-20  1:12 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Thu, May 19, 2022 at 02:06:25PM -0700, Zach O'Keefe wrote:
> Thanks again for the review, Peter.
> 
> On Wed, May 18, 2022 at 11:41 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +                                unsigned long address,
> > > +                                pmd_t **pmd)
> > > +{
> > > +     pmd_t pmde;
> > > +
> > > +     *pmd = mm_find_pmd_raw(mm, address);
> > > +     if (!*pmd)
> > > +             return SCAN_PMD_NULL;
> > > +
> > > +     pmde = pmd_read_atomic(*pmd);
> >
> > It seems to be correct on using the atomic fetcher here.  Though irrelevant
> > to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
> > meant a separate fix like this one:
> >
> > ---8<---
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 69416072b1a6..61309718640f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >          * without holding anon_vma lock for write.  So when looking for a
> >          * genuine pmde (in which to find pte), test present and !THP together.
> >          */
> > -       pmde = *pmd;
> > +       pmde = pmd_read_atomic(pmd);
> >         barrier();
> >         if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> >                 pmd = NULL;
> > ---8<---
> >
> > As otherwise it seems it's also prone to PAE race conditions when reading
> > pmd out, but I could be missing something.
> >
> 
> This is a good question. I took some time to look into this, but it's
> very complicated and unfortunately I couldn't reach a conclusion in
> the time I allotted to myself. My working (unverified) assumption is
> that mm_find_pmd() is called in places where it doesn't care if the
> pmd isn't read atomically.

Frankly I am still not sure about that.  Take mm_find_pmd(): the check
immediately after the read is pmd_trans_huge(), and AFAICT that is:

static inline int pmd_trans_huge(pmd_t pmd)
{
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

Where we have:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4	58	/* available for programmer */

I think it means we're checking both bit 7 (in the lower 32 bits) and bit 58
(in the upper 32 bits) of the whole 64-bit pmd on PAE, and I can't explain how
that wouldn't require atomicity.

pmd_read_atomic() plays a trick whereby we either:

  - get an atomic result for none or pgtable pmds (which is stable), or
  - can get a non-atomic result for THPs (unstable).

However, here we're checking exactly for THPs.  I think we don't need a
stable PFN, but we do need stable _PAGE_PSE|_PAGE_DEVMAP bits.
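
To make the tearing hazard concrete, here is a contrived standalone sketch
(an editorial illustration, not kernel code; the bit positions mirror the
definitions quoted above) of how a non-atomic 64-bit read that mixes halves
of two different pmd values can pass the pmd_trans_huge()-style check even
though neither value would:

#include <stdint.h>
#include <stdio.h>

#define PSE    (1ULL << 7)     /* _PAGE_BIT_PSE: lives in the low 32 bits    */
#define DEVMAP (1ULL << 58)    /* _PAGE_BIT_DEVMAP: lives in the high 32 bits */

/* same shape as the pmd_trans_huge() predicate quoted above */
static int trans_huge(uint64_t pmd)
{
	return (pmd & (PSE | DEVMAP)) == PSE;
}

int main(void)
{
	uint64_t devmap_pmd = PSE | DEVMAP;  /* a devmap pmd: not a THP     */
	uint64_t none_pmd = 0;               /* the pmd after being cleared */

	/* torn read: low half seen before the clear, high half after it */
	uint64_t torn = (devmap_pmd & 0xffffffffULL) |
			(none_pmd & ~0xffffffffULL);

	/* prints "0 0 1": only the torn value looks like a THP */
	printf("%d %d %d\n", trans_huge(devmap_pmd), trans_huge(none_pmd),
	       trans_huge(torn));
	return 0;
}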

It should not have been like that when pmd_read_atomic() was introduced back
in 2012 (26c191788f18129), and AFAIU DEVMAP came in 2016+.  IOW I have a
feeling that pmd_read_atomic() used to be bug-free, but after devmap it's
harder to justify, at least it seems so to me.

So far I don't see a clean way to fix it other than using atomic64_read(),
because pmd_read_atomic() isn't atomic for THP checks using pmd_trans_huge()
AFAICT.  However, the last piece of the puzzle is that atomic64_read() fails
somehow on Xen, and that's where the pmd_read_atomic() trick of reading the
lower then the upper 32 bits came from:

    commit e4eed03fd06578571c01d4f1478c874bb432c815
    Author: Andrea Arcangeli <aarcange@redhat.com>
    Date:   Wed Jun 20 12:52:57 2012 -0700

    thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE
    
    In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the
    mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under
    Xen.

I believe Andrea had a solid reason to do so at the time, but I'm still not
sure why Xen won't work with atomic64_read(), since it's not mentioned in the
commit message.

Please go ahead with your patchset without being blocked by this, because
AFAICT pmd_read_atomic() is already better than pmde = *pmdp anyway, and it
seems to be a more general problem if it exists at all - or I could also have
missed something.

> If so, does that also mean MADV_COLLAPSE is
> safe? I'm not sure. These i386 PAE + THP racing issues were most
> recently discussed when considering if READ_ONCE() should be used
> instead of pmd_read_atomic() [1].
> 
> [1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@google.com/
> 
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +     /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +     barrier();
> > > +#endif
> > > +     if (!pmd_present(pmde))
> > > +             return SCAN_PMD_NULL;
> > > +     if (pmd_trans_huge(pmde))
> > > +             return SCAN_PMD_MAPPED;
> >
> > Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
> > that, frankly I don't really know what's the major cause of a bad pmd
> > (either software bugs or corrupted mem), but just to check with you,
> > because potentially a bad pmd can be read as SCAN_SUCCEED and go through.
> >
> 
> Likewise, I'm not sure what the cause of "bad pmds" is.
> 
> Do you mean to check pmd_bad() instead of pmd_trans_huge()? I.e. b/c a
> pmd-mapped thp counts as "bad" (at least on x86 since PSE set) or do
> you mean to additionally check pmd_bad() after the pmd_trans_huge()
> check?
> 
> If it's the former, I'd say we can't claim !pmd_bad() == memory
> already backed by thps / our job here is done.
> 
> If it's the latter, I don't see it hurting much (but I can't argue
> intelligently about why it's needed) and can include the check in v6.

The latter.  I don't think it's strictly necessary, but it looks good to have
if you also agree.
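
Concretely, that would amount to one extra test in find_pmd_or_thp_or_none(),
along these lines (a sketch of the idea for v6, not the final patch; returning
SCAN_PMD_NULL for a bad pmd is an assumption here):

	if (!pmd_present(pmde))
		return SCAN_PMD_NULL;
	if (pmd_trans_huge(pmde))
		return SCAN_PMD_MAPPED;
	if (pmd_bad(pmde))
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;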

Thanks,

-- 
Peter Xu




* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-17 22:35             ` Zach O'Keefe
@ 2022-05-25 17:58               ` Yang Shi
  2022-05-25 18:27                 ` Zach O'Keefe
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2022-05-25 17:58 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, May 17, 2022 at 3:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, May 17, 2022 at 10:51 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, May 13, 2022 at 4:56 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > On Fri, May 13, 2022 at 4:17 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > Thanks for the review, David!
> > > > >
> > > > > On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> > > > > >
> > > > > > On Wed, 4 May 2022, Zach O'Keefe wrote:
> > > > > >
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index c94bc43dff3e..6095fcb3f07c 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -92,6 +92,10 @@ struct collapse_control {
> > > > > > >
> > > > > > >       /* Last target selected in khugepaged_find_target_node() */
> > > > > > >       int last_target_node;
> > > > > > > +
> > > > > > > +     struct page *hpage;
> > > > > > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > > > > > +                               struct collapse_control *cc);
> > > > > > >  };
> > > > > > >
> > > > > > >  /**
> > > > > >
> > > > > > Embedding this function pointer into collapse_control seems like it would
> > > > > > need some pretty strong rationale.  Not to say that it should be a
> > > > > > non-starter, but I think the changelog needs to clearly indicate why this
> > > > > > is better/cleaner than embedding the needed info for a single allocation
> > > > > > and charge function to use.  If the callbacks would truly be so different
> > > > > > that unifying them would be more complex, I think this makes sense.
> > > > >
> > > > > Mostly, this boils down to khugepaged having a different allocation
> > > > > pattern for NUMA/UMA; the former scans the pages first to determine
> > > > > the right node, the latter preallocates before scanning. khugepaged
> > > > > has the luxury on UMA systems of just holding onto a hugepage
> > > > > indefinitely for the next collapse target.
> > > > >
> > > > > For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> > > > > depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> > > > > else" casing, defining this as a context-defined operation seemed
> > > > > appropriate.
> > > > >
> > > > > Collapsing both alloc and charging together was mostly a code
> > > > > cleanliness decision resulting from not wanting to embed a ->gfp()
> > > > > hook (gfp flags are used both by allocation and memcg charging).
> > > > > Alternatively, a .gfp member could exist - it would just need to be
> > > > > refreshed periodically in the khugepaged codepath.
> > > > >
> > > > > That all said - let me take another crack at seeing if I can make this
> > > > > work without the need for a function pointer here.
> > > >
> > > > I had a patch that removed UMA allocation, please refer to
> > > > https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t
> > > >
> > > > It was not made upstream due to some requests for further cleanup, but
> > > > unfortunately I haven't got time to look into it yet.
> > > >
> > > > If this patch were merged, would that make your life easier?
> > >
> > > Hey Yang,
> > >
> > > First, sorry for missing that patch in the first place. I actually
> > > have some patches queued up that do a similar cleanup of
> > > khugepaged_prealloc_page() that was mentioned, but decided to not
> > > include them here.
> > >
> > > Second, removing the NUMA/UMA story does make this patch easier, I
> > > think (esp since the sched change was dropped for now). This is
> > > something I wanted while writing this series, but without the larger
> > > context referenced in your patch (most users don't build NUMA=n even
> > > on single-node systems, and the pcp hugepage lists optimization), I
> > > couldn't justify it myself.
> >
> > Thanks, it would be better to add this patch into your series as a
> > prerequisite so that you could make the MADV_COLLAPSE easier.
> >
> > I don't think I will be able to find time to rework the patch and
> > solve all the review comments at any time soon. If you'd like to take
> > it, please do it.
> >
>
> Sounds good, Yang. I'll take a crack at it for v6.

Hi Zach,

Fortunately I got some time.  If you haven't spent too much effort on this,
I could rework the patch and resubmit - just let me know.

>
> Thanks for taking the time,
> Zach
>
> > >
> > > Best,
> > > Zach
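
To illustrate the allocation-pattern trade-off quoted above (khugepaged picks
a target node after scanning and, on UMA, may hold on to a preallocated
hugepage, while MADV_COLLAPSE always allocates on demand), here is a contrived
standalone sketch of the per-context hook dispatch; every name below is a
hypothetical stand-in, not the kernel implementation:

#include <stdbool.h>
#include <stdio.h>

struct collapse_control {
	int last_target_node;
	bool have_hpage;	/* stand-in for struct page *hpage */
	int (*alloc_charge_hpage)(struct collapse_control *cc, int node);
};

/* khugepaged flavour: may keep holding a previously allocated hugepage */
static int khugepaged_alloc_charge_hpage(struct collapse_control *cc, int node)
{
	cc->last_target_node = node;
	if (!cc->have_hpage)
		cc->have_hpage = true;	/* pretend: allocate + memcg charge */
	return 0;
}

/* MADV_COLLAPSE flavour: never preallocates, always allocates fresh */
static int madvise_alloc_charge_hpage(struct collapse_control *cc, int node)
{
	cc->last_target_node = node;
	cc->have_hpage = true;		/* pretend: allocate + memcg charge */
	return 0;
}

/* shared collapse path: calls through the hook, no "if (khugepaged)" branch */
static int collapse_huge_page(struct collapse_control *cc, int node)
{
	return cc->alloc_charge_hpage(cc, node);
}

int main(void)
{
	struct collapse_control khugepaged_cc = {
		.alloc_charge_hpage = khugepaged_alloc_charge_hpage,
	};
	struct collapse_control madvise_cc = {
		.alloc_charge_hpage = madvise_alloc_charge_hpage,
	};

	printf("khugepaged: %d\n", collapse_huge_page(&khugepaged_cc, 0));
	printf("madvise:    %d\n", collapse_huge_page(&madvise_cc, 0));
	return 0;
}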



* Re: [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific
  2022-05-25 17:58               ` Yang Shi
@ 2022-05-25 18:27                 ` Zach O'Keefe
  0 siblings, 0 replies; 44+ messages in thread
From: Zach O'Keefe @ 2022-05-25 18:27 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, May 25, 2022 at 10:59 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 17, 2022 at 3:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Tue, May 17, 2022 at 10:51 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Fri, May 13, 2022 at 4:56 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > On Fri, May 13, 2022 at 4:17 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Fri, May 13, 2022 at 4:05 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > >
> > > > > > Thanks for the review, David!
> > > > > >
> > > > > > On Thu, May 12, 2022 at 1:02 PM David Rientjes <rientjes@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, 4 May 2022, Zach O'Keefe wrote:
> > > > > > >
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index c94bc43dff3e..6095fcb3f07c 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -92,6 +92,10 @@ struct collapse_control {
> > > > > > > >
> > > > > > > >       /* Last target selected in khugepaged_find_target_node() */
> > > > > > > >       int last_target_node;
> > > > > > > > +
> > > > > > > > +     struct page *hpage;
> > > > > > > > +     int (*alloc_charge_hpage)(struct mm_struct *mm,
> > > > > > > > +                               struct collapse_control *cc);
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  /**
> > > > > > >
> > > > > > > Embedding this function pointer into collapse_control seems like it would
> > > > > > > need some pretty strong rationale.  Not to say that it should be a
> > > > > > > non-starter, but I think the changelog needs to clearly indicate why this
> > > > > > > is better/cleaner than embedding the needed info for a single allocation
> > > > > > > and charge function to use.  If the callbacks would truly be so different
> > > > > > > that unifying them would be more complex, I think this makes sense.
> > > > > >
> > > > > > Mostly, this boils down to khugepaged having a different allocation
> > > > > > pattern for NUMA/UMA; the former scans the pages first to determine
> > > > > > the right node, the latter preallocates before scanning. khugepaged
> > > > > > has the luxury on UMA systems of just holding onto a hugepage
> > > > > > indefinitely for the next collapse target.
> > > > > >
> > > > > > For MADV_COLLAPSE, we never preallocate, and so its pattern doesn't
> > > > > > depend on NUMA or UMA configs. Trying to avoid "if (khugepaged) ...
> > > > > > else" casing, defining this as a context-defined operation seemed
> > > > > > appropriate.
> > > > > >
> > > > > > Collapsing both alloc and charging together was mostly a code
> > > > > > cleanliness decision resulting from not wanting to embed a ->gfp()
> > > > > > hook (gfp flags are used both by allocation and memcg charging).
> > > > > > Alternatively, a .gfp member could exist - it would just need to be
> > > > > > refreshed periodically in the khugepaged codepath.
> > > > > >
> > > > > > That all said - let me take another crack at seeing if I can make this
> > > > > > work without the need for a function pointer here.
> > > > >
> > > > > I had a patch that removed UMA allocation, please refer to
> > > > > https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@gmail.com/#t
> > > > >
> > > > > It was not made upstream due to some requests for further cleanup, but
> > > > > unfortunately I haven't got time to look into it yet.
> > > > >
> > > > > If this patch were merged, would that make your life easier?
> > > >
> > > > Hey Yang,
> > > >
> > > > First, sorry for missing that patch in the first place. I actually
> > > > have some patches queued up that do a similar cleanup of
> > > > khugepaged_prealloc_page() that was mentioned, but decided to not
> > > > include them here.
> > > >
> > > > Second, removing the NUMA/UMA story does make this patch easier, I
> > > > think (esp since the sched change was dropped for now). This is
> > > > something I wanted while writing this series, but without the larger
> > > > context referenced in your patch (most users don't build NUMA=n even
> > > > on single-node systems, and the pcp hugepage lists optimization), I
> > > > couldn't justify it myself.
> > >
> > > Thanks, it would be better to add this patch into your series as a
> > > prerequisite so that you could make the MADV_COLLAPSE easier.
> > >
> > > I don't think I will be able to find time to rework the patch and
> > > solve all the review comments at any time soon. If you'd like to take
> > > it, please do it.
> > >
> >
> > Sounds good, Yang. I'll take a crack at it for v6.
>
> Hi Zach,
>
> Fortunately I got some time.  If you haven't spent too much effort on this,
> I could rework the patch and resubmit - just let me know.
>

Hey Yang,

Thanks for letting me know. In my v6 series, I actually ended up
incorporating this work *afterwards*. Reason being there are some
patches in my series, like "mm/khugepaged: pipe enum scan_result codes
back to callers" which makes cleanup of khugepaged_scan_mm_slot() and
khugepaged_scan_mm_slot() easier, IMO[1].

I'll privately send what I have planned for v6 and we can discuss.

Thanks,
Zach

[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-6-zokeefe@google.com/


> >
> > Thanks for taking the time,
> > Zach
> >
> > > >
> > > > Best,
> > > > Zach



end of thread

Thread overview: 44+ messages
2022-05-04 21:44 [PATCH v5 00/12] mm: userspace hugepage collapse Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 01/13] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-05-18 18:41   ` Peter Xu
2022-05-19 21:06     ` Zach O'Keefe
2022-05-20  1:12       ` Peter Xu
2022-05-04 21:44 ` [PATCH v5 02/13] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-18 20:03   ` Peter Xu
2022-05-18 20:11     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 03/13] mm/khugepaged: dedup and simplify hugepage alloc and charging Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-13 18:26     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 04/13] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-13 23:04     ` Zach O'Keefe
2022-05-13 23:17       ` Yang Shi
2022-05-13 23:55         ` Zach O'Keefe
2022-05-17 17:18           ` Yang Shi
2022-05-17 22:35             ` Zach O'Keefe
2022-05-25 17:58               ` Yang Shi
2022-05-25 18:27                 ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 05/13] mm/khugepaged: pipe enum scan_result codes back to callers Zach O'Keefe
2022-05-12 20:02   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 06/13] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 07/13] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-13 18:17     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-05-05 18:50   ` Zach O'Keefe
2022-05-05 18:58     ` Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 09/13] mm/khugepaged: rename prefix of shared collapse functions Zach O'Keefe
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-05-11  0:49   ` Rongwei Wang
2022-05-11 15:34     ` Zach O'Keefe
2022-05-12 15:53       ` Rongwei Wang
2022-05-12 20:03       ` David Rientjes
2022-05-13 21:06         ` Zach O'Keefe
2022-05-16  3:56         ` Rongwei Wang
2022-05-12 20:03   ` David Rientjes
2022-05-04 21:44 ` [PATCH v5 11/13] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 12/13] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-05-04 21:44 ` [PATCH v5 13/13] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
