* [PATCH v3 00/12] mm: userspace hugepage collapse
@ 2022-04-26 14:44 Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
                   ` (12 more replies)
  0 siblings, 13 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

	Performs a synchronous collapse of the native pages
	mapped by the list of iovecs into transparent hugepages.

	The system-wide THP sysfs settings as well as the VMA flags of the
	memory range being collapsed are both respected for the purposes
	of determining THP eligibility.

	THP allocation may enter direct reclaim and/or compaction.

	When a range spans multiple VMAs, the semantics of the collapse
	over each VMA are independent of the others.

	Caller must have CAP_SYS_ADMIN if not acting on self.

	Return value follows existing process_madvise(2) conventions.  A
	“success” indicates that all hugepage-sized/aligned regions
	covered by the provided range were either successfully
	collapsed, or were already pmd-mapped THPs.
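
	For illustration, a minimal sketch (not part of this series) of
	driving a collapse in another process.  It assumes the
	MADV_COLLAPSE value proposed in patch 5 of this series and uses
	raw syscall numbers in case libc wrappers are unavailable; the
	helper name is purely illustrative.

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value proposed in this series */
	#endif

	/*
	 * Collapse one range in the target process.  Needs CAP_SYS_ADMIN
	 * when not acting on self.  A full-length return from
	 * process_madvise() means every hugepage-aligned/sized region
	 * covered by [addr, addr + len) is now pmd-mapped.
	 */
	static int collapse_remote(pid_t pid, void *addr, size_t len)
	{
		struct iovec iov = { .iov_base = addr, .iov_len = len };
		int pidfd = syscall(SYS_pidfd_open, pid, 0);
		ssize_t ret;

		if (pidfd < 0)
			return -1;
		ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
			      MADV_COLLAPSE, 0);
		close(pidfd);
		return ret == (ssize_t)len ? 0 : -1;
	}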

madvise(2)

	Equivalent to process_madvise(2) on self, with 0 returned on
	“success”.
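
	The same request as a self-collapse sketch via madvise(2), again
	assuming the MADV_COLLAPSE definition from patch 5 is applied and
	that THP is enabled for the region:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value proposed in this series */
	#endif

	#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MiB PMD-sized THPs */

	int main(void)
	{
		void *buf;

		/* Hugepage-aligned, hugepage-sized anonymous region. */
		if (posix_memalign(&buf, HPAGE_SIZE, HPAGE_SIZE))
			return 1;
		/* Mark the VMA eligible under the "madvise" sysfs setting. */
		madvise(buf, HPAGE_SIZE, MADV_HUGEPAGE);
		memset(buf, 1, HPAGE_SIZE);	/* fault in native pages */

		/* Synchronous collapse; 0 on "success" as defined above. */
		if (madvise(buf, HPAGE_SIZE, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");
		return 0;
	}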

Current Use-Cases
--------------------------------

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which can keep services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, and losing both increases the
	steady state memory footprint.  With MADV_COLLAPSE, we get the best
	of both worlds: peak upfront performance and lower RAM footprints.
	Note that subsequent support for file-backed memory is required
	here.

(2)	malloc() implementations that manage memory in hugepage-sized
	chunks, but sometimes subrelease memory back to the system in
	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later, when
	the memory is hot, the implementation could madvise(MADV_COLLAPSE)
	to re-back the memory by THPs to regain hugepage coverage and dTLB
	performance.  TCMalloc is such an implementation that could benefit
	from this[3].  A prior study of Google internal workloads during
	evaluation of Temeraire, a hugepage-aware enhancement to TCMalloc,
	showed that nearly 20% of all cpu cycles were spent in dTLB stalls,
	and that increasing hugepage coverage by even a small amount can
	help with that[4] (see the sketch below).
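
	A minimal sketch of this subrelease/re-back pattern, with
	illustrative helper names and the same assumed MADV_COLLAPSE
	definition as above:

	#include <stddef.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value proposed in this series */
	#endif

	/*
	 * Return one native-sized chunk of a hugepage-backed span to the
	 * kernel; this zaps its ptes and splits the pmd mapping.
	 */
	static void subrelease_chunk(void *chunk, size_t len)
	{
		madvise(chunk, len, MADV_DONTNEED);
	}

	/*
	 * Later, when the span is hot again, spend our own CPU to re-back
	 * it with a THP and recover dTLB performance.
	 */
	static void reback_span(void *span, size_t len)
	{
		madvise(span, len, MADV_COLLAPSE);
	}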

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[5].

Sequence of Patches
--------------------------------

Patches 1-4 perform refactoring of collapse logic within khugepaged.c
and introduce the notion of a collapse context.

Patches 5-9 introduce MADV_COLLAPSE, do some renaming, add support so
that the MADV_COLLAPSE context has the eligibility and allocation
semantics referenced above, and add process_madvise(2) support.

Patches 10-12 add selftests to test the new functionality.

Applies against next-20220426.

Changelog
--------------------------------

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  are now independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     revalidation.
  -> Reworded patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are entirely
  new or reworked.
* Collapse eligibility criteria have changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage allocation
  now match that of khugepaged for the same VMA, instead of the gfp flags used
  at-fault for calling process for the VMA.
* Collapse semantics have changed: the collapse semantics for multiple VMAs
  spanning a single MADV_COLLAPSE call are now independent, whereas before the
  idea was to allow direct reclaim/compaction if any spanned VMA permitted it.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of hugepages
  in a batched fashion with the aim of doing multiple page table updates inside
  a single mmap_lock write.  This has been changed, and the implementation now
  collapses each hugepage-aligned/sized region iteratively.  This was motivated
  by an experiment which showed that, when multiple threads were concurrently
  faulting during a MADV_COLLAPSE operation, mean and tail latency to acquire
  mmap_lock in read for threads in the fault path was improved by using a batch
  size of 1 (batch sizes of 1, 8, 16, 32 were tested)[6].
* Added: If a collapse operation fails because a page isn't found on the LRU, do
  a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/
[3] https://github.com/google/tcmalloc/tree/master/tcmalloc
[4] https://research.google/pubs/pub50370/
[5] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[6] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/

Zach O'Keefe (13):
  mm/khugepaged: separate hugepage preallocation and cleanup
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: make hugepage allocation context-specific
  mm/khugepaged: add struct collapse_result
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: remove khugepaged prefix from shared collapse functions
  mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  mm/khugepaged: add flag to ignore page young/referenced requirement
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add test to verify recollapse of THPs

 include/linux/huge_mm.h                 |  12 +
 include/trace/events/huge_memory.h      |   5 +-
 include/uapi/asm-generic/mman-common.h  |   2 +
 mm/internal.h                           |   1 +
 mm/khugepaged.c                         | 598 ++++++++++++++++--------
 mm/madvise.c                            |  11 +-
 mm/rmap.c                               |  15 +-
 tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
 8 files changed, 702 insertions(+), 359 deletions(-)

-- 
2.35.1.1178.g4f1659d476-goog




* [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27  0:26   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe, kernel test robot

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages.  This is required by MADV_COLLAPSE, which needs to know which
hugepage-aligned/sized regions are already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/trace/events/huge_memory.h |  3 ++-
 mm/internal.h                      |  1 +
 mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
 mm/rmap.c                          | 15 +++++++++++++--
 4 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d651f3437367..9faa678e0a5b 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -33,7 +33,8 @@
 	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
 	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
 	EM( SCAN_TRUNCATED,		"truncated")			\
-	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
+	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
+	EMe(SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 
 #undef EM
 #undef EMe
diff --git a/mm/internal.h b/mm/internal.h
index 0667abd57634..51ae9f71a2a3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -172,6 +172,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ba8dbd1825da..2933b13fc975 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -51,6 +51,7 @@ enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_PMD_MAPPED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -987,6 +988,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	pmd_t pmde;
+
+	*pmd = mm_find_pmd_raw(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	pmde = pmd_read_atomic(*pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde) || pmd_none(pmde))
+		return SCAN_PMD_NULL;
+	if (pmd_trans_huge(pmde))
+		return SCAN_PMD_MAPPED;
+	return SCAN_SUCCEED;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1238,11 +1262,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED)
 		goto out;
-	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 61e63db5dc6f..49817f35e65c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 02/12] mm/khugepaged: add struct collapse_control
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27 19:49   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize hugepage collapse by introducing struct collapse_control.
This structure describes the properties of the requested collapse and
also serves as a local scratchpad used during the collapse itself.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 79 ++++++++++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2933b13fc975..9d42fa330812 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() for this scan */
+	int last_target_node;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -796,9 +804,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -810,11 +816,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -829,28 +835,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++) {
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
+		}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 
@@ -888,7 +894,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1248,7 +1254,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage)
+			       struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1266,7 +1273,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1332,16 +1339,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1392,7 +1399,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
 				referenced, unmapped);
@@ -2044,7 +2051,8 @@ static void collapse_file(struct mm_struct *mm,
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2055,7 +2063,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2080,11 +2088,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -2117,7 +2125,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2126,7 +2134,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2137,7 +2146,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2213,12 +2223,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2274,7 +2284,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2298,7 +2308,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2337,12 +2347,15 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+	};
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&cc);
 		khugepaged_wait_work();
 	}
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27 20:04   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe, kernel test robot

Add hugepage allocation context to struct collapse_control, allowing
different collapse contexts to allocate hugepages differently.  For
example, khugepaged decides to allocate differently in NUMA and UMA
configurations, and other collapse contexts shouldn't be coupled to this
decision.  Likewise for the gfp flags used for said allocation.

Additionally, move the [pre]allocated hugepage pointer into
struct collapse_control.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 mm/khugepaged.c | 102 ++++++++++++++++++++++++------------------------
 1 file changed, 52 insertions(+), 50 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9d42fa330812..c4962191d6e1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,6 +92,11 @@ struct collapse_control {
 
 	/* Last target selected in khugepaged_find_target_node() for this scan */
 	int last_target_node;
+
+	struct page *hpage;
+	gfp_t (*gfp)(void);
+	struct page* (*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
+				    int node);
 };
 
 /**
@@ -877,21 +882,21 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static struct page *khugepaged_alloc_page(struct collapse_control *cc,
+					  gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
 
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!*hpage)) {
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		*hpage = ERR_PTR(-ENOMEM);
+		cc->hpage = ERR_PTR(-ENOMEM);
 		return NULL;
 	}
 
-	prep_transhuge_page(*hpage);
+	prep_transhuge_page(cc->hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
-	return *hpage;
+	return cc->hpage;
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -953,12 +958,12 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 	return true;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static struct page *khugepaged_alloc_page(struct collapse_control *cc,
+					  gfp_t gfp, int node)
 {
-	VM_BUG_ON(!*hpage);
+	VM_BUG_ON(!cc->hpage);
 
-	return  *hpage;
+	return cc->hpage;
 }
 #endif
 
@@ -1080,10 +1085,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			       struct collapse_control *cc, int referenced,
+			       int unmapped)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1096,11 +1100,12 @@ static void collapse_huge_page(struct mm_struct *mm,
 	struct mmu_notifier_range range;
 	gfp_t gfp;
 	const struct cpumask *cpumask;
+	int node;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	gfp = cc->gfp() | __GFP_THISNODE;
 
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
@@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	mmap_read_unlock(mm);
 
+	node = khugepaged_find_target_node(cc);
 	/* sched to specified node before huage page memory copy */
 	if (task_node(current) != node) {
 		cpumask = cpumask_of_node(node);
 		if (!cpumask_empty(cpumask))
 			set_cpus_allowed_ptr(current, cpumask);
 	}
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
+	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out_nolock;
@@ -1238,15 +1244,15 @@ static void collapse_huge_page(struct mm_struct *mm,
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 
-	*hpage = NULL;
+	cc->hpage = NULL;
 
 	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
 }
@@ -1254,7 +1260,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage,
 			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1399,10 +1404,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		collapse_huge_page(mm, address, cc, referenced, unmapped);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1667,8 +1670,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @mm: process address space where collapse happens
  * @file: file that collapse on
  * @start: collapse start address
- * @hpage: new allocated huge page for collapse
- * @node: appointed node the new huge page allocate from
+ * @cc: collapse context and scratchpad
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1686,8 +1688,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + unlock and free huge page;
  */
 static void collapse_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start,
-		struct page **hpage, int node)
+			  struct file *file, pgoff_t start,
+			  struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
 	gfp_t gfp;
@@ -1697,15 +1699,16 @@ static void collapse_file(struct mm_struct *mm,
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
 	bool is_shmem = shmem_file(file);
-	int nr;
+	int nr, node;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
 	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	gfp = cc->gfp() | __GFP_THISNODE;
+	node = khugepaged_find_target_node(cc);
 
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
+	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out;
@@ -1998,7 +2001,7 @@ static void collapse_file(struct mm_struct *mm,
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
 		retract_page_tables(mapping, start);
-		*hpage = NULL;
+		cc->hpage = NULL;
 
 		khugepaged_pages_collapsed++;
 	} else {
@@ -2045,14 +2048,14 @@ static void collapse_file(struct mm_struct *mm,
 	unlock_page(new_page);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (!IS_ERR_OR_NULL(*hpage))
-		mem_cgroup_uncharge(page_folio(*hpage));
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		mem_cgroup_uncharge(page_folio(cc->hpage));
 	/* TODO: tracepoints */
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage,
-		struct collapse_control *cc)
+				 struct file *file, pgoff_t start,
+				 struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2125,8 +2128,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
-			collapse_file(mm, file, start, hpage, node);
+			collapse_file(mm, file, start, cc);
 		}
 	}
 
@@ -2134,8 +2136,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage,
-		struct collapse_control *cc)
+				 struct file *file, pgoff_t start,
+				 struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2146,7 +2148,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
@@ -2223,12 +2224,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
+				khugepaged_scan_file(mm, file, pgoff, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address,
-						hpage, cc);
+						khugepaged_scan.address, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2286,15 +2286,15 @@ static int khugepaged_wait_event(void)
 
 static void khugepaged_do_scan(struct collapse_control *cc)
 {
-	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 	bool wait = true;
 
+	cc->hpage = NULL;
 	lru_add_drain_all();
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
+		if (!khugepaged_prealloc_page(&cc->hpage, &wait))
 			break;
 
 		cond_resched();
@@ -2308,14 +2308,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage, cc);
+							    cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
 	}
 
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
+	if (!IS_ERR_OR_NULL(cc->hpage))
+		put_page(cc->hpage);
 }
 
 static bool khugepaged_should_wakeup(void)
@@ -2349,6 +2349,8 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.last_target_node = NUMA_NO_NODE,
+		.gfp = &alloc_hugepage_khugepaged_gfpmask,
+		.alloc_hpage = &khugepaged_alloc_page,
 	};
 
 	set_freezable();
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 04/12] mm/khugepaged: add struct collapse_result
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27 20:47   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add struct collapse_result which aggregates data from a single
khugepaged_scan_pmd() or khugepaged_scan_file() request.  Change
khugepaged to take action based on this returned data instead of deep
within the collapsing functions themselves.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 187 ++++++++++++++++++++++++++----------------------
 1 file changed, 101 insertions(+), 86 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c4962191d6e1..0e4f5fbe00d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,6 +99,14 @@ struct collapse_control {
 				    int node);
 };
 
+/* Gather information from one khugepaged_scan_[pmd|file]() request */
+struct collapse_result {
+	enum scan_result result;
+
+	/* Was mmap_lock dropped during request? */
+	bool dropped_mmap_lock;
+};
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -743,13 +751,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 						    referenced, writable, result);
-		return 1;
+		return SCAN_SUCCEED;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
 					    referenced, writable, result);
-	return 0;
+	return result;
 }
 
 static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
@@ -1087,7 +1095,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 
 static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			       struct collapse_control *cc, int referenced,
-			       int unmapped)
+			       int unmapped, struct collapse_result *cr)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1095,7 +1103,6 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pgtable_t pgtable;
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 	gfp_t gfp;
@@ -1103,6 +1110,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	int node;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	cr->result = SCAN_FAIL;
 
 	/* Only allocate from the target node */
 	gfp = cc->gfp() | __GFP_THISNODE;
@@ -1114,6 +1122,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+	cr->dropped_mmap_lock = true;
 
 	node = khugepaged_find_target_node(cc);
 	/* sched to specified node before huage page memory copy */
@@ -1124,26 +1133,26 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	}
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		cr->result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out_nolock;
 	}
 
 	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+		cr->result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out_nolock;
 	}
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result) {
+	cr->result = hugepage_vma_revalidate(mm, address, &vma);
+	if (cr->result) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
-		result = SCAN_PMD_NULL;
+		cr->result = SCAN_PMD_NULL;
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -1166,8 +1175,8 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result)
+	cr->result = hugepage_vma_revalidate(mm, address, &vma);
+	if (cr->result)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	if (mm_find_pmd(mm, address) != pmd)
@@ -1194,11 +1203,11 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+	cr->result =  __collapse_huge_page_isolate(vma, address, pte,
+						   &compound_pagelist);
 	spin_unlock(pte_ptl);
 
-	if (unlikely(!isolated)) {
+	if (unlikely(cr->result != SCAN_SUCCEED)) {
 		pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
@@ -1210,7 +1219,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
-		result = SCAN_FAIL;
+		cr->result = SCAN_FAIL;
 		goto out_up_write;
 	}
 
@@ -1246,25 +1255,25 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	cc->hpage = NULL;
 
-	khugepaged_pages_collapsed++;
-	result = SCAN_SUCCEED;
+	cr->result = SCAN_SUCCEED;
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
 	if (!IS_ERR_OR_NULL(cc->hpage))
 		mem_cgroup_uncharge(page_folio(cc->hpage));
-	trace_mm_collapse_huge_page(mm, isolated, result);
+	trace_mm_collapse_huge_page(mm, cr->result == SCAN_SUCCEED, cr->result);
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct collapse_control *cc)
+static void khugepaged_scan_pmd(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				unsigned long address,
+				struct collapse_control *cc,
+				struct collapse_result *cr)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
+	int referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
@@ -1273,9 +1282,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	bool writable = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	cr->result = SCAN_FAIL;
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
-	if (result != SCAN_SUCCEED)
+	cr->result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (cr->result != SCAN_SUCCEED)
 		goto out;
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
@@ -1291,12 +1301,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 				 * comment below for pte_uffd_wp().
 				 */
 				if (pte_swp_uffd_wp(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
+					cr->result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
 				continue;
 			} else {
-				result = SCAN_EXCEED_SWAP_PTE;
+				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
@@ -1306,7 +1316,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			    ++none_or_zero <= khugepaged_max_ptes_none) {
 				continue;
 			} else {
-				result = SCAN_EXCEED_NONE_PTE;
+				cr->result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
@@ -1321,7 +1331,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			 * userfault messages that falls outside of
 			 * the registered range.  So, just be simple.
 			 */
-			result = SCAN_PTE_UFFD_WP;
+			cr->result = SCAN_PTE_UFFD_WP;
 			goto out_unmap;
 		}
 		if (pte_write(pteval))
@@ -1329,13 +1339,13 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page)) {
-			result = SCAN_PAGE_NULL;
+			cr->result = SCAN_PAGE_NULL;
 			goto out_unmap;
 		}
 
 		if (page_mapcount(page) > 1 &&
 				++shared > khugepaged_max_ptes_shared) {
-			result = SCAN_EXCEED_SHARED_PTE;
+			cr->result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
 		}
@@ -1350,20 +1360,20 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 */
 		node = page_to_nid(page);
 		if (khugepaged_scan_abort(node, cc)) {
-			result = SCAN_SCAN_ABORT;
+			cr->result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
 		cc->node_load[node]++;
 		if (!PageLRU(page)) {
-			result = SCAN_PAGE_LRU;
+			cr->result = SCAN_PAGE_LRU;
 			goto out_unmap;
 		}
 		if (PageLocked(page)) {
-			result = SCAN_PAGE_LOCK;
+			cr->result = SCAN_PAGE_LOCK;
 			goto out_unmap;
 		}
 		if (!PageAnon(page)) {
-			result = SCAN_PAGE_ANON;
+			cr->result = SCAN_PAGE_ANON;
 			goto out_unmap;
 		}
 
@@ -1385,7 +1395,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * will be done again later the risk seems low.
 		 */
 		if (!is_refcount_suitable(page)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
 		if (pte_young(pteval) ||
@@ -1394,23 +1404,20 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			referenced++;
 	}
 	if (!writable) {
-		result = SCAN_PAGE_RO;
+		cr->result = SCAN_PAGE_RO;
 	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
-		result = SCAN_LACK_REFERENCED_PAGE;
+		cr->result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
-		result = SCAN_SUCCEED;
-		ret = 1;
+		cr->result = SCAN_SUCCEED;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+	if (cr->result == SCAN_SUCCEED)
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, cc, referenced, unmapped);
-	}
+		collapse_huge_page(mm, address, cc, referenced, unmapped, cr);
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, result, unmapped);
-	return ret;
+				     none_or_zero, cr->result, unmapped);
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1671,6 +1678,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  * @file: file that collapse on
  * @start: collapse start address
  * @cc: collapse context and scratchpad
+ * @cr: aggregate result information of collapse
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
@@ -1689,7 +1697,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  */
 static void collapse_file(struct mm_struct *mm,
 			  struct file *file, pgoff_t start,
-			  struct collapse_control *cc)
+			  struct collapse_control *cc,
+			  struct collapse_result *cr)
+
 {
 	struct address_space *mapping = file->f_mapping;
 	gfp_t gfp;
@@ -1697,25 +1707,27 @@ static void collapse_file(struct mm_struct *mm,
 	pgoff_t index, end = start + HPAGE_PMD_NR;
 	LIST_HEAD(pagelist);
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
-	int nr_none = 0, result = SCAN_SUCCEED;
+	int nr_none = 0;
 	bool is_shmem = shmem_file(file);
 	int nr, node;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
+	cr->result = SCAN_SUCCEED;
+
 	/* Only allocate from the target node */
 	gfp = cc->gfp() | __GFP_THISNODE;
 	node = khugepaged_find_target_node(cc);
 
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		cr->result = SCAN_ALLOC_HUGE_PAGE_FAIL;
 		goto out;
 	}
 
 	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
+		cr->result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out;
 	}
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
@@ -1731,7 +1743,7 @@ static void collapse_file(struct mm_struct *mm,
 			break;
 		xas_unlock_irq(&xas);
 		if (!xas_nomem(&xas, GFP_KERNEL)) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out;
 		}
 	} while (1);
@@ -1762,13 +1774,13 @@ static void collapse_file(struct mm_struct *mm,
 				 */
 				if (index == start) {
 					if (!xas_next_entry(&xas, end - 1)) {
-						result = SCAN_TRUNCATED;
+						cr->result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
 					xas_set(&xas, index);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_locked;
 				}
 				xas_store(&xas, new_page);
@@ -1781,14 +1793,14 @@ static void collapse_file(struct mm_struct *mm,
 				/* swap in or instantiate fallocated page */
 				if (shmem_getpage(mapping->host, index, &page,
 						  SGP_NOALLOC)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);
 			} else {
-				result = SCAN_PAGE_LOCK;
+				cr->result = SCAN_PAGE_LOCK;
 				goto xa_locked;
 			}
 		} else {	/* !is_shmem */
@@ -1801,7 +1813,7 @@ static void collapse_file(struct mm_struct *mm,
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
 				if (unlikely(page == NULL)) {
-					result = SCAN_FAIL;
+					cr->result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
 			} else if (PageDirty(page)) {
@@ -1820,17 +1832,17 @@ static void collapse_file(struct mm_struct *mm,
 				 */
 				xas_unlock_irq(&xas);
 				filemap_flush(mapping);
-				result = SCAN_FAIL;
+				cr->result = SCAN_FAIL;
 				goto xa_unlocked;
 			} else if (PageWriteback(page)) {
 				xas_unlock_irq(&xas);
-				result = SCAN_FAIL;
+				cr->result = SCAN_FAIL;
 				goto xa_unlocked;
 			} else if (trylock_page(page)) {
 				get_page(page);
 				xas_unlock_irq(&xas);
 			} else {
-				result = SCAN_PAGE_LOCK;
+				cr->result = SCAN_PAGE_LOCK;
 				goto xa_locked;
 			}
 		}
@@ -1843,7 +1855,7 @@ static void collapse_file(struct mm_struct *mm,
 
 		/* make sure the page is up to date */
 		if (unlikely(!PageUptodate(page))) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out_unlock;
 		}
 
@@ -1852,12 +1864,12 @@ static void collapse_file(struct mm_struct *mm,
 		 * we locked the first page, then a THP might be there already.
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			cr->result = SCAN_PAGE_COMPOUND;
 			goto out_unlock;
 		}
 
 		if (page_mapping(page) != mapping) {
-			result = SCAN_TRUNCATED;
+			cr->result = SCAN_TRUNCATED;
 			goto out_unlock;
 		}
 
@@ -1868,18 +1880,18 @@ static void collapse_file(struct mm_struct *mm,
 			 * page is dirty because it hasn't been flushed
 			 * since first write.
 			 */
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			goto out_unlock;
 		}
 
 		if (isolate_lru_page(page)) {
-			result = SCAN_DEL_PAGE_LRU;
+			cr->result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
 		}
 
 		if (page_has_private(page) &&
 		    !try_to_release_page(page, GFP_KERNEL)) {
-			result = SCAN_PAGE_HAS_PRIVATE;
+			cr->result = SCAN_PAGE_HAS_PRIVATE;
 			putback_lru_page(page);
 			goto out_unlock;
 		}
@@ -1900,7 +1912,7 @@ static void collapse_file(struct mm_struct *mm,
 		 *  - one from isolate_lru_page;
 		 */
 		if (!page_ref_freeze(page, 3)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			xas_unlock_irq(&xas);
 			putback_lru_page(page);
 			goto out_unlock;
@@ -1935,7 +1947,7 @@ static void collapse_file(struct mm_struct *mm,
 		 */
 		smp_mb();
 		if (inode_is_open_for_write(mapping->host)) {
-			result = SCAN_FAIL;
+			cr->result = SCAN_FAIL;
 			__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
 			filemap_nr_thps_dec(mapping);
 			goto xa_locked;
@@ -1962,7 +1974,7 @@ static void collapse_file(struct mm_struct *mm,
 	 */
 	try_to_unmap_flush();
 
-	if (result == SCAN_SUCCEED) {
+	if (cr->result == SCAN_SUCCEED) {
 		struct page *page, *tmp;
 
 		/*
@@ -2002,8 +2014,6 @@ static void collapse_file(struct mm_struct *mm,
 		 */
 		retract_page_tables(mapping, start);
 		cc->hpage = NULL;
-
-		khugepaged_pages_collapsed++;
 	} else {
 		struct page *page;
 
@@ -2055,15 +2065,16 @@ static void collapse_file(struct mm_struct *mm,
 
 static void khugepaged_scan_file(struct mm_struct *mm,
 				 struct file *file, pgoff_t start,
-				 struct collapse_control *cc)
+				 struct collapse_control *cc,
+				 struct collapse_result *cr)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, start);
 	int present, swap;
 	int node = NUMA_NO_NODE;
-	int result = SCAN_SUCCEED;
 
+	cr->result = SCAN_SUCCEED;
 	present = 0;
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
@@ -2074,7 +2085,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 		if (xa_is_value(page)) {
 			if (++swap > khugepaged_max_ptes_swap) {
-				result = SCAN_EXCEED_SWAP_PTE;
+				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
 			}
@@ -2086,25 +2097,25 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		 * into a PMD sized page
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			cr->result = SCAN_PAGE_COMPOUND;
 			break;
 		}
 
 		node = page_to_nid(page);
 		if (khugepaged_scan_abort(node, cc)) {
-			result = SCAN_SCAN_ABORT;
+			cr->result = SCAN_SCAN_ABORT;
 			break;
 		}
 		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
-			result = SCAN_PAGE_LRU;
+			cr->result = SCAN_PAGE_LRU;
 			break;
 		}
 
 		if (page_count(page) !=
 		    1 + page_mapcount(page) + page_has_private(page)) {
-			result = SCAN_PAGE_COUNT;
+			cr->result = SCAN_PAGE_COUNT;
 			break;
 		}
 
@@ -2123,12 +2134,12 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	}
 	rcu_read_unlock();
 
-	if (result == SCAN_SUCCEED) {
+	if (cr->result == SCAN_SUCCEED) {
 		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
-			result = SCAN_EXCEED_NONE_PTE;
+			cr->result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			collapse_file(mm, file, start, cc);
+			collapse_file(mm, file, start, cc, cr);
 		}
 	}
 
@@ -2137,7 +2148,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
 				 struct file *file, pgoff_t start,
-				 struct collapse_control *cc)
+				 struct collapse_control *cc,
+				 struct collapse_result *cr)
 {
 	BUILD_BUG();
 }
@@ -2209,7 +2221,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 			goto skip;
 
 		while (khugepaged_scan.address < hend) {
-			int ret;
+			struct collapse_result cr = {0};
 			cond_resched();
 			if (unlikely(khugepaged_test_exit(mm)))
 				goto breakouterloop;
@@ -2223,17 +2235,20 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, cc);
+				cr.dropped_mmap_lock = true;
+				khugepaged_scan_file(mm, file, pgoff, cc, &cr);
 				fput(file);
 			} else {
-				ret = khugepaged_scan_pmd(mm, vma,
-						khugepaged_scan.address, cc);
+				khugepaged_scan_pmd(mm, vma,
+						    khugepaged_scan.address,
+						    cc, &cr);
 			}
+			if (cr.result == SCAN_SUCCEED)
+				++khugepaged_pages_collapsed;
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-			if (ret)
+			if (cr.dropped_mmap_lock)
 				/* we released mmap_lock so break loop */
 				goto breakouterloop_mmap_lock;
 			if (progress >= pages)
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe, kernel test robot

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED;
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is such an
implementation that could benefit from this[2].

Only privately-mapped anon memory is supported for now, but it is
expected that file and shmem support will be added later to support the
use-case of backing executable text by THPs.  Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system
which might keep services from serving at their full rated load after
(re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
immediately realize iTLB performance prevent page sharing and demand
paging, both of which increase steady state memory footprint.  With
MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
and lower RAM footprints.

This call respects THP eligibility as determined by the system-wide sysfs
settings and the VMA flags for the memory range being collapsed.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc
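
As a rough illustration of how a process might use the new mode (this is
not part of the patch, just a hedged sketch: it assumes a 2MiB PMD size,
falls back to the asm-generic MADV_COLLAPSE value of 25 added above, and
omits most error handling):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* asm-generic value added by this patch */
#endif

int main(void)
{
        size_t pmd_size = 2UL << 20;    /* assumed PMD-sized hugepage */
        char *map, *p;

        /* Over-allocate so a hugepage-aligned region exists in the VMA. */
        map = mmap(NULL, 2 * pmd_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED)
                return 1;
        p = (char *)(((unsigned long)map + pmd_size - 1) & ~(pmd_size - 1));

        /* Make the range THP-eligible if THP is in "madvise" mode. */
        madvise(p, pmd_size, MADV_HUGEPAGE);
        memset(p, 1, pmd_size);         /* fault in native pages */

        /* 0 means the whole range is now (or already was) pmd-mapped. */
        if (madvise(p, pmd_size, MADV_COLLAPSE))
                perror("madvise(MADV_COLLAPSE)");

        munmap(map, 2 * pmd_size);
        return 0;
}

On failure, errno follows the mapping added below in
madvise_collapse_errno(), e.g. EAGAIN for a transient allocation failure.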

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 ++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 158 +++++++++++++++++++++++--
 mm/madvise.c                           |   5 +
 8 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 816a9937f30e..ddad7c7af44e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -236,6 +236,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -392,6 +395,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0e4f5fbe00d2..098919d0324b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -847,6 +847,23 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
+static struct page *alloc_hpage(struct collapse_control *cc, gfp_t gfp,
+				int node)
+{
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
+
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
+		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		cc->hpage = ERR_PTR(-ENOMEM);
+		return NULL;
+	}
+
+	prep_transhuge_page(cc->hpage);
+	count_vm_event(THP_COLLAPSE_ALLOC);
+	return cc->hpage;
+}
+
 #ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(struct collapse_control *cc)
 {
@@ -893,18 +910,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static struct page *khugepaged_alloc_page(struct collapse_control *cc,
 					  gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
-
-	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!cc->hpage)) {
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		cc->hpage = ERR_PTR(-ENOMEM);
-		return NULL;
-	}
-
-	prep_transhuge_page(cc->hpage);
-	count_vm_event(THP_COLLAPSE_ALLOC);
-	return cc->hpage;
+	return alloc_hpage(cc, gfp, node);
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -2471,3 +2477,131 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static inline gfp_t alloc_hugepage_madvise_gfpmask(void)
+{
+	return GFP_TRANSHUGE;
+}
+
+static void madvise_collapse_cleanup_page(struct page **hpage)
+{
+	if (!IS_ERR(*hpage) && *hpage)
+		put_page(*hpage);
+	*hpage = NULL;
+}
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+		.hpage = NULL,
+		.gfp = &alloc_hugepage_madvise_gfpmask,
+		.alloc_hpage = &alloc_hpage,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	struct collapse_result cr;
+	unsigned long hstart, hend, addr;
+	int thps = 0, nr_hpages = 0;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+
+	if (hstart >= hend || !transparent_hugepage_active(vma))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; ; ) {
+		mmap_assert_locked(mm);
+		cond_resched();
+		memset(&cr, 0, sizeof(cr));
+
+		if (unlikely(khugepaged_test_exit(mm))) {
+			cr.result = SCAN_ANY_PROCESS;
+			break;
+		}
+
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		khugepaged_scan_pmd(mm, vma, addr, &cc, &cr);
+		if (cr.dropped_mmap_lock)
+			*prev = NULL;  /* tell madvise we dropped mmap_lock */
+
+		switch (cr.result) {
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			break;
+		case SCAN_PAGE_LRU:
+			lru_add_drain_all();
+			goto retry;
+		default:
+			/* Other error, exit */
+			goto break_loop;
+		}
+		addr += HPAGE_PMD_SIZE;
+		if (addr >= hend)
+			break;
+retry:
+		if (cr.dropped_mmap_lock) {
+			mmap_read_lock(mm);
+			cr.result = hugepage_vma_revalidate(mm, addr, &vma);
+			if (cr.result)
+				goto out;
+		}
+		madvise_collapse_cleanup_page(&cc.hpage);
+	}
+
+break_loop:
+	/* madvise_walk_vmas() expects us to hold mmap_lock on return */
+	if (cr.dropped_mmap_lock)
+		mmap_read_lock(mm);
+out:
+	mmap_assert_locked(mm);
+	madvise_collapse_cleanup_page(&cc.hpage);
+	mmdrop(mm);
+
+	return thps == nr_hpages ? 0 : madvise_collapse_errno(cr.result);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 5f4537511532..638517952bd2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1054,6 +1055,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1147,6 +1150,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1336,6 +1340,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27 21:10   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe, kernel test robot

The following functions/tracepoints are shared between khugepaged and
madvise collapse contexts.  Remove the khugepaged prefixes.

tracepoint:mm_khugepaged_scan_pmd -> tracepoint:mm_scan_pmd
khugepaged_test_exit() -> test_exit()
khugepaged_scan_abort() -> scan_abort()
khugepaged_scan_pmd() -> scan_pmd()
khugepaged_find_target_node() -> find_target_node()

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/trace/events/huge_memory.h |  2 +-
 mm/khugepaged.c                    | 70 ++++++++++++++----------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 9faa678e0a5b..09be0e2f76b1 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -48,7 +48,7 @@ SCAN_STATUS
 #define EM(a, b)	{a, b},
 #define EMe(a, b)	{a, b}
 
-TRACE_EVENT(mm_khugepaged_scan_pmd,
+TRACE_EVENT(mm_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, struct page *page, bool writable,
 		 int referenced, int none_or_zero, int status, int unmapped),
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 098919d0324b..a6881f5b3c67 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -90,7 +90,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() for this scan */
+	/* Last target selected in find_target_node() for this scan */
 	int last_target_node;
 
 	struct page *hpage;
@@ -454,7 +454,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -506,7 +506,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 		return;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return;
@@ -558,12 +558,11 @@ void __khugepaged_exit(struct mm_struct *mm)
 		mmdrop(mm);
 	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_lock.
+		 * This is required to serialize against test_exit() (which is
+		 * guaranteed to run under mmap sem read mode). Stop here
+		 * (after we return all pagetables will be destroyed) until
+		 * khugepaged has finished working on the pagetables under
+		 * the mmap_lock.
 		 */
 		mmap_write_lock(mm);
 		mmap_write_unlock(mm);
@@ -817,7 +816,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
+static bool scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -865,7 +864,7 @@ static struct page *alloc_hpage(struct collapse_control *cc, gfp_t gfp,
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -913,7 +912,7 @@ static struct page *khugepaged_alloc_page(struct collapse_control *cc,
 	return alloc_hpage(cc, gfp, node);
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -994,7 +993,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long hstart, hend;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -1038,7 +1037,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ * Only done if scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held,
  * but with mmap_lock held to protect against vma changes.
@@ -1130,7 +1129,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmap_read_unlock(mm);
 	cr->dropped_mmap_lock = true;
 
-	node = khugepaged_find_target_node(cc);
+	node = find_target_node(cc);
 	/* sched to specified node before huage page memory copy */
 	if (task_node(current) != node) {
 		cpumask = cpumask_of_node(node);
@@ -1271,11 +1270,9 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return;
 }
 
-static void khugepaged_scan_pmd(struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				unsigned long address,
-				struct collapse_control *cc,
-				struct collapse_result *cr)
+static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long address, struct collapse_control *cc,
+		     struct collapse_result *cr)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1365,7 +1362,7 @@ static void khugepaged_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (scan_abort(node, cc)) {
 			cr->result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1422,8 +1419,8 @@ static void khugepaged_scan_pmd(struct mm_struct *mm,
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, cc, referenced, unmapped, cr);
 out:
-	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, cr->result, unmapped);
+	trace_mm_scan_pmd(mm, page, writable, referenced, none_or_zero,
+			  cr->result, unmapped);
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
@@ -1432,7 +1429,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1603,7 +1600,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1666,7 +1663,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
+			if (!test_exit(mm) && !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -1724,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
 
 	/* Only allocate from the target node */
 	gfp = cc->gfp() | __GFP_THISNODE;
-	node = khugepaged_find_target_node(cc);
+	node = find_target_node(cc);
 
 	new_page = cc->alloc_hpage(cc, gfp, node);
 	if (!new_page) {
@@ -2108,7 +2105,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node, cc)) {
+		if (scan_abort(node, cc)) {
 			cr->result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2197,7 +2194,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	vma = NULL;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
-	if (likely(!khugepaged_test_exit(mm)))
+	if (likely(!test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
 
 	progress++;
@@ -2205,7 +2202,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2229,7 +2226,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		while (khugepaged_scan.address < hend) {
 			struct collapse_result cr = {0};
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(test_exit(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2245,9 +2242,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 				khugepaged_scan_file(mm, file, pgoff, cc, &cr);
 				fput(file);
 			} else {
-				khugepaged_scan_pmd(mm, vma,
-						    khugepaged_scan.address,
-						    cc, &cr);
+				scan_pmd(mm, vma, khugepaged_scan.address, cc,
+					 &cr);
 			}
 			if (cr.result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
@@ -2271,7 +2267,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
@@ -2549,13 +2545,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		cond_resched();
 		memset(&cr, 0, sizeof(cr));
 
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(test_exit(mm))) {
 			cr.result = SCAN_ANY_PROCESS;
 			break;
 		}
 
 		memset(cc.node_load, 0, sizeof(cc.node_load));
-		khugepaged_scan_pmd(mm, vma, addr, &cc, &cr);
+		scan_pmd(mm, vma, addr, &cc, &cr);
 		if (cr.dropped_mmap_lock)
 			*prev = NULL;  /* tell madvise we dropped mmap_lock */
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-27 21:12   ` Peter Xu
  2022-04-26 14:44 ` [PATCH v3 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_pte_scan_limits flag to struct collapse_control that
allows a collapse context to ignore the sysfs-controlled knobs
khugepaged_max_ptes_[none|swap|shared].  Set the flag in the khugepaged
collapse context to preserve existing khugepaged behavior, and clear it
in the madvise collapse context, since the user presumably has reason to
believe the collapse will be beneficial.
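
Schematically, the checks touched here reduce to the gate below; this is
a simplified sketch of the pattern (with made-up numbers), not the
literal kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * A sysfs limit only rejects the collapse when the collapse context
 * asked for enforcement.
 */
static bool within_limit(int count, int sysfs_max, bool enforce_pte_scan_limits)
{
        return !enforce_pte_scan_limits || count <= sysfs_max;
}

int main(void)
{
        /* khugepaged context: limit enforced, hypothetical 400 > 64. */
        printf("khugepaged: %d\n", within_limit(400, 64, true));       /* 0 */
        /* madvise context: same scan result, but the limit is ignored. */
        printf("madvise:    %d\n", within_limit(400, 64, false));      /* 1 */
        return 0;
}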

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a6881f5b3c67..57725482290d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -87,6 +87,9 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	/* Respect khugepaged_max_ptes_[none|swap|shared] */
+	bool enforce_pte_scan_limits;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -632,6 +635,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -645,7 +649,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -665,8 +670,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits && page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -1208,7 +1213,7 @@ static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 
 	spin_lock(pte_ptl);
-	cr->result =  __collapse_huge_page_isolate(vma, address, pte,
+	cr->result =  __collapse_huge_page_isolate(vma, address, pte, cc,
 						   &compound_pagelist);
 	spin_unlock(pte_ptl);
 
@@ -1297,7 +1302,8 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->enforce_pte_scan_limits) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1316,7 +1322,8 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				cr->result = SCAN_EXCEED_NONE_PTE;
@@ -1346,8 +1353,9 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_unmap;
 		}
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (cc->enforce_pte_scan_limits &&
+		    page_mapcount(page) > 1 &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			cr->result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -2087,7 +2095,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->enforce_pte_scan_limits &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				cr->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2138,7 +2147,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	rcu_read_unlock();
 
 	if (cr->result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->enforce_pte_scan_limits) {
 			cr->result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
@@ -2365,6 +2375,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
+		.enforce_pte_scan_limits = true,
 		.last_target_node = NUMA_NO_NODE,
 		.gfp = &alloc_hugepage_khugepaged_gfpmask,
 		.alloc_hpage = &khugepaged_alloc_page,
@@ -2512,6 +2523,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		     unsigned long start, unsigned long end)
 {
 	struct collapse_control cc = {
+		.enforce_pte_scan_limits = false,
 		.last_target_node = NUMA_NO_NODE,
 		.hpage = NULL,
 		.gfp = &alloc_hugepage_madvise_gfpmask,
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add an enforce_young flag to struct collapse_control that allows a
collapse context to ignore the requirement that some pages in the region
being collapsed be young or referenced.  Set the flag in the khugepaged
collapse context to preserve existing khugepaged behavior, and clear it
in the madvise collapse context, since the user presumably has reason to
believe the collapse will be beneficial.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 57725482290d..fe6810825259 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -90,6 +90,9 @@ struct collapse_control {
 	/* Respect khugepaged_max_ptes_[none|swap|shared] */
 	bool enforce_pte_scan_limits;
 
+	/* Require memory to be young */
+	bool enforce_young;
+
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
@@ -738,9 +741,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			list_add_tail(&page->lru, compound_pagelist);
 next:
 		/* There should be enough young pte to collapse the page */
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 
 		if (pte_write(pteval))
@@ -749,7 +753,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	if (unlikely(!writable)) {
 		result = SCAN_PAGE_RO;
-	} else if (unlikely(!referenced)) {
+	} else if (unlikely(cc->enforce_young && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1409,14 +1413,16 @@ static void scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			cr->result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
-		if (pte_young(pteval) ||
-		    page_is_young(page) || PageReferenced(page) ||
-		    mmu_notifier_test_young(vma->vm_mm, address))
+		if (cc->enforce_young &&
+		    (pte_young(pteval) || page_is_young(page) ||
+		     PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm,
+								     address)))
 			referenced++;
 	}
 	if (!writable) {
 		cr->result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
+	} else if (cc->enforce_young && (!referenced || (unmapped && referenced
+							 < HPAGE_PMD_NR / 2))) {
 		cr->result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		cr->result = SCAN_SUCCEED;
@@ -2376,6 +2382,7 @@ static int khugepaged(void *none)
 	struct mm_slot *mm_slot;
 	struct collapse_control cc = {
 		.enforce_pte_scan_limits = true,
+		.enforce_young = true,
 		.last_target_node = NUMA_NO_NODE,
 		.gfp = &alloc_hugepage_khugepaged_gfpmask,
 		.alloc_hpage = &khugepaged_alloc_page,
@@ -2524,6 +2531,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 {
 	struct collapse_control cc = {
 		.enforce_pte_scan_limits = false,
+		.enforce_young = false,
 		.last_target_node = NUMA_NO_NODE,
 		.hpage = NULL,
 		.gfp = &alloc_hugepage_madvise_gfpmask,
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise()
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Allow MADV_COLLAPSE behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting a collapse of its own memory.
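
For illustration only, a sketch of how a privileged agent might drive
this from userspace; the helper name, target pid, address and length are
placeholders, and the SYS_pidfd_open/SYS_process_madvise numbers are
assumed to come from the system headers:

#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Ask the kernel to collapse [addr, addr + len) in the process behind
 * 'pid'.  Needs CAP_SYS_ADMIN unless 'pid' is the caller itself.
 */
static int collapse_range_in(pid_t pid, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        long ret;

        if (pidfd < 0)
                return -1;
        /* process_madvise() returns the number of bytes advised. */
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
        close(pidfd);
        return ret == (long)len ? 0 : -1;
}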

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/madvise.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 638517952bd2..08c11217025a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 		return true;
+	case MADV_COLLAPSE:
+		return task == current || capable(CAP_SYS_ADMIN);
 	default:
 		return false;
 	}
@@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 10/12] selftests/vm: modularize collapse selftests
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Modularize the collapse action of khugepaged collapse selftests by
introducing a struct collapse_context which specifies how to collapse a
given memory range and the expected semantics of the collapse.  This
can be reused later to test other collapse contexts.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 257 +++++++++++-------------
 1 file changed, 116 insertions(+), 141 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 155120b67a16..c59d832fee96 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -23,6 +23,12 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct collapse_context {
+	const char *name;
+	void (*collapse)(const char *msg, char *p, bool expect);
+	bool enforce_pte_scan_limits;
+};
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -528,53 +534,39 @@ static void alloc_at_fault(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(void)
+static void collapse_full(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, hpage_pmd_size);
-	if (wait_for_scan("Collapse fully populated PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse fully populated PTE table", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_empty(void)
+static void collapse_empty(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
-	if (wait_for_scan("Do not collapse empty PTE table", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Do not collapse empty PTE table", p, false);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(void)
+static void collapse_single_pte_entry(struct collapse_context *context)
 {
 	void *p;
 
 	p = alloc_mapping();
 	fill_memory(p, 0, page_size);
-	if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE entry present", p,
+			  true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(void)
+static void collapse_max_ptes_none(struct collapse_context *context)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = default_settings;
@@ -586,28 +578,23 @@ static void collapse_max_ptes_none(void)
 	p = alloc_mapping();
 
 	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
-	if (wait_for_scan("Do not collapse with max_ptes_none exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_none exceeded", p,
+			  !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-	if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		context->collapse("Collapse with max_ptes_none PTEs empty", p,
+				  true);
+		validate_memory(p, 0,
+				(hpage_pmd_nr - max_ptes_none) * page_size);
+	}
 
 	munmap(p, hpage_pmd_size);
 	write_settings(&default_settings);
 }
 
-static void collapse_swapin_single_pte(void)
+static void collapse_swapin_single_pte(struct collapse_context *context)
 {
 	void *p;
 	p = alloc_mapping();
@@ -625,18 +612,14 @@ static void collapse_swapin_single_pte(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Collapse with swapping in single PTE entry", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse with swapping in single PTE entry",
+			  p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(void)
+static void collapse_max_ptes_swap(struct collapse_context *context)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
@@ -656,39 +639,34 @@ static void collapse_max_ptes_swap(void)
 		goto out;
 	}
 
-	if (wait_for_scan("Do not collapse with max_ptes_swap exceeded", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		fail("Fail");
-	else
-		success("OK");
+	context->collapse("Maybe collapse with max_ptes_swap exceeded",
+			      p, !context->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
-	printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
-	if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
-		perror("madvise(MADV_PAGEOUT)");
-		exit(EXIT_FAILURE);
-	}
-	if (check_swap(p, max_ptes_swap * page_size)) {
-		success("OK");
-	} else {
-		fail("Fail");
-		goto out;
-	}
+	if (context->enforce_pte_scan_limits) {
+		fill_memory(p, 0, hpage_pmd_size);
+		printf("Swapout %d of %d pages...", max_ptes_swap,
+		       hpage_pmd_nr);
+		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+			perror("madvise(MADV_PAGEOUT)");
+			exit(EXIT_FAILURE);
+		}
+		if (check_swap(p, max_ptes_swap * page_size)) {
+			success("OK");
+		} else {
+			fail("Fail");
+			goto out;
+		}
 
-	if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
-	validate_memory(p, 0, hpage_pmd_size);
+		context->collapse("Collapse with max_ptes_swap pages swapped out",
+				  p, true);
+		validate_memory(p, 0, hpage_pmd_size);
+	}
 out:
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(void)
+static void collapse_single_pte_entry_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -710,17 +688,13 @@ static void collapse_single_pte_entry_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table with single PTE mapping compound page", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table with single PTE mapping compound page",
+			  p, true);
 	validate_memory(p, 0, page_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(void)
+static void collapse_full_of_compound(struct collapse_context *context)
 {
 	void *p;
 
@@ -742,17 +716,12 @@ static void collapse_full_of_compound(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of compound pages", p, true);
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(void)
+static void collapse_compound_extreme(struct collapse_context *context)
 {
 	void *p;
 	int i;
@@ -798,18 +767,14 @@ static void collapse_compound_extreme(void)
 	else
 		fail("Fail");
 
-	if (wait_for_scan("Collapse PTE table full of different compound pages", p))
-		fail("Timeout");
-	else if (check_huge(p))
-		success("OK");
-	else
-		fail("Fail");
+	context->collapse("Collapse PTE table full of different compound pages",
+			  p, true);
 
 	validate_memory(p, 0, hpage_pmd_size);
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork(void)
+static void collapse_fork(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -835,13 +800,8 @@ static void collapse_fork(void)
 			fail("Fail");
 
 		fill_memory(p, page_size, 2 * page_size);
-
-		if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table with single page shared with parent process",
+				  p, true);
 
 		validate_memory(p, 0, page_size);
 		munmap(p, hpage_pmd_size);
@@ -860,7 +820,7 @@ static void collapse_fork(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(void)
+static void collapse_fork_compound(struct collapse_context *context)
 {
 	int wstatus;
 	void *p;
@@ -896,14 +856,10 @@ static void collapse_fork_compound(void)
 		fill_memory(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
-		if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Collapse PTE table full of compound pages in child",
+				  p, true);
 		write_num("khugepaged/max_ptes_shared",
-				default_settings.khugepaged.max_ptes_shared);
+			  default_settings.khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -922,7 +878,7 @@ static void collapse_fork_compound(void)
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared()
+static void collapse_max_ptes_shared(struct collapse_context *context)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
@@ -957,28 +913,22 @@ static void collapse_max_ptes_shared()
 		else
 			fail("Fail");
 
-		if (wait_for_scan("Do not collapse with max_ptes_shared exceeded", p))
-			fail("Timeout");
-		else if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
-		if (!check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
-
-
-		if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
-			fail("Timeout");
-		else if (check_huge(p))
-			success("OK");
-		else
-			fail("Fail");
+		context->collapse("Maybe collapse with max_ptes_shared exceeded",
+				  p, !context->enforce_pte_scan_limits);
+
+		if (context->enforce_pte_scan_limits) {
+			printf("Trigger CoW on page %d of %d...",
+			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+				    page_size);
+			if (!check_huge(p))
+				success("OK");
+			else
+				fail("Fail");
+
+			context->collapse("Collapse with max_ptes_shared PTEs shared",
+					  p, true);
+		}
 
 		validate_memory(p, 0, hpage_pmd_size);
 		munmap(p, hpage_pmd_size);
@@ -997,8 +947,27 @@ static void collapse_max_ptes_shared()
 	munmap(p, hpage_pmd_size);
 }
 
+static void khugepaged_collapse(const char *msg, char *p, bool expect)
+{
+	if (wait_for_scan(msg, p))
+		fail("Timeout");
+	else if (check_huge(p) == expect)
+		success("OK");
+	else
+		fail("Fail");
+}
+
 int main(void)
 {
+	struct collapse_context contexts[] = {
+		{
+			.name = "khugepaged",
+			.collapse = &khugepaged_collapse,
+			.enforce_pte_scan_limits = true,
+		},
+	};
+	int i;
+
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
@@ -1014,18 +983,24 @@ int main(void)
 	adjust_settings();
 
 	alloc_at_fault();
-	collapse_full();
-	collapse_empty();
-	collapse_single_pte_entry();
-	collapse_max_ptes_none();
-	collapse_swapin_single_pte();
-	collapse_max_ptes_swap();
-	collapse_single_pte_entry_compound();
-	collapse_full_of_compound();
-	collapse_compound_extreme();
-	collapse_fork();
-	collapse_fork_compound();
-	collapse_max_ptes_shared();
+
+	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
+		struct collapse_context *c = &contexts[i];
+
+		printf("\n*** Testing context: %s ***\n", c->name);
+		collapse_full(c);
+		collapse_empty(c);
+		collapse_single_pte_entry(c);
+		collapse_max_ptes_none(c);
+		collapse_swapin_single_pte(c);
+		collapse_max_ptes_swap(c);
+		collapse_single_pte_entry_compound(c);
+		collapse_full_of_compound(c);
+		collapse_compound_extreme(c);
+		collapse_fork(c);
+		collapse_fork_compound(c);
+		collapse_max_ptes_shared(c);
+	}
 
 	restore_settings(0);
 }
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog




* [PATCH v3 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 14:44 ` [PATCH v3 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
  2022-04-26 20:23 ` [PATCH v3 00/12] mm: userspace hugepage collapse Andrew Morton
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add MADV_COLLAPSE selftests.  Extend struct collapse_context to support
context initialization/cleanup.  This is used by the madvise collapse
context to "disable" and "enable" khugepaged, since khugepaged would
otherwise interfere with the tests.

The mechanism used to "disable" khugepaged is a hack: it sets
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to a
large value and feeds khugepaged enough suitable VMAs/pages to keep
khugepaged sleeping for the duration of the madvise collapse tests.
Since khugepaged is woken when this file is written, enough VMAs must be
queued to put khugepaged back to sleep when the tests write to this file
in write_settings().
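
The static count of wakeup sites relies on the preprocessor's
__COUNTER__ macro; a minimal standalone sketch of the idea (the macro
and function names here are illustrative, not taken from the patch):

#include <stdio.h>

/* Every expansion of WAKES_KHUGEPAGED() consumes one __COUNTER__ value. */
#define WAKES_KHUGEPAGED()      ((void)__COUNTER__)

static void test_a(void) { WAKES_KHUGEPAGED(); }
static void test_b(void) { WAKES_KHUGEPAGED(); }

int main(void)
{
        /*
         * __COUNTER__ expands to the number of prior expansions in this
         * translation unit, i.e. the number of wakeup call sites above
         * (2 here); the selftest uses the same trick to size the pool of
         * VMAs that keeps khugepaged asleep.
         */
        int wakeup_sites = __COUNTER__;

        test_a();
        test_b();
        printf("wakeup call sites: %d\n", wakeup_sites);
        return 0;
}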

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 133 ++++++++++++++++++++++--
 1 file changed, 125 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index c59d832fee96..e0ccc9443f78 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -14,17 +14,23 @@
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
 
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
 static unsigned long page_size;
 static int hpage_pmd_nr;
+static int num_khugepaged_wakeups;
 
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
 struct collapse_context {
 	const char *name;
+	bool (*init_context)(void);
+	bool (*cleanup_context)(void);
 	void (*collapse)(const char *msg, char *p, bool expect);
 	bool enforce_pte_scan_limits;
 };
@@ -264,6 +270,17 @@ static void write_num(const char *name, unsigned long num)
 	}
 }
 
+/*
+ * Use this macro instead of write_settings inside tests, and should
+ * be called at most once per callsite.
+ *
+ * Hack to statically count the number of times khugepaged is woken up due to
+ * writes to
+ * /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs,
+ * and is stored in __COUNTER__.
+ */
+#define WRITE_SETTINGS(s) do { __COUNTER__; write_settings(s); } while (0)
+
 static void write_settings(struct settings *settings)
 {
 	struct khugepaged_settings *khugepaged = &settings->khugepaged;
@@ -332,7 +349,7 @@ static void adjust_settings(void)
 {
 
 	printf("Adjust settings...");
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 	success("OK");
 }
 
@@ -440,20 +457,25 @@ static bool check_swap(void *addr, unsigned long size)
 	return swap;
 }
 
-static void *alloc_mapping(void)
+static void *alloc_mapping_at(void *at, size_t size)
 {
 	void *p;
 
-	p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
-	if (p != BASE_ADDR) {
-		printf("Failed to allocate VMA at %p\n", BASE_ADDR);
+	p = mmap(at, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE,
+		 -1, 0);
+	if (p != at) {
+		printf("Failed to allocate VMA at %p\n", at);
 		exit(EXIT_FAILURE);
 	}
 
 	return p;
 }
 
+static void *alloc_mapping(void)
+{
+	return alloc_mapping_at(BASE_ADDR, hpage_pmd_size);
+}
+
 static void fill_memory(int *p, unsigned long start, unsigned long end)
 {
 	int i;
@@ -573,7 +595,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	void *p;
 
 	settings.khugepaged.max_ptes_none = max_ptes_none;
-	write_settings(&settings);
+	WRITE_SETTINGS(&settings);
 
 	p = alloc_mapping();
 
@@ -591,7 +613,7 @@ static void collapse_max_ptes_none(struct collapse_context *context)
 	}
 
 	munmap(p, hpage_pmd_size);
-	write_settings(&default_settings);
+	WRITE_SETTINGS(&default_settings);
 }
 
 static void collapse_swapin_single_pte(struct collapse_context *context)
@@ -947,6 +969,87 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse(const char *msg, char *p, bool expect)
+{
+	int ret;
+
+	printf("%s...", msg);
+	/* Sanity check */
+	if (check_huge(p)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	ret = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (((bool)ret) == expect)
+		fail("Fail: Bad return value");
+	else if (check_huge(p) != expect)
+		fail("Fail: check_huge()");
+	else
+		success("OK");
+}
+
+static struct khugepaged_disable_state {
+	void *p;
+	size_t map_size;
+} khugepaged_disable_state;
+
+static bool disable_khugepaged(void)
+{
+	/*
+	 * Hack to "disable" khugepaged by setting
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs to some large
+	 * value, then feeding it enough suitable VMAs to scan and subsequently
+	 * sleep.
+	 *
+	 * khugepaged is woken up on writes to
+	 * /transparent_hugepage/khugepaged/scan_sleep_millisecs, so care must
+	 * be taken to not inadvertently wake khugepaged in these tests.
+	 *
+	 * Feed khugepaged 1 hugepage-sized VMA to scan and sleep on, then
+	 * N more for each time khugepaged would be woken up.
+	 */
+	size_t map_size = (num_khugepaged_wakeups + 1) * hpage_pmd_size;
+	void *p;
+	bool ret = true;
+	int full_scans;
+	int timeout = 6;  /* 3 seconds */
+
+	default_settings.khugepaged.scan_sleep_millisecs = 1000 * 60 * 10;
+	default_settings.khugepaged.pages_to_scan = 1;
+	write_settings(&default_settings);
+
+	p = alloc_mapping_at(((char *)BASE_ADDR) + (1UL << 30), map_size);
+	fill_memory(p, 0, map_size);
+
+	full_scans = read_num("khugepaged/full_scans") + 2;
+
+	printf("disabling khugepaged...");
+	while (timeout--) {
+		if (read_num("khugepaged/full_scans") >= full_scans) {
+			fail("Fail");
+			ret = false;
+			break;
+		}
+		printf(".");
+		usleep(TICK);
+	}
+	success("OK");
+	khugepaged_disable_state.p = p;
+	khugepaged_disable_state.map_size = map_size;
+	return ret;
+}
+
+static bool enable_khugepaged(void)
+{
+	printf("enabling khugepaged...");
+	munmap(khugepaged_disable_state.p, khugepaged_disable_state.map_size);
+	write_settings(&saved_settings);
+	success("OK");
+	return true;
+}
+
 static void khugepaged_collapse(const char *msg, char *p, bool expect)
 {
 	if (wait_for_scan(msg, p))
@@ -962,9 +1065,18 @@ int main(void)
 	struct collapse_context contexts[] = {
 		{
 			.name = "khugepaged",
+			.init_context = NULL,
+			.cleanup_context = NULL,
 			.collapse = &khugepaged_collapse,
 			.enforce_pte_scan_limits = true,
 		},
+		{
+			.name = "madvise",
+			.init_context = &disable_khugepaged,
+			.cleanup_context = &enable_khugepaged,
+			.collapse = &madvise_collapse,
+			.enforce_pte_scan_limits = false,
+		},
 	};
 	int i;
 
@@ -973,6 +1085,7 @@ int main(void)
 	page_size = getpagesize();
 	hpage_pmd_size = read_num("hpage_pmd_size");
 	hpage_pmd_nr = hpage_pmd_size / page_size;
+	num_khugepaged_wakeups = __COUNTER__;
 
 	default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
 	default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
@@ -988,6 +1101,8 @@ int main(void)
 		struct collapse_context *c = &contexts[i];
 
 		printf("\n*** Testing context: %s ***\n", c->name);
+		if (c->init_context && !c->init_context())
+			continue;
 		collapse_full(c);
 		collapse_empty(c);
 		collapse_single_pte_entry(c);
@@ -1000,6 +1115,8 @@ int main(void)
 		collapse_fork(c);
 		collapse_fork_compound(c);
 		collapse_max_ptes_shared(c);
+		if (c->cleanup_context && !c->cleanup_context())
+			break;
 	}
 
 	restore_settings(0);
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 12/12] selftests/vm: add test to verify recollapse of THPs
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
@ 2022-04-26 14:44 ` Zach O'Keefe
  2022-04-26 20:23 ` [PATCH v3 00/12] mm: userspace hugepage collapse Andrew Morton
  12 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-26 14:44 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
	Zach O'Keefe

Add a selftest specific to the madvise collapse context that tests that
MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is
already pmd-mapped.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 32 +++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index e0ccc9443f78..c36d04218083 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -969,6 +969,32 @@ static void collapse_max_ptes_shared(struct collapse_context *context)
 	munmap(p, hpage_pmd_size);
 }
 
+static void madvise_collapse_existing_thps(void)
+{
+	void *p;
+	int err;
+
+	p = alloc_mapping();
+	fill_memory(p, 0, hpage_pmd_size);
+
+	printf("Collapse fully populated PTE table...");
+	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+	if (err == 0 && check_huge(p)) {
+		success("OK");
+		printf("Re-collapse PMD-mapped hugepage");
+		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
+		if (err == 0 && check_huge(p))
+			success("OK");
+		else
+			fail("Fail");
+	} else {
+		fail("Fail");
+	}
+	validate_memory(p, 0, hpage_pmd_size);
+	munmap(p, hpage_pmd_size);
+}
+
 static void madvise_collapse(const char *msg, char *p, bool expect)
 {
 	int ret;
@@ -1097,6 +1123,7 @@ int main(void)
 
 	alloc_at_fault();
 
+	/* Shared tests */
 	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); ++i) {
 		struct collapse_context *c = &contexts[i];
 
@@ -1119,5 +1146,10 @@ int main(void)
 			break;
 	}
 
+	/* madvise-specific tests */
+	disable_khugepaged();
+	madvise_collapse_existing_thps();
+	enable_khugepaged();
+
 	restore_settings(0);
 }
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 00/12] mm: userspace hugepage collapse
  2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-04-26 14:44 ` [PATCH v3 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
@ 2022-04-26 20:23 ` Andrew Morton
  12 siblings, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2022-04-26 20:23 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, Peter Xu, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, 26 Apr 2022 07:44:00 -0700 "Zach O'Keefe" <zokeefe@google.com> wrote:

> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.

This looks useful.

Alas, the flow of new material seems to be far exceeding our review
bandwidth at present, so I think I'll sit this one out for a while.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
@ 2022-04-27  0:26   ` Peter Xu
  2022-04-27 15:48     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27  0:26 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

Hi, Zach,

On Tue, Apr 26, 2022 at 07:44:01AM -0700, Zach O'Keefe wrote:
> When scanning an anon pmd to see if it's eligible for collapse, return
> SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> file-collapse path, since the latter might identify pte-mapped compound
> pages.  This is required by MADV_COLLAPSE which necessarily needs to
> know what hugepage-aligned/sized regions are already pmd-mapped.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> Reported-by: kernel test robot <lkp@intel.com>

IIUC we don't need to attach this reported-by if this is not a bugfix.  I
think you can simply fix all issues reported by the test bot and only
attach the line if the patch is fixing the problem that the bot was
reporting explicitly.

> ---
>  include/trace/events/huge_memory.h |  3 ++-
>  mm/internal.h                      |  1 +
>  mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
>  mm/rmap.c                          | 15 +++++++++++++--
>  4 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d651f3437367..9faa678e0a5b 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -33,7 +33,8 @@
>  	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
>  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
>  	EM( SCAN_TRUNCATED,		"truncated")			\
> -	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EMe(SCAN_PMD_MAPPED,		"page_pmd_mapped")		\

Nit: IMHO it can be put even in the middle so we don't need to touch the
EMe() every time. :)

Apart from that, it does sound proper to me to put SCAN_PMD_MAPPED right
after SCAN_PMD_NULL anyway.
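
(For context, the tail entry is only special because of the trailing comma
when the list is expanded into the string table for __print_symbolic() --
roughly how huge_memory.h wires it up, from memory, so treat as a sketch:

	#undef EM
	#undef EMe
	#define EM(a, b)	TRACE_DEFINE_ENUM(a);
	#define EMe(a, b)	TRACE_DEFINE_ENUM(a);

	SCAN_STATUS

	/* redefine to build the {value, "string"} table for the output */
	#undef EM
	#undef EMe
	#define EM(a, b)	{a, b},
	#define EMe(a, b)	{a, b}

so an entry added in the middle never needs to touch EMe().)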

>  
>  #undef EM
>  #undef EMe
> diff --git a/mm/internal.h b/mm/internal.h
> index 0667abd57634..51ae9f71a2a3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -172,6 +172,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
>  /*
>   * in mm/rmap.c:
>   */
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
>  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>  
>  /*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ba8dbd1825da..2933b13fc975 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -51,6 +51,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_PMD_MAPPED,
>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -987,6 +988,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  	return 0;
>  }
>  
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +				   unsigned long address,
> +				   pmd_t **pmd)
> +{
> +	pmd_t pmde;
> +
> +	*pmd = mm_find_pmd_raw(mm, address);
> +	if (!*pmd)
> +		return SCAN_PMD_NULL;
> +
> +	pmde = pmd_read_atomic(*pmd);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +	barrier();
> +#endif
> +	if (!pmd_present(pmde) || pmd_none(pmde))

Could we drop the pmd_none() check?  I assume !pmd_present() should have
covered that case already?

> +		return SCAN_PMD_NULL;
> +	if (pmd_trans_huge(pmde))
> +		return SCAN_PMD_MAPPED;
> +	return SCAN_SUCCEED;
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> @@ -1238,11 +1262,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd) {
> -		result = SCAN_PMD_NULL;
> +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	if (result != SCAN_SUCCEED)
>  		goto out;
> -	}
>  
>  	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 61e63db5dc6f..49817f35e65c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>  	return vma_address(page, vma);
>  }
>  
> -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
>  {
>  	pgd_t *pgd;
>  	p4d_t *p4d;
>  	pud_t *pud;
>  	pmd_t *pmd = NULL;
> -	pmd_t pmde;
>  
>  	pgd = pgd_offset(mm, address);
>  	if (!pgd_present(*pgd))
> @@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  		goto out;
>  
>  	pmd = pmd_offset(pud, address);
> +out:
> +	return pmd;
> +}
> +
> +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +	pmd_t pmde;
> +	pmd_t *pmd;
> +
> +	pmd = mm_find_pmd_raw(mm, address);
> +	if (!pmd)
> +		goto out;
>  	/*
>  	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
>  	 * without holding anon_vma lock for write.  So when looking for a
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  2022-04-27  0:26   ` Peter Xu
@ 2022-04-27 15:48     ` Zach O'Keefe
  0 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-27 15:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

Thanks for taking the time to review, Peter!

On Tue, Apr 26, 2022 at 5:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Zach,
>
> On Tue, Apr 26, 2022 at 07:44:01AM -0700, Zach O'Keefe wrote:
> > When scanning an anon pmd to see if it's eligible for collapse, return
> > SCAN_PMD_MAPPED if the pmd already maps a THP.  Note that
> > SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> > file-collapse path, since the latter might identify pte-mapped compound
> > pages.  This is required by MADV_COLLAPSE which necessarily needs to
> > know what hugepage-aligned/sized regions are already pmd-mapped.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > Reported-by: kernel test robot <lkp@intel.com>
>
> IIUC we don't need to attach this reported-by if this is not a bugfix.  I
> think you can simply fix all issues reported by the test bot and only
> attach the line if the patch is fixing the problem that the bot was
> reporting explicitly.
>

Ya, I wasn't entirely sure what to do here, but including it isn't
without precedent, e.g. commit 92bbef67d459 ("mm: make
alloc_contig_range work at pageblock granularity"), and likewise I just
wanted to give credit where I thought it was due. Though, I suppose
folks who catch bugs in the review process aren't ack'd similarly, so
perhaps it does make sense to remove this.

> > ---
> >  include/trace/events/huge_memory.h |  3 ++-
> >  mm/internal.h                      |  1 +
> >  mm/khugepaged.c                    | 30 ++++++++++++++++++++++++++----
> >  mm/rmap.c                          | 15 +++++++++++++--
> >  4 files changed, 42 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index d651f3437367..9faa678e0a5b 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -33,7 +33,8 @@
> >       EM( SCAN_ALLOC_HUGE_PAGE_FAIL,  "alloc_huge_page_failed")       \
> >       EM( SCAN_CGROUP_CHARGE_FAIL,    "ccgroup_charge_failed")        \
> >       EM( SCAN_TRUNCATED,             "truncated")                    \
> > -     EMe(SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> > +     EM( SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> > +     EMe(SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
>
> Nit: IMHO it can be put even in the middle so we don't need to touch the
> EMe() every time. :)
>
> Apart from that, it does sound proper to me to put SCAN_PMD_MAPPED to be
> right after SCAN_PMD_NULL anyway.
>

Makes sense to me. Done.

> >
> >  #undef EM
> >  #undef EMe
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 0667abd57634..51ae9f71a2a3 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -172,6 +172,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
> >  /*
> >   * in mm/rmap.c:
> >   */
> > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
> >  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >
> >  /*
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ba8dbd1825da..2933b13fc975 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -51,6 +51,7 @@ enum scan_result {
> >       SCAN_CGROUP_CHARGE_FAIL,
> >       SCAN_TRUNCATED,
> >       SCAN_PAGE_HAS_PRIVATE,
> > +     SCAN_PMD_MAPPED,
> >  };
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -987,6 +988,29 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >       return 0;
> >  }
> >
> > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +                                unsigned long address,
> > +                                pmd_t **pmd)
> > +{
> > +     pmd_t pmde;
> > +
> > +     *pmd = mm_find_pmd_raw(mm, address);
> > +     if (!*pmd)
> > +             return SCAN_PMD_NULL;
> > +
> > +     pmde = pmd_read_atomic(*pmd);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +     barrier();
> > +#endif
> > +     if (!pmd_present(pmde) || pmd_none(pmde))
>
> Could we drop the pmd_none() check?  I assume !pmd_present() should have
> covered that case already?
>

I opted for safety here since I didn't know if pmd_present() always
implied !pmd_none() on all archs, but given mm_find_pmd() elides the
check, perhaps it's safe to do so here. Thanks for the suggestion.
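
I.e. dropping it would leave something like (untested):

	pmde = pmd_read_atomic(*pmd);

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
	barrier();
#endif
	if (!pmd_present(pmde))
		return SCAN_PMD_NULL;
	if (pmd_trans_huge(pmde))
		return SCAN_PMD_MAPPED;
	return SCAN_SUCCEED;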


> > +             return SCAN_PMD_NULL;
> > +     if (pmd_trans_huge(pmde))
> > +             return SCAN_PMD_MAPPED;
> > +     return SCAN_SUCCEED;
> > +}
> > +
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > @@ -1238,11 +1262,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -     pmd = mm_find_pmd(mm, address);
> > -     if (!pmd) {
> > -             result = SCAN_PMD_NULL;
> > +     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +     if (result != SCAN_SUCCEED)
> >               goto out;
> > -     }
> >
> >       memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> >       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 61e63db5dc6f..49817f35e65c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -759,13 +759,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
> >       return vma_address(page, vma);
> >  }
> >
> > -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
> >  {
> >       pgd_t *pgd;
> >       p4d_t *p4d;
> >       pud_t *pud;
> >       pmd_t *pmd = NULL;
> > -     pmd_t pmde;
> >
> >       pgd = pgd_offset(mm, address);
> >       if (!pgd_present(*pgd))
> > @@ -780,6 +779,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >               goto out;
> >
> >       pmd = pmd_offset(pud, address);
> > +out:
> > +     return pmd;
> > +}
> > +
> > +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> > +{
> > +     pmd_t pmde;
> > +     pmd_t *pmd;
> > +
> > +     pmd = mm_find_pmd_raw(mm, address);
> > +     if (!pmd)
> > +             goto out;
> >       /*
> >        * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> >        * without holding anon_vma lock for write.  So when looking for a
> > --
> > 2.36.0.rc2.479.g8af0fa9b8e-goog
> >
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 02/12] mm/khugepaged: add struct collapse_control
  2022-04-26 14:44 ` [PATCH v3 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-04-27 19:49   ` Peter Xu
  2022-04-28  0:19     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27 19:49 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, Apr 26, 2022 at 07:44:02AM -0700, Zach O'Keefe wrote:
> Modularize hugepage collapse by introducing struct collapse_control.
> This structure serves to describe the properties of the requested
> collapse, as well as serve as a local scratch pad to use during the
> collapse itself.

Reasonable, but IMHO worth mentioning that "we're introducing the context
by first moving the per-node statistics into ...", or something like that.

> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 79 ++++++++++++++++++++++++++++---------------------
>  1 file changed, 46 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2933b13fc975..9d42fa330812 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>  
>  #define MAX_PTE_MAPPED_THP 8
>  
> +struct collapse_control {
> +	/* Num pages scanned per node */
> +	int node_load[MAX_NUMNODES];
> +
> +	/* Last target selected in khugepaged_find_target_node() for this scan */

Not really important, but.. iiuc this is not only for this scan but for all
scans of the khugepaged thread.

> +	int last_target_node;
> +};

[...]

> @@ -829,28 +835,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  }
>  
>  #ifdef CONFIG_NUMA
> -static int khugepaged_find_target_node(void)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
> -	static int last_khugepaged_target_node = NUMA_NO_NODE;
>  	int nid, target_node = 0, max_value = 0;
>  
>  	/* find first node with max normal pages hit */
>  	for (nid = 0; nid < MAX_NUMNODES; nid++)
> -		if (khugepaged_node_load[nid] > max_value) {
> -			max_value = khugepaged_node_load[nid];
> +		if (cc->node_load[nid] > max_value) {
> +			max_value = cc->node_load[nid];
>  			target_node = nid;
>  		}
>  
>  	/* do some balance if several nodes have the same hit record */
> -	if (target_node <= last_khugepaged_target_node)
> -		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
> -				nid++)
> -			if (max_value == khugepaged_node_load[nid]) {
> +	if (target_node <= cc->last_target_node)
> +		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> +		     nid++) {
> +			if (max_value == cc->node_load[nid]) {
>  				target_node = nid;
>  				break;
>  			}
> +		}

I'm not sure what the coding style is for this case, but IIUC we could
either add braces to both (the outer "if" and the "for") or to neither;
adding just one pair of brackets seems a bit odd.

>  
> -	last_khugepaged_target_node = target_node;
> +	cc->last_target_node = target_node;
>  	return target_node;
>  }

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-26 14:44 ` [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
@ 2022-04-27 20:04   ` Peter Xu
  2022-04-28  0:51     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27 20:04 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Tue, Apr 26, 2022 at 07:44:03AM -0700, Zach O'Keefe wrote:
> Add hugepage allocation context to struct collapse_context, allowing
> different collapse contexts to allocate hugepages differently.  For
> example, khugepaged decides to allocate differently in NUMA and UMA
> configurations, and other collapse contexts shouldn't be coupled to this
> decision.  Likewise for the gfp flags used for said allocation.
> 
> Additionally, move [pre]allocated hugepage pointer into
> struct collapse_context.
> 
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> Reported-by: kernel test robot <lkp@intel.com>

(Please remember to remove this one too, and the rest ones.. :)

> ---
>  mm/khugepaged.c | 102 ++++++++++++++++++++++++------------------------
>  1 file changed, 52 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9d42fa330812..c4962191d6e1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -92,6 +92,11 @@ struct collapse_control {
>  
>  	/* Last target selected in khugepaged_find_target_node() for this scan */
>  	int last_target_node;
> +
> +	struct page *hpage;
> +	gfp_t (*gfp)(void);
> +	struct page* (*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
> +				    int node);

Nit: s/page* /page */ looks a bit more consistent..

>  };
>  
>  /**
> @@ -877,21 +882,21 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>  	return true;
>  }
>  
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> +					  gfp_t gfp, int node)

Nit: I'd suggest we put collapse_control* either at the start or at the
end, and keep doing it when possible.

IIRC you used to put it always at the end, but now it's at the start.  Not
a big deal but it'll be nice to keep the code self-aligned.

>  {
> -	VM_BUG_ON_PAGE(*hpage, *hpage);
> +	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
>  
> -	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> -	if (unlikely(!*hpage)) {
> +	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> +	if (unlikely(!cc->hpage)) {
>  		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> -		*hpage = ERR_PTR(-ENOMEM);
> +		cc->hpage = ERR_PTR(-ENOMEM);
>  		return NULL;
>  	}
>  
> -	prep_transhuge_page(*hpage);
> +	prep_transhuge_page(cc->hpage);
>  	count_vm_event(THP_COLLAPSE_ALLOC);
> -	return *hpage;
> +	return cc->hpage;
>  }
>  #else
>  static int khugepaged_find_target_node(struct collapse_control *cc)
> @@ -953,12 +958,12 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>  	return true;
>  }
>  
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> +					  gfp_t gfp, int node)
>  {
> -	VM_BUG_ON(!*hpage);
> +	VM_BUG_ON(!cc->hpage);
>  
> -	return  *hpage;
> +	return cc->hpage;
>  }
>  #endif
>  
> @@ -1080,10 +1085,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  	return true;
>  }
>  
> -static void collapse_huge_page(struct mm_struct *mm,
> -				   unsigned long address,
> -				   struct page **hpage,
> -				   int node, int referenced, int unmapped)
> +static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> +			       struct collapse_control *cc, int referenced,
> +			       int unmapped)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1096,11 +1100,12 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
>  	const struct cpumask *cpumask;
> +	int node;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/* Only allocate from the target node */
> -	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +	gfp = cc->gfp() | __GFP_THISNODE;
>  
>  	/*
>  	 * Before allocating the hugepage, release the mmap_lock read lock.
> @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	mmap_read_unlock(mm);
>  
> +	node = khugepaged_find_target_node(cc);
>  	/* sched to specified node before huage page memory copy */
>  	if (task_node(current) != node) {
>  		cpumask = cpumask_of_node(node);
>  		if (!cpumask_empty(cpumask))
>  			set_cpus_allowed_ptr(current, cpumask);
>  	}
> -	new_page = khugepaged_alloc_page(hpage, gfp, node);
> +	new_page = cc->alloc_hpage(cc, gfp, node);

AFAICT you removed all references to khugepaged_alloc_page() in this patch,
so then you'd better drop the function for both NUMA and UMA.

That said, I think it's actually better to keep them, as they do useful
things.  For example, AFAICT this ->alloc_hpage() change can already leak
the hpage allocated for UMA, so that's why I think keeping them makes
sense; then iiuc the BUG_ON would trigger with UMA already.

I saw that you've moved khugepaged_find_target_node() into this function,
which definitely looks good; but if we keep khugepaged_alloc_page(), then
IMHO we could even move khugepaged_find_target_node() into it and drop
the "node" parameter from khugepaged_alloc_page().

I'd even consider moving cc->gfp() all into it if possible, since the gfp
seems to be always with __GFP_THISNODE anyways.

What do you think?
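
Something like this, perhaps (pseudo only, won't compile as posted; the UMA
variant would keep doing what it does today):

static struct page *khugepaged_alloc_page(struct collapse_control *cc,
					  gfp_t gfp)
{
	int node = khugepaged_find_target_node(cc);

	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);

	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
	if (unlikely(!cc->hpage)) {
		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		cc->hpage = ERR_PTR(-ENOMEM);
		return NULL;
	}

	prep_transhuge_page(cc->hpage);
	count_vm_event(THP_COLLAPSE_ALLOC);
	return cc->hpage;
}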

>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>  		goto out_nolock;
> @@ -1238,15 +1244,15 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	update_mmu_cache_pmd(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
>  
> -	*hpage = NULL;
> +	cc->hpage = NULL;
>  
>  	khugepaged_pages_collapsed++;
>  	result = SCAN_SUCCEED;
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> -	if (!IS_ERR_OR_NULL(*hpage))
> -		mem_cgroup_uncharge(page_folio(*hpage));
> +	if (!IS_ERR_OR_NULL(cc->hpage))
> +		mem_cgroup_uncharge(page_folio(cc->hpage));
>  	trace_mm_collapse_huge_page(mm, isolated, result);
>  	return;
>  }
> @@ -1254,7 +1260,6 @@ static void collapse_huge_page(struct mm_struct *mm,
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			       struct vm_area_struct *vma,
>  			       unsigned long address,
> -			       struct page **hpage,
>  			       struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
> @@ -1399,10 +1404,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret) {
> -		node = khugepaged_find_target_node(cc);
>  		/* collapse_huge_page will return with the mmap_lock released */
> -		collapse_huge_page(mm, address, hpage, node,
> -				referenced, unmapped);
> +		collapse_huge_page(mm, address, cc, referenced, unmapped);
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -1667,8 +1670,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   * @mm: process address space where collapse happens
>   * @file: file that collapse on
>   * @start: collapse start address
> - * @hpage: new allocated huge page for collapse
> - * @node: appointed node the new huge page allocate from
> + * @cc: collapse context and scratchpad
>   *
>   * Basic scheme is simple, details are more complex:
>   *  - allocate and lock a new huge page;
> @@ -1686,8 +1688,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   *    + unlock and free huge page;
>   */
>  static void collapse_file(struct mm_struct *mm,
> -		struct file *file, pgoff_t start,
> -		struct page **hpage, int node)
> +			  struct file *file, pgoff_t start,
> +			  struct collapse_control *cc)
>  {
>  	struct address_space *mapping = file->f_mapping;
>  	gfp_t gfp;
> @@ -1697,15 +1699,16 @@ static void collapse_file(struct mm_struct *mm,
>  	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
>  	int nr_none = 0, result = SCAN_SUCCEED;
>  	bool is_shmem = shmem_file(file);
> -	int nr;
> +	int nr, node;
>  
>  	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>  	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>  
>  	/* Only allocate from the target node */
> -	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +	gfp = cc->gfp() | __GFP_THISNODE;
> +	node = khugepaged_find_target_node(cc);
>  
> -	new_page = khugepaged_alloc_page(hpage, gfp, node);
> +	new_page = cc->alloc_hpage(cc, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>  		goto out;
> @@ -1998,7 +2001,7 @@ static void collapse_file(struct mm_struct *mm,
>  		 * Remove pte page tables, so we can re-fault the page as huge.
>  		 */
>  		retract_page_tables(mapping, start);
> -		*hpage = NULL;
> +		cc->hpage = NULL;
>  
>  		khugepaged_pages_collapsed++;
>  	} else {
> @@ -2045,14 +2048,14 @@ static void collapse_file(struct mm_struct *mm,
>  	unlock_page(new_page);
>  out:
>  	VM_BUG_ON(!list_empty(&pagelist));
> -	if (!IS_ERR_OR_NULL(*hpage))
> -		mem_cgroup_uncharge(page_folio(*hpage));
> +	if (!IS_ERR_OR_NULL(cc->hpage))
> +		mem_cgroup_uncharge(page_folio(cc->hpage));
>  	/* TODO: tracepoints */
>  }
>  
>  static void khugepaged_scan_file(struct mm_struct *mm,
> -		struct file *file, pgoff_t start, struct page **hpage,
> -		struct collapse_control *cc)
> +				 struct file *file, pgoff_t start,
> +				 struct collapse_control *cc)
>  {
>  	struct page *page = NULL;
>  	struct address_space *mapping = file->f_mapping;
> @@ -2125,8 +2128,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>  			result = SCAN_EXCEED_NONE_PTE;
>  			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  		} else {
> -			node = khugepaged_find_target_node(cc);
> -			collapse_file(mm, file, start, hpage, node);
> +			collapse_file(mm, file, start, cc);
>  		}
>  	}
>  
> @@ -2134,8 +2136,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>  }
>  #else
>  static void khugepaged_scan_file(struct mm_struct *mm,
> -		struct file *file, pgoff_t start, struct page **hpage,
> -		struct collapse_control *cc)
> +				 struct file *file, pgoff_t start,
> +				 struct collapse_control *cc)
>  {
>  	BUILD_BUG();
>  }
> @@ -2146,7 +2148,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
>  #endif
>  
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> -					    struct page **hpage,
>  					    struct collapse_control *cc)
>  	__releases(&khugepaged_mm_lock)
>  	__acquires(&khugepaged_mm_lock)
> @@ -2223,12 +2224,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>  
>  				mmap_read_unlock(mm);
>  				ret = 1;
> -				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
> +				khugepaged_scan_file(mm, file, pgoff, cc);
>  				fput(file);
>  			} else {
>  				ret = khugepaged_scan_pmd(mm, vma,
> -						khugepaged_scan.address,
> -						hpage, cc);
> +						khugepaged_scan.address, cc);
>  			}
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2286,15 +2286,15 @@ static int khugepaged_wait_event(void)
>  
>  static void khugepaged_do_scan(struct collapse_control *cc)
>  {
> -	struct page *hpage = NULL;
>  	unsigned int progress = 0, pass_through_head = 0;
>  	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
>  	bool wait = true;
>  
> +	cc->hpage = NULL;
>  	lru_add_drain_all();
>  
>  	while (progress < pages) {
> -		if (!khugepaged_prealloc_page(&hpage, &wait))
> +		if (!khugepaged_prealloc_page(&cc->hpage, &wait))
>  			break;
>  
>  		cond_resched();
> @@ -2308,14 +2308,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
>  		if (khugepaged_has_work() &&
>  		    pass_through_head < 2)
>  			progress += khugepaged_scan_mm_slot(pages - progress,
> -							    &hpage, cc);
> +							    cc);
>  		else
>  			progress = pages;
>  		spin_unlock(&khugepaged_mm_lock);
>  	}
>  
> -	if (!IS_ERR_OR_NULL(hpage))
> -		put_page(hpage);
> +	if (!IS_ERR_OR_NULL(cc->hpage))
> +		put_page(cc->hpage);
>  }
>  
>  static bool khugepaged_should_wakeup(void)
> @@ -2349,6 +2349,8 @@ static int khugepaged(void *none)
>  	struct mm_slot *mm_slot;
>  	struct collapse_control cc = {
>  		.last_target_node = NUMA_NO_NODE,
> +		.gfp = &alloc_hugepage_khugepaged_gfpmask,
> +		.alloc_hpage = &khugepaged_alloc_page,

I'm also wondering whether we could avoid the gfp() hook altogether: logically
it can be inlined into the ->alloc_hpage() hook, and then we wouldn't need the
gfp parameter for ->alloc_hpage() anymore, which seems to be a win-win?
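
IOW something like (pseudo):

-	new_page = cc->alloc_hpage(cc, gfp, node);
+	new_page = cc->alloc_hpage(cc, node);

with each ->alloc_hpage() implementation applying its own gfpmask (khugepaged
vs. madvise) plus __GFP_THISNODE internally.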

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 04/12] mm/khugepaged: add struct collapse_result
  2022-04-26 14:44 ` [PATCH v3 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
@ 2022-04-27 20:47   ` Peter Xu
  2022-04-28 21:59     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27 20:47 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

[-- Attachment #1: Type: text/plain, Size: 1293 bytes --]

On Tue, Apr 26, 2022 at 07:44:04AM -0700, Zach O'Keefe wrote:
> +/* Gather information from one khugepaged_scan_[pmd|file]() request */
> +struct collapse_result {
> +	enum scan_result result;
> +
> +	/* Was mmap_lock dropped during request? */
> +	bool dropped_mmap_lock;
> +};

IMHO this new dropped_mmap_lock makes things a bit more complicated..

To me, the old code actually makes it easy to see when the lock is dropped:

  - For file, we always drop it
  - For anon, when khugepaged_scan_pmd() returns 1

That's fairly clear to me.

But I understand what you need: probably it's the "result" that was hidden
in the old code, and you'd need that for MADV_COLLAPSE?

I also noticed that major chunks of this patch are converting "result = XXX"
into "cr->result = XXX".  IMHO it's really not necessary to touch up all
those places.

I'm also wondering whether we could simply pass the result back via your
newly introduced collapse_control*.

In summary, this is what's in my mind..

  - Add collapse_control.result
  - Make sure both khugepaged_scan_file() and khugepaged_scan_pmd() simply
    remember to update cc->result before returning

I've attached a pseudo-code patch (note!  it's pseudo code and won't even
compile, but it shows what I meant..); so far I don't see a problem with it.

-- 
Peter Xu

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 2507 bytes --]

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7c5bb9fd1fb..bd341115d359 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1040,7 +1040,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
+static int collapse_huge_page(struct mm_struct *mm,
 				   unsigned long address,
 				   struct page **hpage,
 				   int node, int referenced, int unmapped)
@@ -1208,7 +1208,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!IS_ERR_OR_NULL(*hpage))
 		mem_cgroup_uncharge(page_folio(*hpage));
 	trace_mm_collapse_huge_page(mm, isolated, result);
-	return;
+	return result;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm,
@@ -1361,13 +1361,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
 		node = khugepaged_find_target_node();
-		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+		/*
+		 * collapse_huge_page will return with the mmap_lock
+		 * released.  It'll also further update cc->result if
+		 * anything wrong happens.
+		 */
+		result = collapse_huge_page(mm, address, hpage, node,
+					    referenced, unmapped, cc);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
 				     none_or_zero, result, unmapped);
+	/* Deliver the result to the caller */
+	cc->result = result;
 	return ret;
 }
 
@@ -1646,7 +1652,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static void collapse_file(struct mm_struct *mm,
+static int collapse_file(struct mm_struct *mm,
 		struct file *file, pgoff_t start,
 		struct page **hpage, int node)
 {
@@ -2009,6 +2015,7 @@ static void collapse_file(struct mm_struct *mm,
 	if (!IS_ERR_OR_NULL(*hpage))
 		mem_cgroup_uncharge(page_folio(*hpage));
 	/* TODO: tracepoints */
+	return result;
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
@@ -2086,11 +2093,12 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
 			node = khugepaged_find_target_node();
-			collapse_file(mm, file, start, hpage, node);
+			result = collapse_file(mm, file, start, hpage, node);
 		}
 	}
 
 	/* TODO: tracepoints */
+	cc->result = result;
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions
  2022-04-26 14:44 ` [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
@ 2022-04-27 21:10   ` Peter Xu
  2022-04-28 22:51     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27 21:10 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Tue, Apr 26, 2022 at 07:44:06AM -0700, Zach O'Keefe wrote:
> The following functions/tracepoints are shared between khugepaged and
> madvise collapse contexts.  Remove the khugepaged prefixes.
> 
> tracepoint:mm_khugepaged_scan_pmd -> tracepoint:mm_scan_pmd
> khugepaged_test_exit() -> test_exit()
> khugepaged_scan_abort() -> scan_abort()
> khugepaged_scan_pmd() -> scan_pmd()
> khugepaged_find_target_node() -> find_target_node()

The khugepaged prefix can indeed be confusing, but are the new names just
too short?  IMHO it'll be nice to still have some generic prefix.

Perhaps something like "hpage_collapse_"/"hpage_"/"hpc_"/..?
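
E.g. (illustration only, bikeshed the names as you like):

	khugepaged_test_exit()         -> hpage_collapse_test_exit()
	khugepaged_scan_abort()        -> hpage_collapse_scan_abort()
	khugepaged_scan_pmd()          -> hpage_collapse_scan_pmd()
	khugepaged_find_target_node()  -> hpage_collapse_find_target_node()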

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-04-26 14:44 ` [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
@ 2022-04-27 21:12   ` Peter Xu
  2022-04-29 14:26     ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-27 21:12 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Tue, Apr 26, 2022 at 07:44:07AM -0700, Zach O'Keefe wrote:
> @@ -2365,6 +2375,7 @@ static int khugepaged(void *none)
>  {
>  	struct mm_slot *mm_slot;
>  	struct collapse_control cc = {
> +		.enforce_pte_scan_limits = true,
>  		.last_target_node = NUMA_NO_NODE,
>  		.gfp = &alloc_hugepage_khugepaged_gfpmask,
>  		.alloc_hpage = &khugepaged_alloc_page,
> @@ -2512,6 +2523,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		     unsigned long start, unsigned long end)
>  {
>  	struct collapse_control cc = {
> +		.enforce_pte_scan_limits = false,
>  		.last_target_node = NUMA_NO_NODE,
>  		.hpage = NULL,
>  		.gfp = &alloc_hugepage_madvise_gfpmask,

This changes the semantics of the new madvise().  IMHO it'd be ideal to move
this patch before the introduction of MADV_COLLAPSE, so that the new
madvise() has consistent behavior from the start.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 02/12] mm/khugepaged: add struct collapse_control
  2022-04-27 19:49   ` Peter Xu
@ 2022-04-28  0:19     ` Zach O'Keefe
  0 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-28  0:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, Apr 27, 2022 at 12:49 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 26, 2022 at 07:44:02AM -0700, Zach O'Keefe wrote:
> > Modularize hugepage collapse by introducing struct collapse_control.
> > This structure serves to describe the properties of the requested
> > collapse, as well as serve as a local scratch pad to use during the
> > collapse itself.
>
> Reasonable, but IMHO worth mentioning that "we're introducing the context
> by first moving the per-node statistics into ...", or something like that.
>

Sounds good to me. Updated.

> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 79 ++++++++++++++++++++++++++++---------------------
> >  1 file changed, 46 insertions(+), 33 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2933b13fc975..9d42fa330812 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -86,6 +86,14 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
> >
> >  #define MAX_PTE_MAPPED_THP 8
> >
> > +struct collapse_control {
> > +     /* Num pages scanned per node */
> > +     int node_load[MAX_NUMNODES];
> > +
> > +     /* Last target selected in khugepaged_find_target_node() for this scan */
>
> Not really important, but.. iiuc this is not only for this scan but for all
> scans of the khugepaged thread.
>

Thanks for catching this. I've removed the "for this scan" bit to
avoid confusion.

> > +     int last_target_node;
> > +};
>
> [...]
>
> > @@ -829,28 +835,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > -static int khugepaged_find_target_node(void)
> > +static int khugepaged_find_target_node(struct collapse_control *cc)
> >  {
> > -     static int last_khugepaged_target_node = NUMA_NO_NODE;
> >       int nid, target_node = 0, max_value = 0;
> >
> >       /* find first node with max normal pages hit */
> >       for (nid = 0; nid < MAX_NUMNODES; nid++)
> > -             if (khugepaged_node_load[nid] > max_value) {
> > -                     max_value = khugepaged_node_load[nid];
> > +             if (cc->node_load[nid] > max_value) {
> > +                     max_value = cc->node_load[nid];
> >                       target_node = nid;
> >               }
> >
> >       /* do some balance if several nodes have the same hit record */
> > -     if (target_node <= last_khugepaged_target_node)
> > -             for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
> > -                             nid++)
> > -                     if (max_value == khugepaged_node_load[nid]) {
> > +     if (target_node <= cc->last_target_node)
> > +             for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> > +                  nid++) {
> > +                     if (max_value == cc->node_load[nid]) {
> >                               target_node = nid;
> >                               break;
> >                       }
> > +             }
>
> I'm not sure what's the coding style for this case, but IIUC we could
> either add both (to both outer "if" and "for") or none; adding one pair of
> brackets seems a bit odd.
>

Sorry about that - remnant of a change that was (partially) reverted.
I've removed the extra brackets to keep change minimal. Thank you.

> >
> > -     last_khugepaged_target_node = target_node;
> > +     cc->last_target_node = target_node;
> >       return target_node;
> >  }
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-27 20:04   ` Peter Xu
@ 2022-04-28  0:51     ` Zach O'Keefe
  2022-04-28 14:51       ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-28  0:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Wed, Apr 27, 2022 at 1:05 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 26, 2022 at 07:44:03AM -0700, Zach O'Keefe wrote:
> > Add hugepage allocation context to struct collapse_context, allowing
> > different collapse contexts to allocate hugepages differently.  For
> > example, khugepaged decides to allocate differently in NUMA and UMA
> > configurations, and other collapse contexts shouldn't be coupled to this
> > decision.  Likewise for the gfp flags used for said allocation.
> >
> > Additionally, move [pre]allocated hugepage pointer into
> > struct collapse_context.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > Reported-by: kernel test robot <lkp@intel.com>
>
> (Please remember to remove this one too, and the rest ones.. :)
>

Done :) Thanks

> > ---
> >  mm/khugepaged.c | 102 ++++++++++++++++++++++++------------------------
> >  1 file changed, 52 insertions(+), 50 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9d42fa330812..c4962191d6e1 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -92,6 +92,11 @@ struct collapse_control {
> >
> >       /* Last target selected in khugepaged_find_target_node() for this scan */
> >       int last_target_node;
> > +
> > +     struct page *hpage;
> > +     gfp_t (*gfp)(void);
> > +     struct page* (*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
> > +                                 int node);
>
> Nit: s/page* /page */ looks a bit more consistent..
>

Done - thanks for catching.

> >  };
> >
> >  /**
> > @@ -877,21 +882,21 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> >       return true;
> >  }
> >
> > -static struct page *
> > -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> > +                                       gfp_t gfp, int node)
>
> Nit: I'd suggest we put collapse_control* either at the start or at the
> end, and keep doing it when possible.
>
> IIRC you used to put it always at the end, but now it's at the start.  Not
> a big deal but it'll be nice to keep the code self-aligned.
>

Makes sense to me to be consistent. I'll move it to the end here and
look over the other patches to make an effort to do the same. General
strategy was to keep "output" arguments at the very end.

> >  {
> > -     VM_BUG_ON_PAGE(*hpage, *hpage);
> > +     VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
> >
> > -     *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> > -     if (unlikely(!*hpage)) {
> > +     cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> > +     if (unlikely(!cc->hpage)) {
> >               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> > -             *hpage = ERR_PTR(-ENOMEM);
> > +             cc->hpage = ERR_PTR(-ENOMEM);
> >               return NULL;
> >       }
> >
> > -     prep_transhuge_page(*hpage);
> > +     prep_transhuge_page(cc->hpage);
> >       count_vm_event(THP_COLLAPSE_ALLOC);
> > -     return *hpage;
> > +     return cc->hpage;
> >  }
> >  #else
> >  static int khugepaged_find_target_node(struct collapse_control *cc)
> > @@ -953,12 +958,12 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
> >       return true;
> >  }
> >
> > -static struct page *
> > -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> > +                                       gfp_t gfp, int node)
> >  {
> > -     VM_BUG_ON(!*hpage);
> > +     VM_BUG_ON(!cc->hpage);
> >
> > -     return  *hpage;
> > +     return cc->hpage;
> >  }
> >  #endif
> >
> > @@ -1080,10 +1085,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >       return true;
> >  }
> >
> > -static void collapse_huge_page(struct mm_struct *mm,
> > -                                unsigned long address,
> > -                                struct page **hpage,
> > -                                int node, int referenced, int unmapped)
> > +static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > +                            struct collapse_control *cc, int referenced,
> > +                            int unmapped)
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1096,11 +1100,12 @@ static void collapse_huge_page(struct mm_struct *mm,
> >       struct mmu_notifier_range range;
> >       gfp_t gfp;
> >       const struct cpumask *cpumask;
> > +     int node;
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> >       /* Only allocate from the target node */
> > -     gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > +     gfp = cc->gfp() | __GFP_THISNODE;
> >
> >       /*
> >        * Before allocating the hugepage, release the mmap_lock read lock.
> > @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> >        */
> >       mmap_read_unlock(mm);
> >
> > +     node = khugepaged_find_target_node(cc);
> >       /* sched to specified node before huage page memory copy */
> >       if (task_node(current) != node) {
> >               cpumask = cpumask_of_node(node);
> >               if (!cpumask_empty(cpumask))
> >                       set_cpus_allowed_ptr(current, cpumask);
> >       }
> > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > +     new_page = cc->alloc_hpage(cc, gfp, node);
>
> AFAICT you removed all references of khugepaged_alloc_page() in this patch,
> then you'd better drop the function for both NUMA and UMA.
>

Sorry, I'm not sure I follow. In khugepaged context, logic WRT
khugepaged_alloc_page() is unchanged - it's just called indirectly
through ->alloc_hpage().

> Said that, I think it's actually better to keep them, as they do things
> useful.  For example,AFAICT this ->alloc_hpage() change can leak the hpage
> alloacted for UMA already so that's why I think keeping them makes sense,
> then iiuc the BUG_ON would trigger with UMA already.
>
> I saw that you've moved khugepaged_find_target_node() into this function
> which looks definitely good, but if we could keep khugepaged_alloc_page()
> and then IMHO we could even move khugepaged_find_target_node() into that
> and drop the "node" parameter in khugepaged_alloc_page().
>

I actually had done this, until commit 4272063db38c ("mm/khugepaged:
sched to numa node when collapse huge page") which forced me to keep
"node" visible in this function.

> I'd even consider moving cc->gfp() all into it if possible, since the gfp
> seems to be always with __GFP_THISNODE anyways.
>

I would have liked to do this, but the gfp flags are referenced again
when calling mem_cgroup_charge(), so I couldn't quite get rid of them
from here.
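
I.e. the call sites still look roughly like this (quoting collapse_huge_page()
from memory, so treat it as a sketch):

	gfp = cc->gfp() | __GFP_THISNODE;
	...
	new_page = cc->alloc_hpage(cc, gfp, node);
	...
	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
		result = SCAN_CGROUP_CHARGE_FAIL;
		goto out_nolock;
	}

so the caller still needs the gfpmask even if the allocation hook computed
its own.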

> What do you think?
>
> >       if (!new_page) {
> >               result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> >               goto out_nolock;
> > @@ -1238,15 +1244,15 @@ static void collapse_huge_page(struct mm_struct *mm,
> >       update_mmu_cache_pmd(vma, address, pmd);
> >       spin_unlock(pmd_ptl);
> >
> > -     *hpage = NULL;
> > +     cc->hpage = NULL;
> >
> >       khugepaged_pages_collapsed++;
> >       result = SCAN_SUCCEED;
> >  out_up_write:
> >       mmap_write_unlock(mm);
> >  out_nolock:
> > -     if (!IS_ERR_OR_NULL(*hpage))
> > -             mem_cgroup_uncharge(page_folio(*hpage));
> > +     if (!IS_ERR_OR_NULL(cc->hpage))
> > +             mem_cgroup_uncharge(page_folio(cc->hpage));
> >       trace_mm_collapse_huge_page(mm, isolated, result);
> >       return;
> >  }
> > @@ -1254,7 +1260,6 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                              struct vm_area_struct *vma,
> >                              unsigned long address,
> > -                            struct page **hpage,
> >                              struct collapse_control *cc)
> >  {
> >       pmd_t *pmd;
> > @@ -1399,10 +1404,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >       pte_unmap_unlock(pte, ptl);
> >       if (ret) {
> > -             node = khugepaged_find_target_node(cc);
> >               /* collapse_huge_page will return with the mmap_lock released */
> > -             collapse_huge_page(mm, address, hpage, node,
> > -                             referenced, unmapped);
> > +             collapse_huge_page(mm, address, cc, referenced, unmapped);
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > @@ -1667,8 +1670,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >   * @mm: process address space where collapse happens
> >   * @file: file that collapse on
> >   * @start: collapse start address
> > - * @hpage: new allocated huge page for collapse
> > - * @node: appointed node the new huge page allocate from
> > + * @cc: collapse context and scratchpad
> >   *
> >   * Basic scheme is simple, details are more complex:
> >   *  - allocate and lock a new huge page;
> > @@ -1686,8 +1688,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >   *    + unlock and free huge page;
> >   */
> >  static void collapse_file(struct mm_struct *mm,
> > -             struct file *file, pgoff_t start,
> > -             struct page **hpage, int node)
> > +                       struct file *file, pgoff_t start,
> > +                       struct collapse_control *cc)
> >  {
> >       struct address_space *mapping = file->f_mapping;
> >       gfp_t gfp;
> > @@ -1697,15 +1699,16 @@ static void collapse_file(struct mm_struct *mm,
> >       XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> >       int nr_none = 0, result = SCAN_SUCCEED;
> >       bool is_shmem = shmem_file(file);
> > -     int nr;
> > +     int nr, node;
> >
> >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> >       VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> >
> >       /* Only allocate from the target node */
> > -     gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > +     gfp = cc->gfp() | __GFP_THISNODE;
> > +     node = khugepaged_find_target_node(cc);
> >
> > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > +     new_page = cc->alloc_hpage(cc, gfp, node);
> >       if (!new_page) {
> >               result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> >               goto out;
> > @@ -1998,7 +2001,7 @@ static void collapse_file(struct mm_struct *mm,
> >                * Remove pte page tables, so we can re-fault the page as huge.
> >                */
> >               retract_page_tables(mapping, start);
> > -             *hpage = NULL;
> > +             cc->hpage = NULL;
> >
> >               khugepaged_pages_collapsed++;
> >       } else {
> > @@ -2045,14 +2048,14 @@ static void collapse_file(struct mm_struct *mm,
> >       unlock_page(new_page);
> >  out:
> >       VM_BUG_ON(!list_empty(&pagelist));
> > -     if (!IS_ERR_OR_NULL(*hpage))
> > -             mem_cgroup_uncharge(page_folio(*hpage));
> > +     if (!IS_ERR_OR_NULL(cc->hpage))
> > +             mem_cgroup_uncharge(page_folio(cc->hpage));
> >       /* TODO: tracepoints */
> >  }
> >
> >  static void khugepaged_scan_file(struct mm_struct *mm,
> > -             struct file *file, pgoff_t start, struct page **hpage,
> > -             struct collapse_control *cc)
> > +                              struct file *file, pgoff_t start,
> > +                              struct collapse_control *cc)
> >  {
> >       struct page *page = NULL;
> >       struct address_space *mapping = file->f_mapping;
> > @@ -2125,8 +2128,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
> >                       result = SCAN_EXCEED_NONE_PTE;
> >                       count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >               } else {
> > -                     node = khugepaged_find_target_node(cc);
> > -                     collapse_file(mm, file, start, hpage, node);
> > +                     collapse_file(mm, file, start, cc);
> >               }
> >       }
> >
> > @@ -2134,8 +2136,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
> >  }
> >  #else
> >  static void khugepaged_scan_file(struct mm_struct *mm,
> > -             struct file *file, pgoff_t start, struct page **hpage,
> > -             struct collapse_control *cc)
> > +                              struct file *file, pgoff_t start,
> > +                              struct collapse_control *cc)
> >  {
> >       BUILD_BUG();
> >  }
> > @@ -2146,7 +2148,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
> >  #endif
> >
> >  static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > -                                         struct page **hpage,
> >                                           struct collapse_control *cc)
> >       __releases(&khugepaged_mm_lock)
> >       __acquires(&khugepaged_mm_lock)
> > @@ -2223,12 +2224,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >
> >                               mmap_read_unlock(mm);
> >                               ret = 1;
> > -                             khugepaged_scan_file(mm, file, pgoff, hpage, cc);
> > +                             khugepaged_scan_file(mm, file, pgoff, cc);
> >                               fput(file);
> >                       } else {
> >                               ret = khugepaged_scan_pmd(mm, vma,
> > -                                             khugepaged_scan.address,
> > -                                             hpage, cc);
> > +                                             khugepaged_scan.address, cc);
> >                       }
> >                       /* move to next address */
> >                       khugepaged_scan.address += HPAGE_PMD_SIZE;
> > @@ -2286,15 +2286,15 @@ static int khugepaged_wait_event(void)
> >
> >  static void khugepaged_do_scan(struct collapse_control *cc)
> >  {
> > -     struct page *hpage = NULL;
> >       unsigned int progress = 0, pass_through_head = 0;
> >       unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
> >       bool wait = true;
> >
> > +     cc->hpage = NULL;
> >       lru_add_drain_all();
> >
> >       while (progress < pages) {
> > -             if (!khugepaged_prealloc_page(&hpage, &wait))
> > +             if (!khugepaged_prealloc_page(&cc->hpage, &wait))
> >                       break;
> >
> >               cond_resched();
> > @@ -2308,14 +2308,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
> >               if (khugepaged_has_work() &&
> >                   pass_through_head < 2)
> >                       progress += khugepaged_scan_mm_slot(pages - progress,
> > -                                                         &hpage, cc);
> > +                                                         cc);
> >               else
> >                       progress = pages;
> >               spin_unlock(&khugepaged_mm_lock);
> >       }
> >
> > -     if (!IS_ERR_OR_NULL(hpage))
> > -             put_page(hpage);
> > +     if (!IS_ERR_OR_NULL(cc->hpage))
> > +             put_page(cc->hpage);
> >  }
> >
> >  static bool khugepaged_should_wakeup(void)
> > @@ -2349,6 +2349,8 @@ static int khugepaged(void *none)
> >       struct mm_slot *mm_slot;
> >       struct collapse_control cc = {
> >               .last_target_node = NUMA_NO_NODE,
> > +             .gfp = &alloc_hugepage_khugepaged_gfpmask,
> > +             .alloc_hpage = &khugepaged_alloc_page,
>
> I'm also wondering whether we could avoid the gfp() hook, because logically
> that can be inlined into the ->alloc_hpage() hook, then we don't need gfp
> parameter anymore for ->alloc_hpage(), which seems to be a win-win?
>

As mentioned above, I would have liked to not have this, but the gfp
flags are used in a couple places (mem_cgroup_charge() and
->alloc_hpage()).

I could (?) make .gfp of type gfp_t and just update it on every
khugepaged scan (in case it changed) and also remove the gfp parameter
for ->alloc_hpage().

Thank you, Peter!

Zach

> --
> Peter Xu
>



* Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-28  0:51     ` Zach O'Keefe
@ 2022-04-28 14:51       ` Peter Xu
  2022-04-28 15:37         ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-28 14:51 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Wed, Apr 27, 2022 at 05:51:53PM -0700, Zach O'Keefe wrote:
> > > @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >        */
> > >       mmap_read_unlock(mm);
> > >
> > > +     node = khugepaged_find_target_node(cc);
> > >       /* sched to specified node before huage page memory copy */
> > >       if (task_node(current) != node) {
> > >               cpumask = cpumask_of_node(node);
> > >               if (!cpumask_empty(cpumask))
> > >                       set_cpus_allowed_ptr(current, cpumask);
> > >       }
> > > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > > +     new_page = cc->alloc_hpage(cc, gfp, node);
> >
> > AFAICT you removed all references of khugepaged_alloc_page() in this patch,
> > then you'd better drop the function for both NUMA and UMA.
> >
> 
> Sorry, I'm not sure I follow. In khugepaged context, logic WRT
> khugepaged_alloc_page() is unchanged - it's just called indirectly
> through ->alloc_hpage().

Ah you're right, sorry for the confusion.

> 
> > Said that, I think it's actually better to keep them, as they do things
> > useful.  For example, AFAICT this ->alloc_hpage() change can leak the hpage
> > allocated for UMA already so that's why I think keeping them makes sense,
> > then iiuc the BUG_ON would trigger with UMA already.
> >
> > I saw that you've moved khugepaged_find_target_node() into this function
> > which looks definitely good, but if we could keep khugepaged_alloc_page()
> > and then IMHO we could even move khugepaged_find_target_node() into that
> > and drop the "node" parameter in khugepaged_alloc_page().
> >
> 
> I actually had done this, until commit 4272063db38c ("mm/khugepaged:
> sched to numa node when collapse huge page") which forced me to keep
> "node" visible in this function.

Right, I was looking at my local tree and that patch seems to have been
added to -next only very recently.

I'm curious why it wasn't applied to file THPs too, if it is worthwhile,
since if so that's also a suitable candidate to be moved into the same
hook.  I've asked in that thread instead.

Before that, feel free to keep your code as is.

> 
> > I'd even consider moving cc->gfp() all into it if possible, since the gfp
> > seems to be always with __GFP_THISNODE anyways.
> >
> 
> I would have liked to do this, but the gfp flags are referenced again
> when calling mem_cgroup_charge(), so I couldn't quite get rid of them
> from here.

Yeah, maybe we could move mem_cgroup_charge() into the hook too?  The code
below is duplicated between file & anon, and IMHO it's already a good
candidate for a new helper anyway:

	/* Only allocate from the target node */
	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

	new_page = khugepaged_alloc_page(hpage, gfp, node);
	if (!new_page) {
		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
		goto out;
	}

	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
		result = SCAN_CGROUP_CHARGE_FAIL;
		goto out;
	}
	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

If we want to generalize it maybe we want to return the "result" instead of
the new page though, so the newpage can be passed in as a pointer.

There's one mmap_read_unlock(mm) for the anonymous code but I think that
can simply be moved above the chunk.
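
For illustration, such a helper could look roughly like this (the name and
exact signature below are just a sketch, nothing settled):

        static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
                                      struct collapse_control *cc)
        {
                gfp_t gfp = cc->gfp() | __GFP_THISNODE;
                int node = khugepaged_find_target_node(cc);
                struct page *new_page;

                /* Only allocate from the target node */
                new_page = cc->alloc_hpage(cc, gfp, node);
                if (!new_page)
                        return SCAN_ALLOC_HUGE_PAGE_FAIL;
                /* record it so the callers' existing error paths can clean up */
                *hpage = new_page;

                if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp)))
                        return SCAN_CGROUP_CHARGE_FAIL;

                count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
                return SCAN_SUCCEED;
        }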

No strong opinion here, please feel free to think about what's the best
approach for landing this series.

[...]

> 
> I could (?) make .gfp of type gfp_t and just update it on every
> khugepaged scan (in case it changed) and also remove the gfp parameter
> for ->alloc_hpage().

If that's the case I'd prefer you keep your code as-is; gfp() is perfectly
fine and gfp() is light, I'm afraid that caching thing could make things
complicated.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-28 14:51       ` Peter Xu
@ 2022-04-28 15:37         ` Zach O'Keefe
  2022-04-28 15:52           ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-28 15:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Thu, Apr 28, 2022 at 7:51 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Apr 27, 2022 at 05:51:53PM -0700, Zach O'Keefe wrote:
> > > > @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >        */
> > > >       mmap_read_unlock(mm);
> > > >
> > > > +     node = khugepaged_find_target_node(cc);
> > > >       /* sched to specified node before huage page memory copy */
> > > >       if (task_node(current) != node) {
> > > >               cpumask = cpumask_of_node(node);
> > > >               if (!cpumask_empty(cpumask))
> > > >                       set_cpus_allowed_ptr(current, cpumask);
> > > >       }
> > > > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > > > +     new_page = cc->alloc_hpage(cc, gfp, node);
> > >
> > > AFAICT you removed all references of khugepaged_alloc_page() in this patch,
> > > then you'd better drop the function for both NUMA and UMA.
> > >
> >
> > Sorry, I'm not sure I follow. In khugepaged context, logic WRT
> > khugepaged_alloc_page() is unchanged - it's just called indirectly
> > through ->alloc_hpage().
>
> Ah you're right, sorry for the confusion.
>
> >
> > > Said that, I think it's actually better to keep them, as they do things
> > > useful.  For example, AFAICT this ->alloc_hpage() change can leak the hpage
> > > allocated for UMA already so that's why I think keeping them makes sense,
> > > then iiuc the BUG_ON would trigger with UMA already.
> > >
> > > I saw that you've moved khugepaged_find_target_node() into this function
> > > which looks definitely good, but if we could keep khugepaged_alloc_page()
> > > and then IMHO we could even move khugepaged_find_target_node() into that
> > > and drop the "node" parameter in khugepaged_alloc_page().
> > >
> >
> > I actually had done this, until commit 4272063db38c ("mm/khugepaged:
> > sched to numa node when collapse huge page") which forced me to keep
> > "node" visible in this function.
>
> Right, I was looking at my local tree and that patch seems to have been
> added to -next only very recently.
>
> I'm curious why it wasn't applied to file THPs too, if it is worthwhile,
> since if so that's also a suitable candidate to be moved into the same
> hook.  I've asked in that thread instead.
>
> Before that, feel free to keep your code as is.
>

Thanks for checking on that. On second thought, it seems like we would
actually want to move this sched logic into the khugepaged hook impl
since we probably don't want to be moving around the calling process
if in madvise context.
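
Roughly, that would mean folding the quoted sched-to-node snippet into the
khugepaged-side ->alloc_hpage() implementation, e.g. (sketch only, the
wrapper below is hypothetical and follows the hook signature from the quoted
hunks):

        static struct page *khugepaged_sched_alloc_hpage(struct collapse_control *cc,
                                                         gfp_t gfp, int node)
        {
                const struct cpumask *cpumask;

                /* sched to the target node before the hugepage memory copy */
                if (task_node(current) != node) {
                        cpumask = cpumask_of_node(node);
                        if (!cpumask_empty(cpumask))
                                set_cpus_allowed_ptr(current, cpumask);
                }
                return khugepaged_alloc_page(cc, gfp, node);
        }

        /* the madvise-side hook would just allocate, without touching affinity */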

> >
> > > I'd even consider moving cc->gfp() all into it if possible, since the gfp
> > > seems to be always with __GFP_THISNODE anyways.
> > >
> >
> > I would have liked to do this, but the gfp flags are referenced again
> > when calling mem_cgroup_charge(), so I couldn't quite get rid of them
> > from here.
>
> Yeah, maybe we could move mem_cgroup_charge() into the hook too?  The code
> below is duplicated between file & anon, and IMHO it's already a good
> candidate for a new helper anyway:
>
>         /* Only allocate from the target node */
>         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>
>         new_page = khugepaged_alloc_page(hpage, gfp, node);
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>                 goto out;
>         }
>
>         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
>                 result = SCAN_CGROUP_CHARGE_FAIL;
>                 goto out;
>         }
>         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
>
> If we want to generalize it maybe we want to return the "result" instead of
> the new page though, so the newpage can be passed in as a pointer.
>
> There's one mmap_read_unlock(mm) for the anonymous code but I think that
> can simply be moved above the chunk.
>
> No strong opinion here, please feel free to think about what's the best
> approach for landing this series.
>

Thanks for the suggestion. I'll play around with it a little and see
what looks good.

> [...]
>
> >
> > I could (?) make .gfp of type gfp_t and just update it on every
> > khugepaged scan (in case it changed) and also remove the gfp parameter
> > for ->alloc_hpage().
>
> If that's the case I'd prefer you keep your code as-is; gfp() is perfectly
> fine and gfp() is light, I'm afraid that caching thing could make things
> complicated.
>

Ack.

Thanks again for your time and thoughts, Peter!

Best,
Zach

> Thanks,
>
> --
> Peter Xu
>



* Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
  2022-04-28 15:37         ` Zach O'Keefe
@ 2022-04-28 15:52           ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2022-04-28 15:52 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Thu, Apr 28, 2022 at 08:37:06AM -0700, Zach O'Keefe wrote:
> On Thu, Apr 28, 2022 at 7:51 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 05:51:53PM -0700, Zach O'Keefe wrote:
> > > > > @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > >        */
> > > > >       mmap_read_unlock(mm);
> > > > >
> > > > > +     node = khugepaged_find_target_node(cc);
> > > > >       /* sched to specified node before huage page memory copy */
> > > > >       if (task_node(current) != node) {
> > > > >               cpumask = cpumask_of_node(node);
> > > > >               if (!cpumask_empty(cpumask))
> > > > >                       set_cpus_allowed_ptr(current, cpumask);
> > > > >       }
> > > > > -     new_page = khugepaged_alloc_page(hpage, gfp, node);
> > > > > +     new_page = cc->alloc_hpage(cc, gfp, node);
> > > >
> > > > AFAICT you removed all references of khugepaged_alloc_page() in this patch,
> > > > then you'd better drop the function for both NUMA and UMA.
> > > >
> > >
> > > Sorry, I'm not sure I follow. In khugepaged context, logic WRT
> > > khugepaged_alloc_page() is unchanged - it's just called indirectly
> > > through ->alloc_hpage().
> >
> > Ah you're right, sorry for the confusion.
> >
> > >
> > > > Said that, I think it's actually better to keep them, as they do things
> > > > useful.  For example, AFAICT this ->alloc_hpage() change can leak the hpage
> > > > allocated for UMA already so that's why I think keeping them makes sense,
> > > > then iiuc the BUG_ON would trigger with UMA already.
> > > >
> > > > I saw that you've moved khugepaged_find_target_node() into this function
> > > > which looks definitely good, but if we could keep khugepaged_alloc_page()
> > > > and then IMHO we could even move khugepaged_find_target_node() into that
> > > > and drop the "node" parameter in khugepaged_alloc_page().
> > > >
> > >
> > > I actually had done this, until commit 4272063db38c ("mm/khugepaged:
> > > sched to numa node when collapse huge page") which forced me to keep
> > > "node" visible in this function.
> >
> > Right, I was looking at my local tree and that patch seems to have been
> > added to -next only very recently.
> >
> > I'm curious why it wasn't applied to file THPs too, if it is worthwhile,
> > since if so that's also a suitable candidate to be moved into the same
> > hook.  I've asked in that thread instead.
> >
> > Before that, feel free to keep your code as is.
> >
> 
> Thanks for checking on that. On second thought, it seems like we would
> actually want to move this sched logic into the khugepaged hook impl
> since we probably don't want to be moving around the calling process
> if in madvise context.

Sounds correct; if it's moved over, it must only go into the new helper (if
you're going to introduce it, like below :) that is specific to khugepaged.

Actually I'm really starting to question that idea more..  e.g. even for
khugepaged the target node can change rapidly depending on the owner of the
pages planned to collapse.

Whether it makes sense to bounce the khugepaged thread itself still seems
questionable to me, and it's definitely not a good idea for the madvise()
context.  The only proof in that patch was a whole-testbench result, but I'm
not sure how reliable that will be when applied to real-world scenarios.

> 
> > >
> > > > I'd even consider moving cc->gfp() all into it if possible, since the gfp
> > > > seems to be always with __GFP_THISNODE anyways.
> > > >
> > >
> > > I would have liked to do this, but the gfp flags are referenced again
> > > when calling mem_cgroup_charge(), so I couldn't quite get rid of them
> > > from here.
> >
> > Yeah, maybe we could move mem_cgroup_charge() into the hook too?  The code
> > below is duplicated between file & anon, and IMHO it's already a good
> > candidate for a new helper anyway:
> >
> >         /* Only allocate from the target node */
> >         gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> >         new_page = khugepaged_alloc_page(hpage, gfp, node);
> >         if (!new_page) {
> >                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> >                 goto out;
> >         }
> >
> >         if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
> >                 result = SCAN_CGROUP_CHARGE_FAIL;
> >                 goto out;
> >         }
> >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> >
> > If we want to generalize it maybe we want to return the "result" instead of
> > the new page though, so the newpage can be passed in as a pointer.
> >
> > There's one mmap_read_unlock(mm) for the anonymous code but I think that
> > can simply be moved above the chunk.
> >
> > No strong opinion here, please feel free to think about what's the best
> > approach for landing this series.
> >
> 
> Thanks for the suggestion. I'll play around with it a little and see
> what looks good.

Sounds good!

If the new helper would be welcome, then it can be a standalone
prerequisite patch to clean things up first.  Let's also see how it goes
with the other patch, because there's a chance you can also clean up
khugepaged_find_target_node() in the same patch along with all of the above.

-- 
Peter Xu




* Re: [PATCH v3 04/12] mm/khugepaged: add struct collapse_result
  2022-04-27 20:47   ` Peter Xu
@ 2022-04-28 21:59     ` Zach O'Keefe
  2022-04-28 23:21       ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-28 21:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, Apr 27, 2022 at 1:47 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 26, 2022 at 07:44:04AM -0700, Zach O'Keefe wrote:
> > +/* Gather information from one khugepaged_scan_[pmd|file]() request */
> > +struct collapse_result {
> > +     enum scan_result result;
> > +
> > +     /* Was mmap_lock dropped during request? */
> > +     bool dropped_mmap_lock;
> > +};
>
> IMHO this new dropped_mmap_lock makes things a bit more complicated..
>
> To me, the old code actually makes it easy to see when the lock is dropped:
>
>   - For file, we always drop it
>   - For anon, when khugepaged_scan_pmd() returns 1
>
> That's fairly clear to me.
>

Agreed - but I know when I first read the code it wasn't obvious to me
what the return value of khugepaged_scan_pmd() was saying. I thought
giving it a name ("dropped_mmap_lock") would help future readers.

That said, I'm fine reverting this and going back to the old way of
doing things. It also means fewer members crowding struct collapse_control.

> But I understand what you need: the "result" was hidden away in the old
> code, but you'd need it for MADV_COLLAPSE?
>
> I also noticed that major chunks of this patch are converting "result = XXX"
> into "cr->result = XXX".  IMHO it's really not necessary to touch up all
> those places.
>
> I'm also wondering whether we could simply pass the result back via your
> newly introduced collapse_control*.
>
> In summary, this is what I have in mind..
>
>   - Add collapse_control.result
>   - Make sure both khugepaged_scan_file() and khugepaged_scan_pmd()
>     simply remember to update cc->result before they return
>
> I've attached a pseudo-code patch (note! it's pseudo-code and won't even
> compile, but it shows what I meant..), so far I don't see a problem with it.
>

Thanks for going so far as to throw this patch together! I've gone
ahead and used it with minimal changes. I agree that it's cleaner now.

> --
> Peter Xu



* Re: [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions
  2022-04-27 21:10   ` Peter Xu
@ 2022-04-28 22:51     ` Zach O'Keefe
  0 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-28 22:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer, kernel test robot

On Wed, Apr 27, 2022 at 2:10 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 26, 2022 at 07:44:06AM -0700, Zach O'Keefe wrote:
> > The following functions/tracepoints are shared between khugepaged and
> > madvise collapse contexts.  Remove the khugepaged prefixes.
> >
> > tracepoint:mm_khugepaged_scan_pmd -> tracepoint:mm_scan_pmd
> > khugepaged_test_exit() -> test_exit()
> > khugepaged_scan_abort() -> scan_abort()
> > khugepaged_scan_pmd() -> scan_pmd()
> > khugepaged_find_target_node() -> find_target_node()
>
> The khugepaged prefix can indeed be confusing, but are the new names just
> too short?  IMHO it'll be nice to still have some generic prefix.
>
> Perhaps something like "hpage_collapse_"/"hpage_"/"hpc_"/..?
>

If we want to keep a prefix, I'd throw my hat in with
"hpage_collapse_". I feel like "collapse" should be in there, and
"collapse_" on its own seems odd to me. Updated.

Thanks Peter.


> --
> Peter Xu
>



* Re: [PATCH v3 04/12] mm/khugepaged: add struct collapse_result
  2022-04-28 21:59     ` Zach O'Keefe
@ 2022-04-28 23:21       ` Peter Xu
  2022-04-29 16:01         ` Zach O'Keefe
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2022-04-28 23:21 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Thu, Apr 28, 2022 at 02:59:44PM -0700, Zach O'Keefe wrote:
> On Wed, Apr 27, 2022 at 1:47 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Apr 26, 2022 at 07:44:04AM -0700, Zach O'Keefe wrote:
> > > +/* Gather information from one khugepaged_scan_[pmd|file]() request */
> > > +struct collapse_result {
> > > +     enum scan_result result;
> > > +
> > > +     /* Was mmap_lock dropped during request? */
> > > +     bool dropped_mmap_lock;
> > > +};
> >
> > IMHO this new dropped_mmap_lock makes things a bit more complicated..
> >
> > To me, the old code actually makes it easy to see when the lock is dropped:
> >
> >   - For file, we always drop it
> >   - For anon, when khugepaged_scan_pmd() returns 1
> >
> > That's fairly clear to me.
> >
> 
> Agreed - but I know when I first read the code it wasn't obvious to me
> what the return value of khugepaged_scan_pmd() was saying. I thought
> giving it a name ("dropped_mmap_lock") would help future readers.

Yes, that's a fair point.  It wasn't that obvious to me either when I first
read it, but I am just worried that hiding it too deep could make things
complicated in another form.

Maybe we can rename "ret" in khugepaged_scan_mm_slot() to "lock_dropped",
or we can comment above khugepaged_scan_pmd() explaining the retval.
Or.. both?
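
Roughly, something like this (just a sketch of both options):

        /*
         * Returns 1 if khugepaged_scan_pmd() ended up dropping mmap_lock
         * (collapse_huge_page() releases it on the collapse path), 0
         * otherwise.
         */
        static int khugepaged_scan_pmd(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long address,
                                       struct collapse_control *cc);

        /* ...and in khugepaged_scan_mm_slot(): */
        int lock_dropped;

        lock_dropped = khugepaged_scan_pmd(mm, vma,
                                           khugepaged_scan.address, cc);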

-- 
Peter Xu




* Re: [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  2022-04-27 21:12   ` Peter Xu
@ 2022-04-29 14:26     ` Zach O'Keefe
  0 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-29 14:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Wed, Apr 27, 2022 at 2:13 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Apr 26, 2022 at 07:44:07AM -0700, Zach O'Keefe wrote:
> > @@ -2365,6 +2375,7 @@ static int khugepaged(void *none)
> >  {
> >       struct mm_slot *mm_slot;
> >       struct collapse_control cc = {
> > +             .enforce_pte_scan_limits = true,
> >               .last_target_node = NUMA_NO_NODE,
> >               .gfp = &alloc_hugepage_khugepaged_gfpmask,
> >               .alloc_hpage = &khugepaged_alloc_page,
> > @@ -2512,6 +2523,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                    unsigned long start, unsigned long end)
> >  {
> >       struct collapse_control cc = {
> > +             .enforce_pte_scan_limits = false,
> >               .last_target_node = NUMA_NO_NODE,
> >               .hpage = NULL,
> >               .gfp = &alloc_hugepage_madvise_gfpmask,
>
> This changes the semantics of the new madvise().  IMHO it'll be ideal if
> this patch is moved before the introduction of MADV_COLLAPSE, so the new
> madvise() will have a consistent behavior.
>

That makes sense to me, as this was how it was done in the RFC. I'll
assume this applies equally well to "mm/khugepaged: add flag to ignore
page young/referenced requirement" and move both before introducing
the new madvise(2) mode.

Thanks again Peter,
Zach

> --
> Peter Xu
>



* Re: [PATCH v3 04/12] mm/khugepaged: add struct collapse_result
  2022-04-28 23:21       ` Peter Xu
@ 2022-04-29 16:01         ` Zach O'Keefe
  0 siblings, 0 replies; 31+ messages in thread
From: Zach O'Keefe @ 2022-04-29 16:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Yang Shi, Zi Yan, linux-mm, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Thomas Bogendoerfer

On Thu, Apr 28, 2022 at 4:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Apr 28, 2022 at 02:59:44PM -0700, Zach O'Keefe wrote:
> > On Wed, Apr 27, 2022 at 1:47 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Apr 26, 2022 at 07:44:04AM -0700, Zach O'Keefe wrote:
> > > > +/* Gather information from one khugepaged_scan_[pmd|file]() request */
> > > > +struct collapse_result {
> > > > +     enum scan_result result;
> > > > +
> > > > +     /* Was mmap_lock dropped during request? */
> > > > +     bool dropped_mmap_lock;
> > > > +};
> > >
> > > IMHO this new dropped_mmap_lock makes things a bit more complicated..
> > >
> > > To me, the old code actually makes it easy to see when the lock is dropped:
> > >
> > >   - For file, we always drop it
> > >   - For anon, when khugepaged_scan_pmd() returns 1
> > >
> > > That's fairly clear to me.
> > >
> >
> > Agreed - but I know when I first read the code it wasn't obvious to me
> > what the return value of khugepaged_scan_pmd() was saying. I thought
> > giving it a name ("dropped_mmap_lock") would help future readers.
>
> Yes, that's a fair point.  It wasn't that obvious to me either when I first
> read it, but I am just worried that hiding it too deep could make things
> complicated in another form.
>
> Maybe we can rename "ret" in khugepaged_scan_mm_slot() to "lock_dropped",
> or we can comment above khugepaged_scan_pmd() explaining the retval.
> Or.. both?
>

So after continuing to play around with your suggestion for a bit
(thank you again), I found that with minimal effort, we could just
pipe the scan_result back through function return values.

The only point where this becomes awkward is for khugepaged_scan_pmd()
where the return value already has a specific meaning.

Solving both problems at once, we can both (a) remove another member
from struct collapse_control (result) by passing this value along as
the return value of functions, and (b) move this "we dropped the lock"
return value into a "bool *mmap_locked" arg to khugepaged_scan_pmd().

The resulting code rebased on this change is much cleaner overall.
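
In other words, very roughly (a sketch of where I'm heading, not the final
patch):

        /*
         * Returns one of the SCAN_* codes; *mmap_locked tells the caller
         * whether mmap_lock is still held on return.
         */
        static int khugepaged_scan_pmd(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long address,
                                       bool *mmap_locked,
                                       struct collapse_control *cc);

        /* caller side in khugepaged_scan_mm_slot(), roughly: */
        bool mmap_locked = true;
        int result;

        result = khugepaged_scan_pmd(mm, vma, khugepaged_scan.address,
                                     &mmap_locked, cc);
        if (!mmap_locked) {
                /* the lock was dropped on the collapse path; vma is stale */
        }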



> --
> Peter Xu
>




Thread overview: 31+ messages
2022-04-26 14:44 [PATCH v3 00/12] mm: userspace hugepage collapse Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-04-27  0:26   ` Peter Xu
2022-04-27 15:48     ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-04-27 19:49   ` Peter Xu
2022-04-28  0:19     ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-04-27 20:04   ` Peter Xu
2022-04-28  0:51     ` Zach O'Keefe
2022-04-28 14:51       ` Peter Xu
2022-04-28 15:37         ` Zach O'Keefe
2022-04-28 15:52           ` Peter Xu
2022-04-26 14:44 ` [PATCH v3 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
2022-04-27 20:47   ` Peter Xu
2022-04-28 21:59     ` Zach O'Keefe
2022-04-28 23:21       ` Peter Xu
2022-04-29 16:01         ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
2022-04-27 21:10   ` Peter Xu
2022-04-28 22:51     ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-04-27 21:12   ` Peter Xu
2022-04-29 14:26     ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
2022-04-26 20:23 ` [PATCH v3 00/12] mm: userspace hugepage collapse Andrew Morton
