* [RFC PATCH 00/14] mm: userspace hugepage collapse
@ 2022-03-08 21:34 Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
                   ` (16 more replies)
  0 siblings, 17 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was previously introduced by David Rientjes, and thanks to
everyone for your patience while I prepared these patches resulting from
that discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages. The default gfp
        flags used will be the same as those used at-fault for the VMA
        region(s) covered. When multiple VMA regions are spanned, if
        faulting-in memory from any VMA would permit synchronous
        compaction and reclaim, then all hugepage allocations required
        to satisfy the request may enter compaction and reclaim.
        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.
        Two flags are defined to control collapse semantics; they are
        passed through process_madvise(2)’s optional flags parameter:

        MADV_F_COLLAPSE_LIMITS

        If supplied, the collapse respects the pte collapse limits set
        via sysfs:
        /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
        This flag is required when calling on behalf of another process
        without CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

        If supplied, permit synchronous compaction and reclaim,
        regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags
        passed; pte collapse limits are ignored, and the gfp flags will
        be the same as those used at-fault for the VMA region(s)
        covered. Note that users wanting different collapse semantics
        can always use process_madvise(2) on themselves; a minimal
        usage sketch of the proposed interface follows.
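
For concreteness, here is a minimal userspace sketch of how the
proposed interface could be exercised. MADV_COLLAPSE and
MADV_F_COLLAPSE_LIMITS are introduced by this series and are not in
released uapi headers, so the values below are placeholders for
illustration only; the sketch also assumes a libc that defines
SYS_pidfd_open and SYS_process_madvise.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE		25		/* placeholder value */
#endif
#ifndef MADV_F_COLLAPSE_LIMITS
#define MADV_F_COLLAPSE_LIMITS	(1U << 0)	/* placeholder value */
#endif

int main(void)
{
	size_t len = 16UL << 20;	/* 16 MiB of private anon memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;

	if (buf == MAP_FAILED)
		return 1;

	/* Fault in native pages so there is something to collapse. */
	for (i = 0; i < len; i += 4096)
		buf[i] = 1;

	/* madvise(2) form: pte collapse limits ignored, at-fault gfp. */
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	/*
	 * process_madvise(2) form (here targeting ourselves), with
	 * MADV_F_COLLAPSE_LIMITS so the khugepaged max_ptes_* limits
	 * are respected.
	 */
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	if (pidfd < 0 ||
	    syscall(SYS_process_madvise, pidfd, &iov, 1UL, MADV_COLLAPSE,
		    MADV_F_COLLAPSE_LIMITS) < 0)
		perror("process_madvise(MADV_COLLAPSE)");
	return 0;
}

The second call shows the flags usage that a non-CAP_SYS_ADMIN caller
acting on another process would be required to use.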

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory with hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation is to limit contention on
mmap_lock by performing multiple page table modifications while the
lock is held exclusively.
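
Schematically, and only as a sketch of the intent rather than the
actual patch code (collapse_one_pmd_stub() is a hypothetical stand-in
for the per-pmd work added in patches 12 and 13), the idea is to take
mmap_lock in write mode once per batch of pmds instead of once per
collapsed pmd:

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Hypothetical stand-in for the per-pmd collapse work (patches 12-13). */
static void collapse_one_pmd_stub(struct mm_struct *mm, unsigned long addr)
{
}

/* Sketch: one exclusive mmap_lock acquisition covers the whole batch. */
static void collapse_batch_sketch(struct mm_struct *mm, unsigned long start,
				  int nr_pmds)
{
	unsigned long addr = start;
	int i;

	mmap_write_lock(mm);
	for (i = 0; i < nr_pmds; i++, addr += HPAGE_PMD_SIZE)
		collapse_one_pmd_stub(mm, addr);
	mmap_write_unlock(mm);
}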

Only private anonymous memory is supported by this series. File-backed
memory support will be added later.

Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
was not considered at this time, but could be added via the flags
parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/

Sequence of Patches
--------------------------------

Patches 1-10 refactor collapse logic within khugepaged.c, introducing
the notion of a collapse context and isolating logic that can be reused
later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 add the core collapse logic: the former introduces the overall
batched approach and locking strategy, and the latter fills in the
batch action details; this split is purely to keep patch size down.
Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)

-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 22:48   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control Zach O'Keefe
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Later in the series, we want to find a pmd and take different actions,
depending on if the pmd maps a thp or not.  Currently, mm_find_pmd()
returns NULL if a valid pmd maps a thp, and so we can't use it directly.

Split mm_find_pmd() into 2 parts: mm_find_pmd_raw(), which returns a
raw pmd pointer, and the logic that filters out pmds that are
non-present, none, or huge.  mm_find_pmd_raw() can then be reused later
in the series.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/internal.h |  1 +
 mm/rmap.c     | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 86277d90a5e2..aaea25bb9096 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -166,6 +166,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/rmap.c:
  */
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 70375c331083..0ae99affcb27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -758,13 +758,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
-pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -779,6 +778,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+out:
+	return pmd;
+}
+
+pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pmd_t pmde;
+	pmd_t *pmd;
+
+	pmd = mm_find_pmd_raw(mm, address);
+	if (!pmd)
+		goto out;
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 22:53   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 03/14] mm/khugepaged: add __do_collapse_huge_page() helper Zach O'Keefe
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Modularize huge page collapse by introducing struct collapse_control.
This structure serves to describe the properties of the requested
collapse, as well as serve as a local scratch pad to use during the
collapse itself.

Later in the series when we introduce the madvise collapse context, we
will want to be able to ignore khugepaged_max_ptes_[none|swap|shared]
in said context, so whether to enforce these limits is included here as
a property of the requested collapse.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 120 ++++++++++++++++++++++++++++++------------------
 1 file changed, 76 insertions(+), 44 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..36fc0099c445 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,6 +85,24 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 
 #define MAX_PTE_MAPPED_THP 8
 
+struct collapse_control {
+	/* Respect khugepaged_max_ptes_[none|swap|shared] */
+	bool enforce_pte_scan_limits;
+
+	/* Num pages scanned per node */
+	int node_load[MAX_NUMNODES];
+
+	/* Last target selected in khugepaged_find_target_node() for this scan */
+	int last_target_node;
+};
+
+static void collapse_control_init(struct collapse_control *cc,
+				  bool enforce_pte_scan_limits)
+{
+	cc->enforce_pte_scan_limits = enforce_pte_scan_limits;
+	cc->last_target_node = NUMA_NO_NODE;
+}
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -601,6 +619,7 @@ static bool is_refcount_suitable(struct page *page)
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
+					bool enforce_pte_scan_limits,
 					struct list_head *compound_pagelist)
 {
 	struct page *page = NULL;
@@ -614,7 +633,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -634,8 +654,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 
-		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+		if (page_mapcount(page) > 1 && enforce_pte_scan_limits &&
+		    ++shared > khugepaged_max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out;
@@ -785,9 +805,7 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -799,11 +817,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (cc->node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!cc->node_load[i])
 			continue;
 		if (node_distance(nid, i) > node_reclaim_distance)
 			return true;
@@ -818,28 +836,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
-	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (cc->node_load[nid] > max_value) {
+			max_value = cc->node_load[nid];
 			target_node = nid;
 		}
 
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= last_khugepaged_target_node)
-		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
-				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+	if (target_node <= cc->last_target_node)
+		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
+		     nid++) {
+			if (max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}
+		}
 
-	last_khugepaged_target_node = target_node;
+	cc->last_target_node = target_node;
 	return target_node;
 }
 
@@ -877,7 +895,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -1043,7 +1061,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 static void collapse_huge_page(struct mm_struct *mm,
 				   unsigned long address,
 				   struct page **hpage,
-				   int node, int referenced, int unmapped)
+				   int node, int referenced, int unmapped,
+				   int enforce_pte_scan_limits)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1141,7 +1160,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte,
-			&compound_pagelist);
+			enforce_pte_scan_limits, &compound_pagelist);
 	spin_unlock(pte_ptl);
 
 	if (unlikely(!isolated)) {
@@ -1206,7 +1225,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage)
+			       struct page **hpage,
+			       struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1226,13 +1246,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap) {
+			if (++unmapped <= khugepaged_max_ptes_swap ||
+			    !cc->enforce_pte_scan_limits) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries.  Please see
@@ -1251,7 +1272,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    ++none_or_zero <= khugepaged_max_ptes_none) {
+			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1282,7 +1304,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		}
 
 		if (page_mapcount(page) > 1 &&
-				++shared > khugepaged_max_ptes_shared) {
+				++shared > khugepaged_max_ptes_shared &&
+				cc->enforce_pte_scan_limits) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
@@ -1292,16 +1315,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to cc->node_load[].
 		 * Khugepaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
 			goto out_unmap;
@@ -1352,10 +1375,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped);
+				referenced, unmapped,
+				cc->enforce_pte_scan_limits);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
@@ -1992,7 +2016,8 @@ static void collapse_file(struct mm_struct *mm,
 }
 
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2003,14 +2028,15 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 
 	present = 0;
 	swap = 0;
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(cc->node_load, 0, sizeof(cc->node_load));
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
 			continue;
 
 		if (xa_is_value(page)) {
-			if (++swap > khugepaged_max_ptes_swap) {
+			if (cc->enforce_pte_scan_limits &&
+			    ++swap > khugepaged_max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2028,11 +2054,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 		}
 
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
-		khugepaged_node_load[node]++;
+		cc->node_load[node]++;
 
 		if (!PageLRU(page)) {
 			result = SCAN_PAGE_LRU;
@@ -2061,11 +2087,12 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
-		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
+		    cc->enforce_pte_scan_limits) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node();
+			node = khugepaged_find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2074,7 +2101,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 }
 #else
 static void khugepaged_scan_file(struct mm_struct *mm,
-		struct file *file, pgoff_t start, struct page **hpage)
+		struct file *file, pgoff_t start, struct page **hpage,
+		struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2085,7 +2113,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
+					    struct page **hpage,
+					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
@@ -2161,12 +2190,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 				mmap_read_unlock(mm);
 				ret = 1;
-				khugepaged_scan_file(mm, file, pgoff, hpage);
+				khugepaged_scan_file(mm, file, pgoff, hpage, cc);
 				fput(file);
 			} else {
 				ret = khugepaged_scan_pmd(mm, vma,
 						khugepaged_scan.address,
-						hpage);
+						hpage, cc);
 			}
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2222,7 +2251,7 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+static void khugepaged_do_scan(struct collapse_control *cc)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
@@ -2246,7 +2275,7 @@ static void khugepaged_do_scan(void)
 		if (khugepaged_has_work() &&
 		    pass_through_head < 2)
 			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
+							    &hpage, cc);
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
@@ -2285,12 +2314,15 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	struct collapse_control cc;
+
+	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ 1);
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
+		khugepaged_do_scan(&cc);
 		khugepaged_wait_work();
 	}
 
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 03/14] mm/khugepaged: add __do_collapse_huge_page() helper
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 04/14] mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse Zach O'Keefe
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

collapse_huge_page() currently does: (1) possibly allocates a hugepage,
(2) charges the owning memcg, (3) swaps in swapped-out pages, (4) does
the actual collapse (copying of pages, installation of the huge pmd),
and (5) does some final memcg accounting in the error path.

Separate out (4) so that it can be reused by itself later in the series.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 178 +++++++++++++++++++++++++++---------------------
 1 file changed, 100 insertions(+), 78 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 36fc0099c445..e3399a451662 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1058,85 +1058,23 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 	return true;
 }
 
-static void collapse_huge_page(struct mm_struct *mm,
-				   unsigned long address,
-				   struct page **hpage,
-				   int node, int referenced, int unmapped,
-				   int enforce_pte_scan_limits)
-{
-	LIST_HEAD(compound_pagelist);
-	pmd_t *pmd, _pmd;
+static int __do_collapse_huge_page(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long address, pmd_t *pmd,
+				   struct page *new_page,
+				   int enforce_pte_scan_limits,
+				   int *isolated_out)
+{
+	pmd_t _pmd;
 	pte_t *pte;
 	pgtable_t pgtable;
-	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated = 0, result = 0;
-	struct vm_area_struct *vma;
+	int isolated = 0, result = SCAN_SUCCEED;
 	struct mmu_notifier_range range;
-	gfp_t gfp;
-
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
-
-	/*
-	 * Before allocating the hugepage, release the mmap_lock read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_lock during
-	 * that. We will recheck the vma after taking it again in write mode.
-	 */
-	mmap_read_unlock(mm);
-	new_page = khugepaged_alloc_page(hpage, gfp, node);
-	if (!new_page) {
-		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
-		goto out_nolock;
-	}
-
-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
-		result = SCAN_CGROUP_CHARGE_FAIL;
-		goto out_nolock;
-	}
-	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
-
-	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result) {
-		mmap_read_unlock(mm);
-		goto out_nolock;
-	}
-
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		result = SCAN_PMD_NULL;
-		mmap_read_unlock(mm);
-		goto out_nolock;
-	}
-
-	/*
-	 * __collapse_huge_page_swapin always returns with mmap_lock locked.
-	 * If it fails, we release mmap_lock and jump out_nolock.
-	 * Continuing to collapse causes inconsistency.
-	 */
-	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
-						     pmd, referenced)) {
-		mmap_read_unlock(mm);
-		goto out_nolock;
-	}
+	LIST_HEAD(compound_pagelist);
 
-	mmap_read_unlock(mm);
-	/*
-	 * Prevent all access to pagetables with the exception of
-	 * gup_fast later handled by the ptep_clear_flush and the VM
-	 * handled by the anon_vma lock + PG_lock.
-	 */
-	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
-	if (result)
-		goto out_up_write;
-	/* check if the pmd is still valid */
-	if (mm_find_pmd(mm, address) != pmd)
-		goto out_up_write;
+	VM_BUG_ON(!new_page);
+	mmap_assert_write_locked(mm);
 
 	anon_vma_lock_write(vma->anon_vma);
 
@@ -1176,7 +1114,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
 		result = SCAN_FAIL;
-		goto out_up_write;
+		goto out;
 	}
 
 	/*
@@ -1208,11 +1146,95 @@ static void collapse_huge_page(struct mm_struct *mm,
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
+out:
+	if (isolated_out)
+		*isolated_out = isolated;
+	return result;
+}
 
-	*hpage = NULL;
 
-	khugepaged_pages_collapsed++;
-	result = SCAN_SUCCEED;
+static void collapse_huge_page(struct mm_struct *mm,
+			       unsigned long address,
+			       struct page **hpage,
+			       int node, int referenced, int unmapped,
+			       int enforce_pte_scan_limits)
+{
+	pmd_t *pmd;
+	struct page *new_page;
+	int isolated = 0, result = 0;
+	struct vm_area_struct *vma;
+	gfp_t gfp;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	/* Only allocate from the target node */
+	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+
+	/*
+	 * Before allocating the hugepage, release the mmap_lock read lock.
+	 * The allocation can take potentially a long time if it involves
+	 * sync compaction, and we do not need to hold the mmap_lock during
+	 * that. We will recheck the vma after taking it again in write mode.
+	 */
+	mmap_read_unlock(mm);
+	new_page = khugepaged_alloc_page(hpage, gfp, node);
+	if (!new_page) {
+		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+		goto out_nolock;
+	}
+
+	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
+		result = SCAN_CGROUP_CHARGE_FAIL;
+		goto out_nolock;
+	}
+	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
+
+	mmap_read_lock(mm);
+	result = hugepage_vma_revalidate(mm, address, &vma);
+	if (result) {
+		mmap_read_unlock(mm);
+		goto out_nolock;
+	}
+
+	pmd = mm_find_pmd(mm, address);
+	if (!pmd) {
+		result = SCAN_PMD_NULL;
+		mmap_read_unlock(mm);
+		goto out_nolock;
+	}
+
+	/*
+	 * __collapse_huge_page_swapin always returns with mmap_lock locked.
+	 * If it fails, we release mmap_lock and jump out_nolock.
+	 * Continuing to collapse causes inconsistency.
+	 */
+	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
+						     pmd, referenced)) {
+		mmap_read_unlock(mm);
+		goto out_nolock;
+	}
+
+	mmap_read_unlock(mm);
+	/*
+	 * Prevent all access to pagetables with the exception of
+	 * gup_fast later handled by the ptep_clear_flush and the VM
+	 * handled by the anon_vma lock + PG_lock.
+	 */
+	mmap_write_lock(mm);
+
+	result = hugepage_vma_revalidate(mm, address, &vma);
+	if (result)
+		goto out_up_write;
+	/* check if the pmd is still valid */
+	if (mm_find_pmd(mm, address) != pmd)
+		goto out_up_write;
+
+	result = __do_collapse_huge_page(mm, vma, address, pmd, new_page,
+					 enforce_pte_scan_limits, &isolated);
+	if (result == SCAN_SUCCEED) {
+		*hpage = NULL;
+		khugepaged_pages_collapsed++;
+	}
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 04/14] mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 03/14] mm/khugepaged: add __do_collapse_huge_page() helper Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 05/14] mm/khugepaged: add mmap_assert_locked() checks to scan_pmd() Zach O'Keefe
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

khugepaged_scan_pmd() currently does: (1) scans the pmd to see if it's
suitable for collapse, then (2) does the collapse if the scan succeeds.

Separate out (1) so that it can be reused by itself later in the
series, and introduce a struct scan_pmd_result to gather data about the
scan.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 107 ++++++++++++++++++++++++++++++------------------
 1 file changed, 67 insertions(+), 40 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e3399a451662..b204bc1eefa7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1244,27 +1244,34 @@ static void collapse_huge_page(struct mm_struct *mm,
 	return;
 }
 
-static int khugepaged_scan_pmd(struct mm_struct *mm,
-			       struct vm_area_struct *vma,
-			       unsigned long address,
-			       struct page **hpage,
-			       struct collapse_control *cc)
+struct scan_pmd_result {
+	int result;
+	bool writable;
+	int referenced;
+	int unmapped;
+	int none_or_zero;
+	struct page *head;
+};
+
+static void scan_pmd(struct mm_struct *mm,
+		     struct vm_area_struct *vma,
+		     unsigned long address,
+		     struct collapse_control *cc,
+		     struct scan_pmd_result *scan_result)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, result = 0, referenced = 0;
-	int none_or_zero = 0, shared = 0;
+	int shared = 0;
 	struct page *page = NULL;
 	unsigned long _address;
 	spinlock_t *ptl;
-	int node = NUMA_NO_NODE, unmapped = 0;
-	bool writable = false;
+	int node = NUMA_NO_NODE;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
-		result = SCAN_PMD_NULL;
+		scan_result->result = SCAN_PMD_NULL;
 		goto out;
 	}
 
@@ -1274,7 +1281,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
-			if (++unmapped <= khugepaged_max_ptes_swap ||
+			if (++scan_result->unmapped <=
+				    khugepaged_max_ptes_swap ||
 			    !cc->enforce_pte_scan_limits) {
 				/*
 				 * Always be strict with uffd-wp
@@ -1282,23 +1290,24 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 				 * comment below for pte_uffd_wp().
 				 */
 				if (pte_swp_uffd_wp(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
+					scan_result->result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
 				continue;
 			} else {
-				result = SCAN_EXCEED_SWAP_PTE;
+				scan_result->result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
-			    (++none_or_zero <= khugepaged_max_ptes_none ||
+			    (++scan_result->none_or_zero <=
+			     khugepaged_max_ptes_none ||
 			     !cc->enforce_pte_scan_limits)) {
 				continue;
 			} else {
-				result = SCAN_EXCEED_NONE_PTE;
+				scan_result->result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
@@ -1313,22 +1322,22 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			 * userfault messages that falls outside of
 			 * the registered range.  So, just be simple.
 			 */
-			result = SCAN_PTE_UFFD_WP;
+			scan_result->result = SCAN_PTE_UFFD_WP;
 			goto out_unmap;
 		}
 		if (pte_write(pteval))
-			writable = true;
+			scan_result->writable = true;
 
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page)) {
-			result = SCAN_PAGE_NULL;
+			scan_result->result = SCAN_PAGE_NULL;
 			goto out_unmap;
 		}
 
 		if (page_mapcount(page) > 1 &&
 				++shared > khugepaged_max_ptes_shared &&
 				cc->enforce_pte_scan_limits) {
-			result = SCAN_EXCEED_SHARED_PTE;
+			scan_result->result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 			goto out_unmap;
 		}
@@ -1338,25 +1347,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
-		 * Khugepaged will allocate hugepage from the node has the max
+		 * Caller should allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
 		if (khugepaged_scan_abort(node, cc)) {
-			result = SCAN_SCAN_ABORT;
+			scan_result->result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
 		cc->node_load[node]++;
 		if (!PageLRU(page)) {
-			result = SCAN_PAGE_LRU;
+			scan_result->result = SCAN_PAGE_LRU;
 			goto out_unmap;
 		}
 		if (PageLocked(page)) {
-			result = SCAN_PAGE_LOCK;
+			scan_result->result = SCAN_PAGE_LOCK;
 			goto out_unmap;
 		}
 		if (!PageAnon(page)) {
-			result = SCAN_PAGE_ANON;
+			scan_result->result = SCAN_PAGE_ANON;
 			goto out_unmap;
 		}
 
@@ -1378,35 +1387,53 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		 * will be done again later the risk seems low.
 		 */
 		if (!is_refcount_suitable(page)) {
-			result = SCAN_PAGE_COUNT;
+			scan_result->result = SCAN_PAGE_COUNT;
 			goto out_unmap;
 		}
 		if (pte_young(pteval) ||
 		    page_is_young(page) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
-			referenced++;
+			scan_result->referenced++;
 	}
-	if (!writable) {
-		result = SCAN_PAGE_RO;
-	} else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) {
-		result = SCAN_LACK_REFERENCED_PAGE;
+	if (!scan_result->writable) {
+		scan_result->result = SCAN_PAGE_RO;
+	} else if (!scan_result->referenced ||
+		   (scan_result->unmapped &&
+		    scan_result->referenced < HPAGE_PMD_NR / 2)) {
+		scan_result->result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
-		result = SCAN_SUCCEED;
-		ret = 1;
+		scan_result->result = SCAN_SUCCEED;
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret) {
+out:
+	scan_result->head = page;
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long address,
+			       struct page **hpage,
+			       struct collapse_control *cc)
+{
+	int node;
+	struct scan_pmd_result scan_result = {};
+
+	scan_pmd(mm, vma, address, cc, &scan_result);
+	if (scan_result.result == SCAN_SUCCEED) {
 		node = khugepaged_find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, address, hpage, node,
-				referenced, unmapped,
-				cc->enforce_pte_scan_limits);
+		collapse_huge_page(mm, khugepaged_scan.address, hpage, node,
+				   scan_result.referenced, scan_result.unmapped,
+				   cc->enforce_pte_scan_limits);
 	}
-out:
-	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
-				     none_or_zero, result, unmapped);
-	return ret;
+
+	trace_mm_khugepaged_scan_pmd(mm, scan_result.head, scan_result.writable,
+				     scan_result.referenced,
+				     scan_result.none_or_zero,
+				     scan_result.result, scan_result.unmapped);
+
+	return scan_result.result == SCAN_SUCCEED;
 }
 
 static void collect_mm_slot(struct mm_slot *mm_slot)
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 05/14] mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 04/14] mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count() Zach O'Keefe
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

scan_pmd() requires mmap_lock held in read. Add a lockdep assertion to
guard this condition, as scan_pmd() will be called from other contexts
later in the series.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b204bc1eefa7..56f2ef7146c7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1253,6 +1253,7 @@ struct scan_pmd_result {
 	struct page *head;
 };
 
+/* Called with mmap_lock held and does not drop it. */
 static void scan_pmd(struct mm_struct *mm,
 		     struct vm_area_struct *vma,
 		     unsigned long address,
@@ -1267,6 +1268,7 @@ static void scan_pmd(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE;
 
+	mmap_assert_locked(mm);
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	pmd = mm_find_pmd(mm, address);
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 05/14] mm/khugepaged: add mmap_assert_locked() checks to scan_pmd() Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 23:15   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

The madvise collapse context operates on pmds in batch. We will want to
be able to revalidate a region that spans multiple pmds in the same
vma.

Add hugepage_vma_revalidate_pmd_count() which extends
hugepage_vma_revalidate() with number of pmds to revalidate.
hugepage_vma_revalidate() now calls through this.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 56f2ef7146c7..1d20be47bcea 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -964,18 +964,17 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 #endif
 
 /*
- * If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
- * Return 0 if succeeds, otherwise return none-zero
- * value (scan code).
+ * Revalidate a vma's eligibility to collapse nr hugepages.
  */
-
-static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		struct vm_area_struct **vmap)
+static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
+					     unsigned long address, int nr,
+					     struct vm_area_struct **vmap)
 {
 	struct vm_area_struct *vma;
 	unsigned long hstart, hend;
 
+	mmap_assert_locked(mm);
+
 	if (unlikely(khugepaged_test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
@@ -985,7 +984,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
 	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
 	hend = vma->vm_end & HPAGE_PMD_MASK;
-	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
+	if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
 		return SCAN_ADDRESS_RANGE;
 	if (!hugepage_vma_check(vma, vma->vm_flags))
 		return SCAN_VMA_CHECK;
@@ -995,6 +994,17 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return 0;
 }
 
+/*
+ * If mmap_lock temporarily dropped, revalidate vma before taking mmap_lock.
+ * Return 0 if succeeds, otherwise return none-zero value (scan code).
+ */
+
+static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
+				   struct vm_area_struct **vmap)
+{
+	return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count() Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 23:17   ` Yang Shi
  2022-03-10 15:56   ` David Hildenbrand
  2022-03-08 21:34 ` [RFC PATCH 08/14] mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled() Zach O'Keefe
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

In madvise collapse context, we optionally want to be able to ignore
advice from MADV_NOHUGEPAGE-marked regions.

Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
which can be used to ignore vma flags that would otherwise be checked
when considering thp eligibility.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1d20be47bcea..ecbd3fc41c80 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 #endif
 
 /*
- * Revalidate a vma's eligibility to collapse nr hugepages.
+ * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
+ * can be used to ignore certain vma_flags that would otherwise be checked -
+ * the principal example being VM_NOHUGEPAGE which is ignored in madvise
+ * collapse context.
  */
 static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
 					     unsigned long address, int nr,
+					     unsigned long vm_flags_ignore,
 					     struct vm_area_struct **vmap)
 {
 	struct vm_area_struct *vma;
@@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
 	hend = vma->vm_end & HPAGE_PMD_MASK;
 	if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
 		return SCAN_ADDRESS_RANGE;
-	if (!hugepage_vma_check(vma, vma->vm_flags))
+	if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
 		return SCAN_VMA_CHECK;
 	/* Anon VMA expected */
 	if (!vma->anon_vma || vma->vm_ops)
@@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
  */
 
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
+				   unsigned long vm_flags_ignore,
 				   struct vm_area_struct **vmap)
 {
-	return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
+	return hugepage_vma_revalidate_pmd_count(mm, address, 1,
+			vm_flags_ignore, vmap);
 }
 
 /*
@@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
 		if (ret & VM_FAULT_RETRY) {
 			mmap_read_lock(mm);
-			if (hugepage_vma_revalidate(mm, haddr, &vma)) {
+			if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
 				/* vma is no longer available, don't continue to swapin */
 				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 				return false;
@@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
 	if (result) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	mmap_write_lock(mm);
 
-	result = hugepage_vma_revalidate(mm, address, &vma);
+	result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
 	if (result)
 		goto out_up_write;
 	/* check if the pmd is still valid */
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 08/14] mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP Zach O'Keefe
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Later in the series, in madvise collapse context, we will want to
optionally ignore MADV_NOHUGEPAGE.  However, we'd also like to
standardize on __transparent_hugepage_enabled() for determining anon
thp eligibility.

Add a new argument to __transparent_hugepage_enabled() which represents
the vma flags to be used instead of those in vma->vm_flags for
VM_[NO]HUGEPAGE checks. That is, checks inside
__transparent_hugepage_enabled() that previously didn't care about
madvise settings, such as the dax and stack checks, are unaffected.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/linux/huge_mm.h | 14 ++++++++++----
 mm/huge_memory.c        |  2 +-
 mm/memory.c             |  6 ++++--
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..fd905b0b2c71 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -143,8 +143,13 @@ static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
 /*
  * to be used on vmas which are known to support THP.
  * Use transparent_hugepage_active otherwise
+ *
+ * madv_thp_vm_flags are used instead of vma->vm_flags for VM_NOHUGEPAGE
+ * and VM_HUGEPAGE. Principal use is ignoring VM_NOHUGEPAGE when in madvise
+ * collapse context.
  */
-static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma,
+						  unsigned long madv_thp_vm_flags)
 {
 
 	/*
@@ -153,7 +158,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
 		return false;
 
-	if (!transhuge_vma_enabled(vma, vma->vm_flags))
+	if (!transhuge_vma_enabled(vma, madv_thp_vm_flags))
 		return false;
 
 	if (vma_is_temporary_stack(vma))
@@ -167,7 +172,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 
 	if (transparent_hugepage_flags &
 				(1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
-		return !!(vma->vm_flags & VM_HUGEPAGE);
+		return !!(madv_thp_vm_flags & VM_HUGEPAGE);
 
 	return false;
 }
@@ -316,7 +321,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return false;
 }
 
-static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma,
+						  unsigned long madv_thp_vm_flags)
 {
 	return false;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3557aabe86fe..25b7590b9846 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -83,7 +83,7 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
 	if (!transhuge_vma_suitable(vma, addr))
 		return false;
 	if (vma_is_anonymous(vma))
-		return __transparent_hugepage_enabled(vma);
+		return __transparent_hugepage_enabled(vma, vma->vm_flags);
 	if (vma_is_shmem(vma))
 		return shmem_huge_enabled(vma);
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
diff --git a/mm/memory.c b/mm/memory.c
index 4499cf09c21f..a6f2a8a20329 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4695,7 +4695,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
 retry_pud:
-	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
+	if (pud_none(*vmf.pud) &&
+	    __transparent_hugepage_enabled(vma, vma->vm_flags)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -4726,7 +4727,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (pud_trans_unstable(vmf.pud))
 		goto retry_pud;
 
-	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
+	if (pmd_none(*vmf.pmd) &&
+	    __transparent_hugepage_enabled(vma, vma->vm_flags)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.35.1.616.g0bdcbb4464-goog



* [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 08/14] mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled() Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 23:40   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 10/14] mm/khugepaged: rename khugepaged-specific/not functions Zach O'Keefe
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PAGE_COMPOUND if the pmd already maps a thp. This is consistent
with handling when scanning file-backed memory.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ecbd3fc41c80..403578161a3b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,38 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 			vm_flags_ignore, vmap);
 }
 
+/*
+ * If returning NULL (meaning the pmd isn't mapped, isn't present, or thp),
+ * write the reason to *result.
+ */
+static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
+				      unsigned long address,
+				      int *result)
+{
+	pmd_t *pmd = mm_find_pmd_raw(mm, address);
+	pmd_t pmde;
+
+	if (!pmd) {
+		*result = SCAN_PMD_NULL;
+		return NULL;
+	}
+
+	pmde = pmd_read_atomic(pmd);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
+	barrier();
+#endif
+	if (!pmd_present(pmde) || !pmd_none(pmde)) {
+		*result = SCAN_PMD_NULL;
+		return NULL;
+	} else if (pmd_trans_huge(pmde)) {
+		*result = SCAN_PAGE_COMPOUND;
+		return NULL;
+	}
+	return pmd;
+}
+
 /*
  * Bring missing pages in from swap, to complete THP collapse.
  * Only done if khugepaged_scan_pmd believes it is worthwhile.
@@ -1212,9 +1244,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 		goto out_nolock;
 	}
 
-	pmd = mm_find_pmd(mm, address);
+	pmd = find_pmd_or_thp_or_none(mm, address, &result);
 	if (!pmd) {
-		result = SCAN_PMD_NULL;
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -1287,11 +1318,9 @@ static void scan_pmd(struct mm_struct *mm,
 	mmap_assert_locked(mm);
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd) {
-		scan_result->result = SCAN_PMD_NULL;
+	pmd = find_pmd_or_thp_or_none(mm, address, &scan_result->result);
+	if (!pmd)
 		goto out;
-	}
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 10/14] mm/khugepaged: rename khugepaged-specific/not functions
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

In preparation for introducing a new collapse context, rename functions
to make clear which are khugepaged-specific and which are shared between
collapse contexts.  There is no functional change here.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 50 +++++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 403578161a3b..12ae765c5c32 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,7 +92,7 @@ struct collapse_control {
 	/* Num pages scanned per node */
 	int node_load[MAX_NUMNODES];
 
-	/* Last target selected in khugepaged_find_target_node() for this scan */
+	/* Last target selected in find_target_node() for this scan */
 	int last_target_node;
 };
 
@@ -452,7 +452,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
 }
 
-static inline int khugepaged_test_exit(struct mm_struct *mm)
+static inline int test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
@@ -501,7 +501,7 @@ int __khugepaged_enter(struct mm_struct *mm)
 		return -ENOMEM;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+	VM_BUG_ON_MM(test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -565,7 +565,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
+		 * test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we
 		 * return all pagetables will be destroyed) until
 		 * khugepaged has finished working on the pagetables
@@ -836,7 +836,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -895,7 +895,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 	return *hpage;
 }
 #else
-static int khugepaged_find_target_node(struct collapse_control *cc)
+static int find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -979,7 +979,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
 
 	mmap_assert_locked(mm);
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -1201,11 +1201,11 @@ static int __do_collapse_huge_page(struct mm_struct *mm,
 }
 
 
-static void collapse_huge_page(struct mm_struct *mm,
-			       unsigned long address,
-			       struct page **hpage,
-			       int node, int referenced, int unmapped,
-			       int enforce_pte_scan_limits)
+static void khugepaged_collapse_huge_page(struct mm_struct *mm,
+					  unsigned long address,
+					  struct page **hpage,
+					  int node, int referenced, int unmapped,
+					  int enforce_pte_scan_limits)
 {
 	pmd_t *pmd;
 	struct page *new_page;
@@ -1468,11 +1468,13 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 
 	scan_pmd(mm, vma, address, cc, &scan_result);
 	if (scan_result.result == SCAN_SUCCEED) {
-		node = khugepaged_find_target_node(cc);
+		node = find_target_node(cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		collapse_huge_page(mm, khugepaged_scan.address, hpage, node,
-				   scan_result.referenced, scan_result.unmapped,
-				   cc->enforce_pte_scan_limits);
+		khugepaged_collapse_huge_page(mm, khugepaged_scan.address,
+					      hpage, node,
+					      scan_result.referenced,
+					      scan_result.unmapped,
+					      cc->enforce_pte_scan_limits);
 	}
 
 	trace_mm_khugepaged_scan_pmd(mm, scan_result.head, scan_result.writable,
@@ -1489,7 +1491,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (khugepaged_test_exit(mm)) {
+	if (test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
@@ -1656,7 +1658,7 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (!mmap_write_trylock(mm))
 		return;
 
-	if (unlikely(khugepaged_test_exit(mm)))
+	if (unlikely(test_exit(mm)))
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
@@ -1711,7 +1713,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (mmap_write_trylock(mm)) {
-			if (!khugepaged_test_exit(mm))
+			if (!test_exit(mm))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
@@ -2188,7 +2190,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			node = khugepaged_find_target_node(cc);
+			node = find_target_node(cc);
 			collapse_file(mm, file, start, hpage, node);
 		}
 	}
@@ -2241,7 +2243,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	vma = NULL;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
-	if (likely(!khugepaged_test_exit(mm)))
+	if (likely(!test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
 
 	progress++;
@@ -2249,7 +2251,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
+		if (unlikely(test_exit(mm))) {
 			progress++;
 			break;
 		}
@@ -2273,7 +2275,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		while (khugepaged_scan.address < hend) {
 			int ret;
 			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
+			if (unlikely(test_exit(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2313,7 +2315,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (khugepaged_test_exit(mm) || !vma) {
+	if (test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (9 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 10/14] mm/khugepaged: rename khugepaged-specific/not functions Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-09 23:43   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse Zach O'Keefe
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

The idea of hugepage collapse in process context was previously
introduced by David Rientjes to linux-mm[1].

This series introduces a new madvise mode, MADV_COLLAPSE, that allows
users to request a synchronous collapse of memory into transparent
hugepages.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* avoid unpredictable timing of khugepaged collapse
* flexible separation of sync userspace and async khugepaged THP collapse
  policies

Immediate users of this new functionality include:

* malloc implementations that manage memory in hugepage-sized chunks,
  but sometimes subrelease memory back to the system in native-sized
  chunks via MADV_DONTNEED, zapping the pmd.  Later, when the memory
  is hot, the implementation could madvise(MADV_COLLAPSE) to re-back
  the memory with THPs and regain TLB performance (see the sketch after
  this list).
* immediately back executable text by hugepages.  Current support
  provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
  system.
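
As a rough illustration of the malloc use case above, the sketch below
shows the intended flow from userspace.  It assumes a 2MiB PMD size, a
kernel carrying this series, and the MADV_COLLAPSE value proposed here
(25), which is not in any released uapi header, so it is defined
locally:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE	25		/* value proposed by this series */
#endif

#define HPAGE_SIZE	(2UL << 20)	/* assumed PMD-sized THP */

int main(void)
{
	/* Over-allocate so a hugepage-aligned chunk can be carved out. */
	char *map = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (map == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	memset(buf, 1, HPAGE_SIZE);	/* hot; may be faulted in as a THP */

	/* The allocator subreleases part of the chunk, zapping the pmd. */
	madvise(buf, HPAGE_SIZE / 2, MADV_DONTNEED);

	memset(buf, 1, HPAGE_SIZE / 2);	/* the memory becomes hot again... */

	/* ...so synchronously re-back the whole chunk with a THP. */
	if (madvise(buf, HPAGE_SIZE, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");

	return 0;
}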

To keep patches digestible, introduce MADV_COLLAPSE in a few stages.

Add plumbing to existing madvise infrastructure, as well as populate
uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
stubbed out.  Only privately-mapped anon memory is supported for now.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/linux/huge_mm.h                | 12 +++++++
 include/uapi/asm-generic/mman-common.h |  2 ++
 mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
 mm/madvise.c                           |  5 +++
 4 files changed, 65 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fd905b0b2c71..407b63ab4185 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 12ae765c5c32..ca1e523086ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+/*
+ * Returns 0 if successfully able to collapse range into THPs (or range already
+ * backed by THPs). Due to implementation detail, THPs collapsed here may be
+ * split again before this function returns.
+ */
+static int _madvise_collapse(struct mm_struct *mm,
+			     struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start,
+			     unsigned long end, gfp_t gfp,
+			     struct collapse_control *cc)
+{
+	/* Implemented in later patch */
+	return -ENOSYS;
+}
+
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev, unsigned long start,
+		     unsigned long end)
+{
+	struct collapse_control cc;
+	gfp_t gfp;
+	int error;
+	struct mm_struct *mm = vma->vm_mm;
+
+	/* Requested to hold mmap_lock in read */
+	mmap_assert_locked(mm);
+
+	mmgrab(mm);
+	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
+	gfp = vma_thp_gfp_mask(vma);
+	lru_add_drain(); /* lru_add_drain_all() too heavy here */
+	error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+
+	/*
+	 * madvise() returns EAGAIN if kernel resources are temporarily
+	 * unavailable.
+	 */
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+
+	return error;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 5b6d796e55de..292aa017c150 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (10 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-10  0:06   ` Yang Shi
  2022-03-08 21:34 ` [RFC PATCH 13/14] mm/madvise: add __madvise_collapse_*_batch() actions Zach O'Keefe
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Introduce the main madvise collapse batched logic, including the overall
locking strategy.  The individual batched actions, such as scanning
pmds in batch, are stubbed out here and will be implemented later in
the series.

Note that the main benefit of doing all this work in a batched manner
is that __madvise_collapse_pmd_batch() (stubbed out) can be called
under a single mmap_lock write acquisition.

Per-batch data is stored in a struct madvise_collapse_data array, with
an entry for each pmd to collapse, and is shared between the various
*_batch actions.  This allows for partial success of collapsing a range
of pmds - we continue as long as some pmds can be successfully
collapsed.

A "success" here, is where all pmds can be (or already are) collapsed.
On failure, the caller will need to verify what, if any, partial
successes occurred via smaps or otherwise.

Also note that, where possible, if collapse fails for a particular pmd
after a hugepage has already been allocated, that hugepage is kept on a
per-node free list to back subsequent pmd collapses.  All unused
hugepages are returned before _madvise_collapse() returns.

Note that bisecting at this patch won't break anything;
madvise(MADV_COLLAPSE) still always returns -1.
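
To make the "verify via smaps" step above concrete, here is a rough,
hypothetical sketch: it sums the standard AnonHugePages field of
/proc/self/smaps over the VMAs overlapping a given range (the helper
name is made up, and error handling is minimal).  Comparing the value
before and after madvise(MADV_COLLAPSE) gives an estimate of how much
of the range ended up THP-backed:

#include <inttypes.h>
#include <stdio.h>

/* Sum AnonHugePages (in kB) over the VMAs overlapping [start, end). */
static long anon_huge_kb(uintptr_t start, uintptr_t end)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	uintptr_t lo = 0, hi = 0, a, b;
	long total_kb = 0, kb;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* VMA header lines look like "<start>-<end> rw-p ..." */
		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " ", &a, &b) == 2) {
			lo = a;
			hi = b;
			continue;
		}
		/* Attribute the field to the most recently seen VMA. */
		if (lo < end && hi > start &&
		    sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
			total_kb += kb;
	}
	fclose(f);
	return total_kb;
}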

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 279 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 273 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ca1e523086ed..ea53c706602e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -86,6 +86,9 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
 #define MAX_PTE_MAPPED_THP 8
 
 struct collapse_control {
+	/* Used by MADV_COLLAPSE batch collapse */
+	struct list_head free_hpages[MAX_NUMNODES];
+
 	/* Respect khugepaged_max_ptes_[none|swap|shared] */
 	bool enforce_pte_scan_limits;
 
@@ -99,8 +102,13 @@ struct collapse_control {
 static void collapse_control_init(struct collapse_control *cc,
 				  bool enforce_pte_scan_limits)
 {
+	int i;
+
 	cc->enforce_pte_scan_limits = enforce_pte_scan_limits;
 	cc->last_target_node = NUMA_NO_NODE;
+
+	for (i = 0; i < MAX_NUMNODES; ++i)
+		INIT_LIST_HEAD(cc->free_hpages + i);
 }
 
 /**
@@ -1033,7 +1041,7 @@ static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
 	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
 	barrier();
 #endif
-	if (!pmd_present(pmde) || !pmd_none(pmde)) {
+	if (!pmd_present(pmde) || pmd_none(pmde)) {
 		*result = SCAN_PMD_NULL;
 		return NULL;
 	} else if (pmd_trans_huge(pmde)) {
@@ -1054,12 +1062,16 @@ static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
 static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					int referenced)
+					int referenced,
+					unsigned long vm_flags_ignored,
+					bool *mmap_lock_dropped)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
 	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
 
+	if (mmap_lock_dropped)
+		*mmap_lock_dropped = false;
 	for (address = haddr; address < end; address += PAGE_SIZE) {
 		struct vm_fault vmf = {
 			.vma = vma,
@@ -1080,8 +1092,10 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 
 		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
 		if (ret & VM_FAULT_RETRY) {
+			if (mmap_lock_dropped)
+				*mmap_lock_dropped = true;
 			mmap_read_lock(mm);
-			if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
+			if (hugepage_vma_revalidate(mm, haddr, vm_flags_ignored, &vma)) {
 				/* vma is no longer available, don't continue to swapin */
 				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 				return false;
@@ -1256,7 +1270,8 @@ static void khugepaged_collapse_huge_page(struct mm_struct *mm,
 	 * Continuing to collapse causes inconsistency.
 	 */
 	if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
-						     pmd, referenced)) {
+						     pmd, referenced, VM_NONE,
+						     NULL)) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
@@ -2520,6 +2535,128 @@ void khugepaged_min_free_kbytes_update(void)
 	mutex_unlock(&khugepaged_mutex);
 }
 
+struct madvise_collapse_data {
+	struct page *hpage; /* Preallocated THP */
+	bool continue_collapse;  /* Should we attempt / continue collapse? */
+
+	struct scan_pmd_result scan_result;
+	pmd_t *pmd;
+};
+
+static int
+madvise_collapse_vma_revalidate_pmd_count(struct mm_struct *mm,
+					  unsigned long address, int nr,
+					  struct vm_area_struct **vmap)
+{
+	/* madvise_collapse() ignores MADV_NOHUGEPAGE */
+	return hugepage_vma_revalidate_pmd_count(mm, address, nr, VM_NOHUGEPAGE,
+			vmap);
+}
+
+/*
+ * Scan pmd to see which we can collapse, and to determine node to allocate on.
+ *
+ * Must be called with mmap_lock in read, and returns with the lock held in
+ * read. Does not drop the lock.
+ *
+ * Set batch_data[i]->continue_collapse to false for any pmd that can't be
+ * collapsed.
+ *
+ * Return the number of existing THPs in batch.
+ */
+static int
+__madvise_collapse_scan_pmd_batch(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long batch_start,
+				  struct madvise_collapse_data *batch_data,
+				  int batch_size,
+				  struct collapse_control *cc)
+{
+	/* Implemented in later patch */
+	return 0;
+}
+
+/*
+ * Preallocate and charge huge page for each pmd in the batch, store the
+ * new page in batch_data[i]->hpage.
+ *
+ * Return the number of huge pages allocated.
+ */
+static int
+__madvise_collapse_prealloc_hpages_batch(struct mm_struct *mm,
+					 gfp_t gfp,
+					 int node,
+					 struct madvise_collapse_data *batch_data,
+					 int batch_size,
+					 struct collapse_control *cc)
+{
+	/* Implemented in later patch */
+	return 0;
+}
+
+/*
+ * Do swapin for all ranges in batch, returns true iff successful.
+ *
+ * Called with mmap_lock held in read, and returns with it held in read.
+ * Might drop the lock.
+ *
+ * Set batch_data[i]->continue_collapse to false for any pmd that can't be
+ * collapsed. Else, set batch_data[i]->pmd to the found pmd.
+ */
+static bool
+__madvise_collapse_swapin_pmd_batch(struct mm_struct *mm,
+				    int node,
+				    unsigned long batch_start,
+				    struct madvise_collapse_data *batch_data,
+				    int batch_size,
+				    struct collapse_control *cc)
+
+{
+	/* Implemented in later patch */
+	return true;
+}
+
+/*
+ * Do the collapse operation. Return number of THPs collapsed successfully.
+ *
+ * Called with mmap_lock held in write, and returns with it held. Does not
+ * drop the lock.
+ */
+static int
+__madvise_collapse_pmd_batch(struct mm_struct *mm,
+			     unsigned long batch_start,
+			     int batch_size,
+			     struct madvise_collapse_data *batch_data,
+			     int node,
+			     struct collapse_control *cc)
+{
+	/* Implemented in later patch */
+	return 0;
+}
+
+static bool continue_collapse(struct madvise_collapse_data *batch_data,
+			      int batch_size)
+{
+	int i;
+
+	for (i = 0; i < batch_size; ++i)
+		if (batch_data[i].continue_collapse)
+			return true;
+	return false;
+}
+
+static bool madvise_transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (vma_is_anonymous(vma))
+		/* madvise_collapse() ignores MADV_NOHUGEPAGE */
+		return __transparent_hugepage_enabled(vma, vma->vm_flags &
+						      ~VM_NOHUGEPAGE);
+	/* TODO: Support file-backed memory */
+	return false;
+}
+
+#define MADVISE_COLLAPSE_BATCH_SIZE 8
+
 /*
  * Returns 0 if successfully able to collapse range into THPs (or range already
  * backed by THPs). Due to implementation detail, THPs collapsed here may be
@@ -2532,8 +2669,138 @@ static int _madvise_collapse(struct mm_struct *mm,
 			     unsigned long end, gfp_t gfp,
 			     struct collapse_control *cc)
 {
-	/* Implemented in later patch */
-	return -ENOSYS;
+	unsigned long hstart, hend, batch_addr;
+	int ret = -EINVAL, collapsed = 0, nr_hpages = 0, i;
+	struct madvise_collapse_data batch_data[MADVISE_COLLAPSE_BATCH_SIZE];
+
+	mmap_assert_locked(mm);
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+	VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+	if (hstart >= hend)
+		goto out;
+
+	if (!madvise_transparent_hugepage_enabled(vma))
+		goto out;
+
+	/*
+	 * Request might cover multiple hugepages. Strategy is to batch
+	 * allocation and collapse operations so that we do more work while
+	 * mmap_lock is held exclusively.
+	 *
+	 * While processing batch, mmap_lock is locked/unlocked many times for
+	 * the supplied VMA. It's possible that the original VMA is split while
+	 * lock was dropped. If in the context of the (possibly new) VMA, THP
+	 * collapse is possible, we continue.
+	 */
+	for (batch_addr = hstart;
+	     batch_addr < hend;
+	     batch_addr += HPAGE_PMD_SIZE * MADVISE_COLLAPSE_BATCH_SIZE) {
+		int node, batch_size;
+		int thps; /* Number of existing THPs in range */
+
+		batch_size = (hend - batch_addr) >> HPAGE_PMD_SHIFT;
+		batch_size = min_t(int, batch_size,
+				   MADVISE_COLLAPSE_BATCH_SIZE);
+
+		BUG_ON(batch_size <= 0);
+		memset(batch_data, 0, sizeof(batch_data));
+		cond_resched();
+		VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
+
+		/*
+		 * If first batch, we still hold mmap_lock from madvise
+		 * call and haven't dropped it since checking the VMA. Else,
+		 * we've dropped the lock and we need to revalidate.
+		 */
+		if (batch_addr != hstart) {
+			mmap_read_lock(mm);
+			if (madvise_collapse_vma_revalidate_pmd_count(mm,
+								      batch_addr,
+								      batch_size,
+								      &vma))
+				goto loop_unlock_break;
+		}
+
+		mmap_assert_locked(mm);
+
+		thps = __madvise_collapse_scan_pmd_batch(mm, vma, batch_addr,
+							 batch_data, batch_size,
+							 cc);
+		mmap_read_unlock(mm);
+
+		/* Count existing THPs as-if we collapsed them */
+		collapsed += thps;
+		if (thps == batch_size || !continue_collapse(batch_data,
+							     batch_size))
+			continue;
+
+		node = find_target_node(cc);
+		if (!__madvise_collapse_prealloc_hpages_batch(mm, gfp, node,
+							      batch_data,
+							      batch_size, cc)) {
+			/* No more THPs available - so give up */
+			ret = -ENOMEM;
+			break;
+		}
+
+		mmap_read_lock(mm);
+		if (!__madvise_collapse_swapin_pmd_batch(mm, node, batch_addr,
+							 batch_data, batch_size,
+							 cc))
+			goto loop_unlock_break;
+		mmap_read_unlock(mm);
+		mmap_write_lock(mm);
+		collapsed += __madvise_collapse_pmd_batch(mm,
+				batch_addr, batch_size, batch_data,
+				node, cc);
+		mmap_write_unlock(mm);
+
+		for (i = 0; i < batch_size; ++i) {
+			struct page *page = batch_data[i].hpage;
+
+			if (page && !IS_ERR(page)) {
+				list_add_tail(&page->lru,
+					      &cc->free_hpages[node]);
+				batch_data[i].hpage = NULL;
+			}
+		}
+		/* mmap_lock is unlocked here */
+		continue;
+loop_unlock_break:
+		mmap_read_unlock(mm);
+		break;
+	}
+	/* mmap_lock is unlocked here */
+
+	for (i = 0; i < MADVISE_COLLAPSE_BATCH_SIZE; ++i) {
+		struct page *page = batch_data[i].hpage;
+
+		if (page && !IS_ERR(page)) {
+			mem_cgroup_uncharge(page_folio(page));
+			put_page(page);
+		}
+	}
+	for (i = 0; i < MAX_NUMNODES; ++i) {
+		struct page *page, *tmp;
+
+		list_for_each_entry_safe(page, tmp, cc->free_hpages + i, lru) {
+			list_del(&page->lru);
+			mem_cgroup_uncharge(page_folio(page));
+			put_page(page);
+		}
+	}
+	ret = collapsed == nr_hpages ? 0 : -1;
+	vma = NULL;		/* tell sys_madvise we dropped mmap_lock */
+	mmap_read_lock(mm);	/* sys_madvise expects us to have mmap_lock */
+out:
+	*prev = vma;		/* we didn't drop mmap_lock, so this holds */
+
+	return ret;
 }
 
 int madvise_collapse(struct vm_area_struct *vma,
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 13/14] mm/madvise: add __madvise_collapse_*_batch() actions.
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (11 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-08 21:34 ` [RFC PATCH 14/14] mm/madvise: add process_madvise(MADV_COLLAPSE) Zach O'Keefe
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

Add implementations for the following batch actions:

scan_pmd:
	Iterate over batch and scan the pmd for eligibility. Note
	that this function is called with mmap_lock in read, and
	does not drop it before returning.

	If a batch entry fails, ->continue_collapse field of its
	madvise_collapse_data is set to 'false' so that later _batch
	actions know to ignore it.

	Return the number of THPs already in the batch, which is needed
	by _madvise_collapse() to determine overall "success" criteria
	(all pmds either collapsed successfully, or already THP-backed).

prealloc_hpages:
	Iterate over the batch and allocate / charge hugepages.  Before
	allocating a new page, check the local free hugepage list.
	Similarly, if charging the memcg fails after allocating a hugepage,
	save the hugepage on a local free list for future use.

swapin_pmd:
	Iterate over batch and attempt to swap-in pages that are
	currently swapped out.  Called with mmap_lock in read, and
	returns with it held; however, it might drop and reacquire the
	lock internally.

	Specifically, __collapse_huge_page_swapin() might drop and
	reacquire the mmap_lock.  When it does so, it only revalidates the
	vma/address for a single pmd.  Since we need to revalidate the
	vma for the entire region covered in the batch, we need to be
	notified when the lock is dropped so that we can perform the
	required revalidation. As such, add an argument to
	__collapse_huge_page_swapin() to notify the caller when mmap_lock is
	dropped.

collapse_pmd:
	Iterate over the batch and perform the actual collapse for each
	pmd.  Note that this is done while holding the mmap_lock in write for
	the entire batch action.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 mm/khugepaged.c | 153 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 145 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea53c706602e..e8156f15a3da 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2572,8 +2572,23 @@ __madvise_collapse_scan_pmd_batch(struct mm_struct *mm,
 				  int batch_size,
 				  struct collapse_control *cc)
 {
-	/* Implemented in later patch */
-	return 0;
+	unsigned long addr, i;
+	int thps = 0;
+
+	mmap_assert_locked(mm);
+
+	for (addr = batch_start, i = 0; i < batch_size;
+	     addr += HPAGE_PMD_SIZE, ++i) {
+		struct madvise_collapse_data *data = batch_data + i;
+
+		scan_pmd(mm, vma, addr, cc, &data->scan_result);
+		data->continue_collapse =
+				data->scan_result.result == SCAN_SUCCEED;
+		if (data->scan_result.result == SCAN_PAGE_COMPOUND)
+			++thps;
+	}
+	mmap_assert_locked(mm);
+	return thps;
 }
 
 /*
@@ -2590,8 +2605,39 @@ __madvise_collapse_prealloc_hpages_batch(struct mm_struct *mm,
 					 int batch_size,
 					 struct collapse_control *cc)
 {
-	/* Implemented in later patch */
-	return 0;
+	int nr_hpages = 0;
+	int i;
+
+	for (i = 0; i < batch_size; ++i) {
+		struct madvise_collapse_data *data = batch_data + i;
+
+		if (!data->continue_collapse)
+			continue;
+
+		if (!list_empty(&cc->free_hpages[node])) {
+			data->hpage  = list_first_entry(&cc->free_hpages[node],
+							struct page, lru);
+			list_del(&data->hpage->lru);
+		} else {
+			data->hpage = __alloc_pages_node(node, gfp,
+							 HPAGE_PMD_ORDER);
+			if (unlikely(!data->hpage))
+				break;
+
+			prep_transhuge_page(data->hpage);
+
+			if (unlikely(mem_cgroup_charge(page_folio(data->hpage),
+						       mm, gfp))) {
+				/* No use reusing page, so give it back */
+				put_page(data->hpage);
+				data->hpage = NULL;
+				data->continue_collapse = false;
+				break;
+			}
+		}
+		++nr_hpages;
+	}
+	return nr_hpages;
 }
 
 /*
@@ -2612,8 +2658,67 @@ __madvise_collapse_swapin_pmd_batch(struct mm_struct *mm,
 				    struct collapse_control *cc)
 
 {
-	/* Implemented in later patch */
-	return true;
+	unsigned long addr;
+	int i;
+	bool ret = true;
+
+	/*
+	 * This function is called with mmap_lock held, and returns with it
+	 * held. However, __collapse_huge_page_swapin() may internally drop and
+	 * reacquire the lock. When it does, it only revalidates the single pmd
+	 * provided to it. We need to know when it drops the lock so that we can
+	 * revalidate the batch of pmds we are operating on.
+	 *
+	 * Initially setting this to 'true' because the caller just locked
+	 * mmap_lock and so we need to revalidate before doing anything else.
+	 */
+	bool need_revalidate_pmd_count = true;
+
+	for (addr = batch_start, i = 0;
+	     i < batch_size;
+	     addr += HPAGE_PMD_SIZE, ++i) {
+		struct vm_area_struct *vma;
+		struct madvise_collapse_data *data = batch_data + i;
+
+		mmap_assert_locked(mm);
+
+		/*
+		 * We might have dropped the lock during previous iteration.
+		 * It's acceptable to exit this function without revalidating
+		 * the vma since the caller immediately unlocks mmap_lock
+		 * anyway.
+		 */
+		if (!data->continue_collapse)
+			continue;
+
+		if (need_revalidate_pmd_count) {
+			if (madvise_collapse_vma_revalidate_pmd_count(mm,
+								      batch_start,
+								      batch_size,
+								      &vma)) {
+				ret = false;
+				break;
+			}
+			need_revalidate_pmd_count = false;
+		}
+
+		data->pmd = mm_find_pmd(mm, addr);
+
+		if (!data->pmd ||
+		    (data->scan_result.unmapped &&
+		     !__collapse_huge_page_swapin(mm, vma, addr, data->pmd,
+						  VM_NOHUGEPAGE,
+						  data->scan_result.referenced,
+						  &need_revalidate_pmd_count))) {
+			/* Hold on to the THP until we know we don't need it. */
+			data->continue_collapse = false;
+			list_add_tail(&data->hpage->lru,
+				      &cc->free_hpages[node]);
+			data->hpage = NULL;
+		}
+	}
+	mmap_assert_locked(mm);
+	return ret;
 }
 
 /*
@@ -2630,8 +2735,40 @@ __madvise_collapse_pmd_batch(struct mm_struct *mm,
 			     int node,
 			     struct collapse_control *cc)
 {
-	/* Implemented in later patch */
-	return 0;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	int i, ret = 0;
+
+	mmap_assert_write_locked(mm);
+
+	if (madvise_collapse_vma_revalidate_pmd_count(mm, batch_start,
+						      batch_size, &vma))
+		goto out;
+
+	for (addr = batch_start, i = 0;
+	     i < batch_size;
+	     addr += HPAGE_PMD_SIZE, ++i) {
+		int result;
+		struct madvise_collapse_data *data = batch_data + i;
+
+		if (!data->continue_collapse ||
+		    (mm_find_pmd(mm, addr) != data->pmd))
+			continue;
+
+		result = __do_collapse_huge_page(mm, vma, addr, data->pmd,
+						 data->hpage,
+						 cc->enforce_pte_scan_limits,
+						 NULL);
+
+		if (result == SCAN_SUCCEED)
+			++ret;
+		else
+			list_add_tail(&data->hpage->lru,
+				      &cc->free_hpages[node]);
+		data->hpage = NULL;
+	}
+out:
+	return ret;
 }
 
 static bool continue_collapse(struct madvise_collapse_data *batch_data,
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 14/14] mm/madvise: add process_madvise(MADV_COLLAPSE)
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (12 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 13/14] mm/madvise: add __madvise_collapse_*_batch() actions Zach O'Keefe
@ 2022-03-08 21:34 ` Zach O'Keefe
  2022-03-21 14:32 ` [RFC PATCH 00/14] mm: userspace hugepage collapse Zi Yan
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-08 21:34 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi,
	Zach O'Keefe

This is the first madvise mode to make use of the process_madvise()
flags argument.

Add the necessary plumbing to make the flags available to the
do_madvise() handlers.

For MADV_COLLAPSE, the added flags are:

* MADV_F_COLLAPSE_LIMITS - controls if we should respect
			   khugepaged/max_ptes_* limits
			   (requires CAP_SYS_ADMIN if not acting on
			    self)
* MADV_F_COLLAPSE_DEFRAG - force enable defrag, despite vma or system
			   settings.

Together, these two flags give userspace the flexibility to define
separate policies for synchronous, userspace-directed collapse and
asynchronous kernel (khugepaged) collapse.
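
To sketch the intended usage: the (hypothetical) helper below asks the
kernel to collapse a range in another process, respecting the sysfs pte
limits and permitting compaction/reclaim.  It assumes a kernel carrying
this series, libc headers that expose SYS_process_madvise and
SYS_pidfd_open, and the MADV_COLLAPSE / MADV_F_COLLAPSE_* values
proposed here, which are not in any released uapi header:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE		25	/* value proposed by this series */
#endif
#define MADV_F_COLLAPSE_LIMITS	0x1	/* flags proposed by this series */
#define MADV_F_COLLAPSE_DEFRAG	0x2

/* Hypothetical helper: collapse [addr, addr + len) in process 'pid'. */
static int collapse_range(pid_t pid, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	long ret;

	if (pidfd < 0)
		return -1;

	/*
	 * Without CAP_SYS_ADMIN, acting on another process requires
	 * MADV_F_COLLAPSE_LIMITS; MADV_F_COLLAPSE_DEFRAG additionally
	 * permits synchronous compaction and reclaim.
	 */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL, MADV_COLLAPSE,
		      MADV_F_COLLAPSE_LIMITS | MADV_F_COLLAPSE_DEFRAG);
	if (ret < 0)
		perror("process_madvise(MADV_COLLAPSE)");
	close(pidfd);
	return ret < 0 ? -1 : 0;
}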

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 fs/io_uring.c                          |  3 +-
 include/linux/huge_mm.h                |  3 +-
 include/linux/mm.h                     |  3 +-
 include/uapi/asm-generic/mman-common.h |  8 +++++
 mm/khugepaged.c                        |  7 +++--
 mm/madvise.c                           | 42 ++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 23e7f93d3956..8558b7549431 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4720,7 +4720,8 @@ static int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
 	if (issue_flags & IO_URING_F_NONBLOCK)
 		return -EAGAIN;
 
-	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
+	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice,
+			 MADV_F_NONE);
 	if (ret < 0)
 		req_set_fail(req);
 	io_req_complete(req, ret);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 407b63ab4185..31f514ff36be 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -228,7 +228,8 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma,
 		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+		     unsigned long start, unsigned long end,
+		     unsigned int flags);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc69d2a69912..f4776f4cda48 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2690,7 +2690,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+		      int behavior, unsigned int flags);
 
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..b81f4b1b18ba 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,14 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+/* process_madvise() flags */
+#define MADV_F_NONE		0x0
+
+/* process_madvise(MADV_COLLAPSE) flags */
+#define MADV_F_COLLAPSE_LIMITS	0x1	/* respect system khugepaged/max_ptes_* sysfs limits */
+#define MADV_F_COLLAPSE_DEFRAG	0x2	/* force enable sync collapse + reclaim */
+#define MADV_F_COLLAPSE_MASK	(MADV_F_COLLAPSE_LIMITS | MADV_F_COLLAPSE_DEFRAG)
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e8156f15a3da..993de0c6eaa9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2942,7 +2942,7 @@ static int _madvise_collapse(struct mm_struct *mm,
 
 int madvise_collapse(struct vm_area_struct *vma,
 		     struct vm_area_struct **prev, unsigned long start,
-		     unsigned long end)
+		     unsigned long end, unsigned int flags)
 {
 	struct collapse_control cc;
 	gfp_t gfp;
@@ -2953,8 +2953,9 @@ int madvise_collapse(struct vm_area_struct *vma,
 	mmap_assert_locked(mm);
 
 	mmgrab(mm);
-	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
-	gfp = vma_thp_gfp_mask(vma);
+	collapse_control_init(&cc, flags & MADV_F_COLLAPSE_LIMITS);
+	gfp = vma_thp_gfp_mask(vma) | (flags & MADV_F_COLLAPSE_DEFRAG
+			? __GFP_DIRECT_RECLAIM : 0);
 	lru_add_drain(); /* lru_add_drain_all() too heavy here */
 	error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
 	mmap_assert_locked(mm);
diff --git a/mm/madvise.c b/mm/madvise.c
index 292aa017c150..7d094d86d2f1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -979,7 +979,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_vma_behavior(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end,
-				unsigned long behavior)
+				unsigned long behavior, unsigned int flags)
 {
 	int error;
 	struct anon_vma_name *anon_name;
@@ -1048,7 +1048,7 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 			goto out;
 		break;
 	case MADV_COLLAPSE:
-		return madvise_collapse(vma, prev, start, end);
+		return madvise_collapse(vma, prev, start, end, flags);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1160,13 +1160,19 @@ madvise_behavior_valid(int behavior)
 }
 
 static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int behavior, struct task_struct *task,
+			       unsigned int flags)
 {
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
-		return true;
+		return flags == 0;
+	case MADV_COLLAPSE:
+		return (flags & ~MADV_F_COLLAPSE_MASK) == 0 &&
+				(capable(CAP_SYS_ADMIN) ||
+				 (task == current) ||
+				 (flags & MADV_F_COLLAPSE_LIMITS));
 	default:
 		return false;
 	}
@@ -1182,10 +1188,11 @@ process_madvise_behavior_valid(int behavior)
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, unsigned long arg,
+		      unsigned long end, unsigned long arg, unsigned int flags,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
-				   unsigned long end, unsigned long arg))
+				   unsigned long end, unsigned long arg,
+				   unsigned int flags))
 {
 	struct vm_area_struct *vma;
 	struct vm_area_struct *prev;
@@ -1222,7 +1229,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = visit(vma, &prev, start, tmp, arg);
+		error = visit(vma, &prev, start, tmp, arg, flags);
 		if (error)
 			return error;
 		start = tmp;
@@ -1285,7 +1292,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 		return 0;
 
 	return madvise_walk_vmas(mm, start, end, (unsigned long)anon_name,
-				 madvise_vma_anon_name);
+				 madvise_vma_anon_name, MADV_F_NONE);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 /*
@@ -1359,7 +1366,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+	       int behavior, unsigned int flags)
 {
 	unsigned long end;
 	int error;
@@ -1401,8 +1409,8 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	}
 
 	blk_start_plug(&plug);
-	error = madvise_walk_vmas(mm, start, end, behavior,
-			madvise_vma_behavior);
+	error = madvise_walk_vmas(mm, start, end, behavior, flags,
+				  madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	if (write)
 		mmap_write_unlock(mm);
@@ -1414,7 +1422,8 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return do_madvise(current->mm, start, len_in, behavior,
+			  MADV_F_NONE);
 }
 
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1429,11 +1438,6 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	size_t total_len;
 	unsigned int f_flags;
 
-	if (flags != 0) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
 	if (ret < 0)
 		goto out;
@@ -1444,7 +1448,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto free_iov;
 	}
 
-	if (!process_madvise_behavior_valid(behavior)) {
+	if (!process_madvise_behavior_valid(behavior, task, flags)) {
 		ret = -EINVAL;
 		goto release_task;
 	}
@@ -1470,7 +1474,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
-					iovec.iov_len, behavior);
+					iovec.iov_len, behavior, flags);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
-- 
2.35.1.616.g0bdcbb4464-goog



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper
  2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
@ 2022-03-09 22:48   ` Yang Shi
  0 siblings, 0 replies; 57+ messages in thread
From: Yang Shi @ 2022-03-09 22:48 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Later in the series, we want to find a pmd and take different actions,
> depending on if the pmd maps a thp or not.  Currently, mm_find_pmd()
> returns NULL if a valid pmd maps a thp, and so we can't use it directly.
>
> Split mm_find_pmd() into 2 parts: mm_find_pmd_raw(), which returns a
> raw pmd pointer, and the logic that filters out non-present none, or
> huge pmds.  mm_find_pmd_raw() can then be reused later in the series.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/internal.h |  1 +
>  mm/rmap.c     | 15 +++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 86277d90a5e2..aaea25bb9096 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -166,6 +166,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
>  /*
>   * in mm/rmap.c:
>   */
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address);
>  extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 70375c331083..0ae99affcb27 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -758,13 +758,12 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>         return vma_address(page, vma);
>  }
>
> -pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +pmd_t *mm_find_pmd_raw(struct mm_struct *mm, unsigned long address)

Typically we have the new helper and the users in the same patch. It
would make the review easier.


>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd = NULL;
> -       pmd_t pmde;
>
>         pgd = pgd_offset(mm, address);
>         if (!pgd_present(*pgd))
> @@ -779,6 +778,18 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>                 goto out;
>
>         pmd = pmd_offset(pud, address);
> +out:
> +       return pmd;
> +}
> +
> +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> +{
> +       pmd_t pmde;
> +       pmd_t *pmd;
> +
> +       pmd = mm_find_pmd_raw(mm, address);
> +       if (!pmd)
> +               goto out;
>         /*
>          * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
>          * without holding anon_vma lock for write.  So when looking for a
> --
> 2.35.1.616.g0bdcbb4464-goog
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control
  2022-03-08 21:34 ` [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control Zach O'Keefe
@ 2022-03-09 22:53   ` Yang Shi
  0 siblings, 0 replies; 57+ messages in thread
From: Yang Shi @ 2022-03-09 22:53 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Modularize huge page collapse by introducing struct collapse_control.
> This structure serves to describe the properties of the requested
> collapse, as well as serve as a local scratch pad to use during the
> collapse itself.
>
> Later in the series when we introduce the madvise collapse context, we
> will want to be able to ignore khugepaged_max_ptes_[none|swap|shared]
> in said context, and so is included here as a property of the
> requested collapse.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 120 ++++++++++++++++++++++++++++++------------------
>  1 file changed, 76 insertions(+), 44 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a4e5eaf3eb01..36fc0099c445 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -85,6 +85,24 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>
>  #define MAX_PTE_MAPPED_THP 8
>
> +struct collapse_control {
> +       /* Respect khugepaged_max_ptes_[none|swap|shared] */
> +       bool enforce_pte_scan_limits;

I'm fine with having a collapse_control struct, but it seems
enforce_pte_scan_limits is actually not used until a later patch.  So,
as with patch #1, it'd be better to have new functions or new variables
in the same patch as their users.

> +
> +       /* Num pages scanned per node */
> +       int node_load[MAX_NUMNODES];
> +
> +       /* Last target selected in khugepaged_find_target_node() for this scan */
> +       int last_target_node;
> +};
> +
> +static void collapse_control_init(struct collapse_control *cc,
> +                                 bool enforce_pte_scan_limits)
> +{
> +       cc->enforce_pte_scan_limits = enforce_pte_scan_limits;
> +       cc->last_target_node = NUMA_NO_NODE;
> +}
> +
>  /**
>   * struct mm_slot - hash lookup from mm to mm_slot
>   * @hash: hash collision list
> @@ -601,6 +619,7 @@ static bool is_refcount_suitable(struct page *page)
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                                         unsigned long address,
>                                         pte_t *pte,
> +                                       bool enforce_pte_scan_limits,
>                                         struct list_head *compound_pagelist)
>  {
>         struct page *page = NULL;
> @@ -614,7 +633,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 if (pte_none(pteval) || (pte_present(pteval) &&
>                                 is_zero_pfn(pte_pfn(pteval)))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !enforce_pte_scan_limits)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -634,8 +654,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>                 VM_BUG_ON_PAGE(!PageAnon(page), page);
>
> -               if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +               if (page_mapcount(page) > 1 && enforce_pte_scan_limits &&
> +                   ++shared > khugepaged_max_ptes_shared) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out;
> @@ -785,9 +805,7 @@ static void khugepaged_alloc_sleep(void)
>         remove_wait_queue(&khugepaged_wait, &wait);
>  }
>
> -static int khugepaged_node_load[MAX_NUMNODES];
> -
> -static bool khugepaged_scan_abort(int nid)
> +static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>         int i;
>
> @@ -799,11 +817,11 @@ static bool khugepaged_scan_abort(int nid)
>                 return false;
>
>         /* If there is a count for this node already, it must be acceptable */
> -       if (khugepaged_node_load[nid])
> +       if (cc->node_load[nid])
>                 return false;
>
>         for (i = 0; i < MAX_NUMNODES; i++) {
> -               if (!khugepaged_node_load[i])
> +               if (!cc->node_load[i])
>                         continue;
>                 if (node_distance(nid, i) > node_reclaim_distance)
>                         return true;
> @@ -818,28 +836,28 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  }
>
>  #ifdef CONFIG_NUMA
> -static int khugepaged_find_target_node(void)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
> -       static int last_khugepaged_target_node = NUMA_NO_NODE;
>         int nid, target_node = 0, max_value = 0;
>
>         /* find first node with max normal pages hit */
>         for (nid = 0; nid < MAX_NUMNODES; nid++)
> -               if (khugepaged_node_load[nid] > max_value) {
> -                       max_value = khugepaged_node_load[nid];
> +               if (cc->node_load[nid] > max_value) {
> +                       max_value = cc->node_load[nid];
>                         target_node = nid;
>                 }
>
>         /* do some balance if several nodes have the same hit record */
> -       if (target_node <= last_khugepaged_target_node)
> -               for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
> -                               nid++)
> -                       if (max_value == khugepaged_node_load[nid]) {
> +       if (target_node <= cc->last_target_node)
> +               for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> +                    nid++) {
> +                       if (max_value == cc->node_load[nid]) {
>                                 target_node = nid;
>                                 break;
>                         }
> +               }
>
> -       last_khugepaged_target_node = target_node;
> +       cc->last_target_node = target_node;
>         return target_node;
>  }
>
> @@ -877,7 +895,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>         return *hpage;
>  }
>  #else
> -static int khugepaged_find_target_node(void)
> +static int khugepaged_find_target_node(struct collapse_control *cc)
>  {
>         return 0;
>  }
> @@ -1043,7 +1061,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  static void collapse_huge_page(struct mm_struct *mm,
>                                    unsigned long address,
>                                    struct page **hpage,
> -                                  int node, int referenced, int unmapped)
> +                                  int node, int referenced, int unmapped,
> +                                  int enforce_pte_scan_limits)
>  {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
> @@ -1141,7 +1160,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>
>         spin_lock(pte_ptl);
>         isolated = __collapse_huge_page_isolate(vma, address, pte,
> -                       &compound_pagelist);
> +                       enforce_pte_scan_limits, &compound_pagelist);
>         spin_unlock(pte_ptl);
>
>         if (unlikely(!isolated)) {
> @@ -1206,7 +1225,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>                                struct vm_area_struct *vma,
>                                unsigned long address,
> -                              struct page **hpage)
> +                              struct page **hpage,
> +                              struct collapse_control *cc)
>  {
>         pmd_t *pmd;
>         pte_t *pte, *_pte;
> @@ -1226,13 +1246,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 goto out;
>         }
>
> -       memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> +       memset(cc->node_load, 0, sizeof(cc->node_load));
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>         for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>              _pte++, _address += PAGE_SIZE) {
>                 pte_t pteval = *_pte;
>                 if (is_swap_pte(pteval)) {
> -                       if (++unmapped <= khugepaged_max_ptes_swap) {
> +                       if (++unmapped <= khugepaged_max_ptes_swap ||
> +                           !cc->enforce_pte_scan_limits) {
>                                 /*
>                                  * Always be strict with uffd-wp
>                                  * enabled swap entries.  Please see
> @@ -1251,7 +1272,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 }
>                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>                         if (!userfaultfd_armed(vma) &&
> -                           ++none_or_zero <= khugepaged_max_ptes_none) {
> +                           (++none_or_zero <= khugepaged_max_ptes_none ||
> +                            !cc->enforce_pte_scan_limits)) {
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_NONE_PTE;
> @@ -1282,7 +1304,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 }
>
>                 if (page_mapcount(page) > 1 &&
> -                               ++shared > khugepaged_max_ptes_shared) {
> +                               ++shared > khugepaged_max_ptes_shared &&
> +                               cc->enforce_pte_scan_limits) {
>                         result = SCAN_EXCEED_SHARED_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                         goto out_unmap;
> @@ -1292,16 +1315,16 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>
>                 /*
>                  * Record which node the original page is from and save this
> -                * information to khugepaged_node_load[].
> +                * information to cc->node_load[].
>                  * Khugepaged will allocate hugepage from the node has the max
>                  * hit record.
>                  */
>                 node = page_to_nid(page);
> -               if (khugepaged_scan_abort(node)) {
> +               if (khugepaged_scan_abort(node, cc)) {
>                         result = SCAN_SCAN_ABORT;
>                         goto out_unmap;
>                 }
> -               khugepaged_node_load[node]++;
> +               cc->node_load[node]++;
>                 if (!PageLRU(page)) {
>                         result = SCAN_PAGE_LRU;
>                         goto out_unmap;
> @@ -1352,10 +1375,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>         pte_unmap_unlock(pte, ptl);
>         if (ret) {
> -               node = khugepaged_find_target_node();
> +               node = khugepaged_find_target_node(cc);
>                 /* collapse_huge_page will return with the mmap_lock released */
>                 collapse_huge_page(mm, address, hpage, node,
> -                               referenced, unmapped);
> +                               referenced, unmapped,
> +                               cc->enforce_pte_scan_limits);
>         }
>  out:
>         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -1992,7 +2016,8 @@ static void collapse_file(struct mm_struct *mm,
>  }
>
>  static void khugepaged_scan_file(struct mm_struct *mm,
> -               struct file *file, pgoff_t start, struct page **hpage)
> +               struct file *file, pgoff_t start, struct page **hpage,
> +               struct collapse_control *cc)
>  {
>         struct page *page = NULL;
>         struct address_space *mapping = file->f_mapping;
> @@ -2003,14 +2028,15 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>
>         present = 0;
>         swap = 0;
> -       memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
> +       memset(cc->node_load, 0, sizeof(cc->node_load));
>         rcu_read_lock();
>         xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
>                 if (xas_retry(&xas, page))
>                         continue;
>
>                 if (xa_is_value(page)) {
> -                       if (++swap > khugepaged_max_ptes_swap) {
> +                       if (cc->enforce_pte_scan_limits &&
> +                           ++swap > khugepaged_max_ptes_swap) {
>                                 result = SCAN_EXCEED_SWAP_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>                                 break;
> @@ -2028,11 +2054,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>                 }
>
>                 node = page_to_nid(page);
> -               if (khugepaged_scan_abort(node)) {
> +               if (khugepaged_scan_abort(node, cc)) {
>                         result = SCAN_SCAN_ABORT;
>                         break;
>                 }
> -               khugepaged_node_load[node]++;
> +               cc->node_load[node]++;
>
>                 if (!PageLRU(page)) {
>                         result = SCAN_PAGE_LRU;
> @@ -2061,11 +2087,12 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>         rcu_read_unlock();
>
>         if (result == SCAN_SUCCEED) {
> -               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +               if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none &&
> +                   cc->enforce_pte_scan_limits) {
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> -                       node = khugepaged_find_target_node();
> +                       node = khugepaged_find_target_node(cc);
>                         collapse_file(mm, file, start, hpage, node);
>                 }
>         }
> @@ -2074,7 +2101,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>  }
>  #else
>  static void khugepaged_scan_file(struct mm_struct *mm,
> -               struct file *file, pgoff_t start, struct page **hpage)
> +               struct file *file, pgoff_t start, struct page **hpage,
> +               struct collapse_control *cc)
>  {
>         BUILD_BUG();
>  }
> @@ -2085,7 +2113,8 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
>  #endif
>
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> -                                           struct page **hpage)
> +                                           struct page **hpage,
> +                                           struct collapse_control *cc)
>         __releases(&khugepaged_mm_lock)
>         __acquires(&khugepaged_mm_lock)
>  {
> @@ -2161,12 +2190,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>
>                                 mmap_read_unlock(mm);
>                                 ret = 1;
> -                               khugepaged_scan_file(mm, file, pgoff, hpage);
> +                               khugepaged_scan_file(mm, file, pgoff, hpage, cc);
>                                 fput(file);
>                         } else {
>                                 ret = khugepaged_scan_pmd(mm, vma,
>                                                 khugepaged_scan.address,
> -                                               hpage);
> +                                               hpage, cc);
>                         }
>                         /* move to next address */
>                         khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2222,7 +2251,7 @@ static int khugepaged_wait_event(void)
>                 kthread_should_stop();
>  }
>
> -static void khugepaged_do_scan(void)
> +static void khugepaged_do_scan(struct collapse_control *cc)
>  {
>         struct page *hpage = NULL;
>         unsigned int progress = 0, pass_through_head = 0;
> @@ -2246,7 +2275,7 @@ static void khugepaged_do_scan(void)
>                 if (khugepaged_has_work() &&
>                     pass_through_head < 2)
>                         progress += khugepaged_scan_mm_slot(pages - progress,
> -                                                           &hpage);
> +                                                           &hpage, cc);
>                 else
>                         progress = pages;
>                 spin_unlock(&khugepaged_mm_lock);
> @@ -2285,12 +2314,15 @@ static void khugepaged_wait_work(void)
>  static int khugepaged(void *none)
>  {
>         struct mm_slot *mm_slot;
> +       struct collapse_control cc;
> +
> +       collapse_control_init(&cc, /* enforce_pte_scan_limits= */ 1);
>
>         set_freezable();
>         set_user_nice(current, MAX_NICE);
>
>         while (!kthread_should_stop()) {
> -               khugepaged_do_scan();
> +               khugepaged_do_scan(&cc);
>                 khugepaged_wait_work();
>         }
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  2022-03-08 21:34 ` [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count() Zach O'Keefe
@ 2022-03-09 23:15   ` Yang Shi
  0 siblings, 0 replies; 57+ messages in thread
From: Yang Shi @ 2022-03-09 23:15 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> madvise collapse context operates on pmds in batch. We will want to
> be able to revalidate a region that spans multiple pmds in the same
> vma.
>
> Add hugepage_vma_revalidate_pmd_count() which extends
> hugepage_vma_revalidate() with number of pmds to revalidate.
> hugepage_vma_revalidate() now calls through this.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 56f2ef7146c7..1d20be47bcea 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -964,18 +964,17 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  #endif
>
>  /*
> - * If mmap_lock temporarily dropped, revalidate vma
> - * before taking mmap_lock.
> - * Return 0 if succeeds, otherwise return none-zero
> - * value (scan code).
> + * Revalidate a vma's eligibility to collapse nr hugepages.
>   */
> -
> -static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> -               struct vm_area_struct **vmap)
> +static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> +                                            unsigned long address, int nr,
> +                                            struct vm_area_struct **vmap)

Same comment as before: it's better to introduce the new helper in the same patch as its users.

>  {
>         struct vm_area_struct *vma;
>         unsigned long hstart, hend;
>
> +       mmap_assert_locked(mm);
> +
>         if (unlikely(khugepaged_test_exit(mm)))
>                 return SCAN_ANY_PROCESS;
>
> @@ -985,7 +984,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
>         hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
>         hend = vma->vm_end & HPAGE_PMD_MASK;
> -       if (address < hstart || address + HPAGE_PMD_SIZE > hend)
> +       if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
>                 return SCAN_ADDRESS_RANGE;
>         if (!hugepage_vma_check(vma, vma->vm_flags))
>                 return SCAN_VMA_CHECK;
> @@ -995,6 +994,17 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         return 0;
>  }
>
> +/*
> + * If mmap_lock temporarily dropped, revalidate vma before taking mmap_lock.
> + * Return 0 if succeeds, otherwise return none-zero value (scan code).
> + */
> +
> +static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> +                                  struct vm_area_struct **vmap)
> +{
> +       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
@ 2022-03-09 23:17   ` Yang Shi
  2022-03-10  0:00     ` Zach O'Keefe
  2022-03-10 15:56   ` David Hildenbrand
  1 sibling, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-09 23:17 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> In madvise collapse context, we optionally want to be able to ignore
> advice from MADV_NOHUGEPAGE-marked regions.

Could you please elaborate on why this use case is valid? Typically
MADV_NOHUGEPAGE is set when users really don't want to have THP for
this area, so it doesn't make too much sense to ignore it IMHO.

>
> Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> which can be used to ignore vm flags used when considering thp
> eligibility.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1d20be47bcea..ecbd3fc41c80 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
>  #endif
>
>  /*
> - * Revalidate a vma's eligibility to collapse nr hugepages.
> + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> + * can be used to ignore certain vma_flags that would otherwise be checked -
> + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> + * collapse context.
>   */
>  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
>                                              unsigned long address, int nr,
> +                                            unsigned long vm_flags_ignore,
>                                              struct vm_area_struct **vmap)
>  {
>         struct vm_area_struct *vma;
> @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
>         hend = vma->vm_end & HPAGE_PMD_MASK;
>         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
>                 return SCAN_ADDRESS_RANGE;
> -       if (!hugepage_vma_check(vma, vma->vm_flags))
> +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
>                 return SCAN_VMA_CHECK;
>         /* Anon VMA expected */
>         if (!vma->anon_vma || vma->vm_ops)
> @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
>   */
>
>  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> +                                  unsigned long vm_flags_ignore,
>                                    struct vm_area_struct **vmap)
>  {
> -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> +                       vm_flags_ignore, vmap);
>  }
>
>  /*
> @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
>                 if (ret & VM_FAULT_RETRY) {
>                         mmap_read_lock(mm);
> -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
>                                 /* vma is no longer available, don't continue to swapin */
>                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>                                 return false;
> @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
>
>         mmap_read_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
>         if (result) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
> @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>          */
>         mmap_write_lock(mm);
>
> -       result = hugepage_vma_revalidate(mm, address, &vma);
> +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
>         if (result)
>                 goto out_up_write;
>         /* check if the pmd is still valid */
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  2022-03-08 21:34 ` [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP Zach O'Keefe
@ 2022-03-09 23:40   ` Yang Shi
  2022-03-10  0:46     ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-09 23:40 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> When scanning an anon pmd to see if it's eligible for collapse, return
> SCAN_PAGE_COMPOUND if the pmd already maps a thp. This is consistent
> with handling when scanning file-backed memory.

I'm not too keen on keeping anon consistent with file for this case.
SCAN_PAGE_COMPOUND typically means the page is a compound page that is
PTE-mapped.

Also, SCAN_PMD_NULL is not returned every time mm_find_pmd() returns
NULL, and the name seems ambiguous to me. What khugepaged actually sees
is a non-present (migration) entry or a trans huge entry, so maybe
rename it to SCAN_PMD_NOT_SUITABLE?

>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++++++------
>  1 file changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ecbd3fc41c80..403578161a3b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1011,6 +1011,38 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>                         vm_flags_ignore, vmap);
>  }
>
> +/*
> + * If returning NULL (meaning the pmd isn't mapped, isn't present, or thp),
> + * write the reason to *result.
> + */
> +static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
> +                                     unsigned long address,
> +                                     int *result)
> +{
> +       pmd_t *pmd = mm_find_pmd_raw(mm, address);
> +       pmd_t pmde;
> +
> +       if (!pmd) {
> +               *result = SCAN_PMD_NULL;
> +               return NULL;
> +       }
> +
> +       pmde = pmd_read_atomic(pmd);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +       barrier();
> +#endif
> +       if (!pmd_present(pmde) || !pmd_none(pmde)) {
> +               *result = SCAN_PMD_NULL;
> +               return NULL;
> +       } else if (pmd_trans_huge(pmde)) {
> +               *result = SCAN_PAGE_COMPOUND;
> +               return NULL;
> +       }
> +       return pmd;
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> @@ -1212,9 +1244,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>                 goto out_nolock;
>         }
>
> -       pmd = mm_find_pmd(mm, address);
> +       pmd = find_pmd_or_thp_or_none(mm, address, &result);
>         if (!pmd) {
> -               result = SCAN_PMD_NULL;
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
>         }
> @@ -1287,11 +1318,9 @@ static void scan_pmd(struct mm_struct *mm,
>         mmap_assert_locked(mm);
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -       pmd = mm_find_pmd(mm, address);
> -       if (!pmd) {
> -               scan_result->result = SCAN_PMD_NULL;
> +       pmd = find_pmd_or_thp_or_none(mm, address, &scan_result->result);
> +       if (!pmd)
>                 goto out;
> -       }
>
>         memset(cc->node_load, 0, sizeof(cc->node_load));
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-03-08 21:34 ` [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
@ 2022-03-09 23:43   ` Yang Shi
  2022-03-10  1:11     ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-09 23:43 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The idea of hugepage collapse in process context was previously
> introduced by David Rientjes to linux-mm[1].
>
> The idea is to introduce a new madvise mode, MADV_COLLAPSE, that allows
> users to request a synchronous collapse of memory.
>
> The benefits of this approach are:
>
> * cpu is charged to the process that wants to spend the cycles for the
>   THP
> * avoid unpredictable timing of khugepaged collapse
> * flexible separation of sync userspace and async khugepaged THP collapse
>   policies
>
> Immediate users of this new functionality include:
>
> * malloc implementations that manage memory in hugepage-sized chunks,
>   but sometimes subrelease memory back to the system in native-sized
>   chunks via MADV_DONTNEED; zapping the pmd.  Later, when the memory
>   is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
>   memory by THP to regain TLB performance.
> * immediately back executable text by hugepages.  Current support
>   provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
>   system.
>
> To keep patches digestible, introduce MADV_COLLAPSE in a few stages.
>
> Add plumbing to existing madvise infrastructure, as well as populate
> uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
> stubbed out.  Only privately-mapped anon memory is supported for now.
>
> [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  include/linux/huge_mm.h                | 12 +++++++
>  include/uapi/asm-generic/mman-common.h |  2 ++
>  mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
>  mm/madvise.c                           |  5 +++
>  4 files changed, 65 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fd905b0b2c71..407b63ab4185 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>                      int advice);
> +int madvise_collapse(struct vm_area_struct *vma,
> +                    struct vm_area_struct **prev,
> +                    unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>                            unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>         BUG();
>         return 0;
>  }
> +
> +static inline int madvise_collapse(struct vm_area_struct *vma,
> +                                  struct vm_area_struct **prev,
> +                                  unsigned long start, unsigned long end)
> +{
> +       BUG();
> +       return 0;
> +}
> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                          unsigned long start,
>                                          unsigned long end,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 12ae765c5c32..ca1e523086ed 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
>                 set_recommended_min_free_kbytes();
>         mutex_unlock(&khugepaged_mutex);
>  }
> +
> +/*
> + * Returns 0 if successfully able to collapse range into THPs (or range already
> + * backed by THPs). Due to implementation detail, THPs collapsed here may be
> + * split again before this function returns.
> + */
> +static int _madvise_collapse(struct mm_struct *mm,
> +                            struct vm_area_struct *vma,
> +                            struct vm_area_struct **prev,
> +                            unsigned long start,
> +                            unsigned long end, gfp_t gfp,
> +                            struct collapse_control *cc)
> +{
> +       /* Implemented in later patch */

Just like the earlier patches, as long as you introduce a new
function, it is better to keep it with its users in the same patch.
And typically we don't do the "implement in the later patch" thing, it
makes review harder.

> +       return -ENOSYS;
> +}
> +
> +int madvise_collapse(struct vm_area_struct *vma,
> +                    struct vm_area_struct **prev, unsigned long start,
> +                    unsigned long end)
> +{
> +       struct collapse_control cc;
> +       gfp_t gfp;
> +       int error;
> +       struct mm_struct *mm = vma->vm_mm;
> +
> +       /* Requested to hold mmap_lock in read */
> +       mmap_assert_locked(mm);
> +
> +       mmgrab(mm);
> +       collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
> +       gfp = vma_thp_gfp_mask(vma);
> +       lru_add_drain(); /* lru_add_drain_all() too heavy here */
> +       error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
> +       mmap_assert_locked(mm);
> +       mmdrop(mm);
> +
> +       /*
> +        * madvise() returns EAGAIN if kernel resources are temporarily
> +        * unavailable.
> +        */
> +       if (error == -ENOMEM)
> +               error = -EAGAIN;
> +
> +       return error;
> +}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5b6d796e55de..292aa017c150 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_FREE:
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
> +       case MADV_COLLAPSE:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 if (error)
>                         goto out;
>                 break;
> +       case MADV_COLLAPSE:
> +               return madvise_collapse(vma, prev, start, end);
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
> +       case MADV_COLLAPSE:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-09 23:17   ` Yang Shi
@ 2022-03-10  0:00     ` Zach O'Keefe
  2022-03-10  0:41       ` Yang Shi
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10  0:00 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

> On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > In madvise collapse context, we optionally want to be able to ignore
> > advice from MADV_NOHUGEPAGE-marked regions.
>
> Could you please elaborate on why this use case is valid? Typically
> MADV_NOHUGEPAGE is set when users really don't want to have THP for
> this area, so it doesn't make too much sense to ignore it IMHO.
>

Hey Yang, thanks for taking the time to review and comment.

Semantically, the way I see it is that MADV_NOHUGEPAGE is a way for
the user to say "I don't want hugepages here", so that the kernel
knows not to back it with hugepages at fault and khugepaged can stay away.
However, in MADV_COLLAPSE, the user is explicitly requesting this be
backed by hugepages - so presumably that is exactly what they want.

IOW, if the user didn't want this memory to be backed by hugepages,
they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
the user wanted collapsed, but that had some sub-areas marked
MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
operations around the excluded regions.
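
For example, a rough userspace sketch (collapse_around() is just a
made-up helper for illustration, and MADV_COLLAPSE is as defined in the
uapi patch later in this series):

        #include <sys/mman.h>

        #ifndef MADV_COLLAPSE
        #define MADV_COLLAPSE 25        /* from this series' uapi patch */
        #endif

        /*
         * Collapse [start, end) while leaving a MADV_NOHUGEPAGE-marked
         * sub-range [skip_start, skip_end) untouched; all bounds are
         * assumed to be hugepage-aligned.
         */
        static int collapse_around(char *start, char *skip_start,
                                   char *skip_end, char *end)
        {
                if (skip_start > start &&
                    madvise(start, skip_start - start, MADV_COLLAPSE))
                        return -1;
                if (end > skip_end &&
                    madvise(skip_end, end - skip_end, MADV_COLLAPSE))
                        return -1;
                return 0;
        }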

In terms of use cases, I don't have a concrete example, but a user
could hypothetically choose to exclude regions from management by
khugepaged while still being able to collapse the memory themselves
when/if they deem it appropriate.

> >
> > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > which can be used to ignore vm flags used when considering thp
> > eligibility.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 1d20be47bcea..ecbd3fc41c80 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> >  #endif
> >
> >  /*
> > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > + * collapse context.
> >   */
> >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> >                                              unsigned long address, int nr,
> > +                                            unsigned long vm_flags_ignore,
> >                                              struct vm_area_struct **vmap)
> >  {
> >         struct vm_area_struct *vma;
> > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> >         hend = vma->vm_end & HPAGE_PMD_MASK;
> >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> >                 return SCAN_ADDRESS_RANGE;
> > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> >                 return SCAN_VMA_CHECK;
> >         /* Anon VMA expected */
> >         if (!vma->anon_vma || vma->vm_ops)
> > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> >   */
> >
> >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > +                                  unsigned long vm_flags_ignore,
> >                                    struct vm_area_struct **vmap)
> >  {
> > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > +                       vm_flags_ignore, vmap);
> >  }
> >
> >  /*
> > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> >                 if (ret & VM_FAULT_RETRY) {
> >                         mmap_read_lock(mm);
> > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> >                                 /* vma is no longer available, don't continue to swapin */
> >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> >                                 return false;
> > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> >
> >         mmap_read_lock(mm);
> > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> >         if (result) {
> >                 mmap_read_unlock(mm);
> >                 goto out_nolock;
> > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >          */
> >         mmap_write_lock(mm);
> >
> > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> >         if (result)
> >                 goto out_up_write;
> >         /* check if the pmd is still valid */
> > --
> > 2.35.1.616.g0bdcbb4464-goog
> >


* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-08 21:34 ` [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse Zach O'Keefe
@ 2022-03-10  0:06   ` Yang Shi
  2022-03-10 19:26     ` David Rientjes
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-10  0:06 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Introduce the main madvise collapse batched logic, including the overall
> locking strategy.  Individual batched actions, such as scanning pmds in
> batch, are stubbed out here and will be added later in the series.
>
> Note the main benefit of doing all this work in a batched manner is
> that __madvise_collapse_pmd_batch() (stubbed out) can be called under
> a single mmap_lock write lock.

I don't get why this is preferred. Isn't it preferable to minimize the
scope of the write mmap_lock? If you batch a large number of PMDs,
MADV_COLLAPSE may hold the write mmap_lock for a long time, which
doesn't seem like it would scale.

>
> Per-batch data is stored in a struct madvise_collapse_data array, with
> an entry for each pmd to collapse, and is shared between the various
> *_batch actions.  This allows for partial success of collapsing a range
> of pmds - we continue as long as some pmds can be successfully
> collapsed.
>
> A "success" here, is where all pmds can be (or already are) collapsed.
> On failure, the caller will need to verify what, if any, partial
> successes occurred via smaps or otherwise.

The further question is why you have to batch at all. My first guess
was that you want an all-or-nothing result, where either all valid PMDs
get collapsed or none do, but it seems partial collapse is fine, so I
don't see why batching is necessary.

One side effect, off the top of my head, is that you may preallocate a
lot of huge pages and then have the collapse blocked (for example,
unable to take the mmap_lock or ptl) while the system is under memory
pressure, yet the preallocated huge pages can't be reclaimed at all.

Could you please elaborate on why you didn't go with a non-batched
approach?

>
> Also note that, where possible, if collapse fails for a particular pmd
> after a hugepage has already been allocated, said hugepage is kept on a
> per-node free list for the purpose of backing subsequent pmd collapses.
> All unused hugepages are returned before _madvise_collapse() returns.
>
> Note that a bisect at this patch won't break; madvise(MADV_COLLAPSE) will
> simply always return -1.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  mm/khugepaged.c | 279 ++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 273 insertions(+), 6 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ca1e523086ed..ea53c706602e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -86,6 +86,9 @@ static struct kmem_cache *mm_slot_cache __read_mostly;
>  #define MAX_PTE_MAPPED_THP 8
>
>  struct collapse_control {
> +       /* Used by MADV_COLLAPSE batch collapse */
> +       struct list_head free_hpages[MAX_NUMNODES];
> +
>         /* Respect khugepaged_max_ptes_[none|swap|shared] */
>         bool enforce_pte_scan_limits;
>
> @@ -99,8 +102,13 @@ struct collapse_control {
>  static void collapse_control_init(struct collapse_control *cc,
>                                   bool enforce_pte_scan_limits)
>  {
> +       int i;
> +
>         cc->enforce_pte_scan_limits = enforce_pte_scan_limits;
>         cc->last_target_node = NUMA_NO_NODE;
> +
> +       for (i = 0; i < MAX_NUMNODES; ++i)
> +               INIT_LIST_HEAD(cc->free_hpages + i);
>  }
>
>  /**
> @@ -1033,7 +1041,7 @@ static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
>         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
>         barrier();
>  #endif
> -       if (!pmd_present(pmde) || !pmd_none(pmde)) {
> +       if (!pmd_present(pmde) || pmd_none(pmde)) {
>                 *result = SCAN_PMD_NULL;
>                 return NULL;
>         } else if (pmd_trans_huge(pmde)) {
> @@ -1054,12 +1062,16 @@ static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
>  static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>                                         struct vm_area_struct *vma,
>                                         unsigned long haddr, pmd_t *pmd,
> -                                       int referenced)
> +                                       int referenced,
> +                                       unsigned long vm_flags_ignored,
> +                                       bool *mmap_lock_dropped)
>  {
>         int swapped_in = 0;
>         vm_fault_t ret = 0;
>         unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
>
> +       if (mmap_lock_dropped)
> +               *mmap_lock_dropped = false;
>         for (address = haddr; address < end; address += PAGE_SIZE) {
>                 struct vm_fault vmf = {
>                         .vma = vma,
> @@ -1080,8 +1092,10 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>
>                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
>                 if (ret & VM_FAULT_RETRY) {
> +                       if (mmap_lock_dropped)
> +                               *mmap_lock_dropped = true;
>                         mmap_read_lock(mm);
> -                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> +                       if (hugepage_vma_revalidate(mm, haddr, vm_flags_ignored, &vma)) {
>                                 /* vma is no longer available, don't continue to swapin */
>                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>                                 return false;
> @@ -1256,7 +1270,8 @@ static void khugepaged_collapse_huge_page(struct mm_struct *mm,
>          * Continuing to collapse causes inconsistency.
>          */
>         if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
> -                                                    pmd, referenced)) {
> +                                                    pmd, referenced, VM_NONE,
> +                                                    NULL)) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
>         }
> @@ -2520,6 +2535,128 @@ void khugepaged_min_free_kbytes_update(void)
>         mutex_unlock(&khugepaged_mutex);
>  }
>
> +struct madvise_collapse_data {
> +       struct page *hpage; /* Preallocated THP */
> +       bool continue_collapse;  /* Should we attempt / continue collapse? */
> +
> +       struct scan_pmd_result scan_result;
> +       pmd_t *pmd;
> +};
> +
> +static int
> +madvise_collapse_vma_revalidate_pmd_count(struct mm_struct *mm,
> +                                         unsigned long address, int nr,
> +                                         struct vm_area_struct **vmap)
> +{
> +       /* madvise_collapse() ignores MADV_NOHUGEPAGE */
> +       return hugepage_vma_revalidate_pmd_count(mm, address, nr, VM_NOHUGEPAGE,
> +                       vmap);
> +}
> +
> +/*
> + * Scan pmd to see which we can collapse, and to determine node to allocate on.
> + *
> + * Must be called with mmap_lock in read, and returns with the lock held in
> + * read. Does not drop the lock.
> + *
> + * Set batch_data[i]->continue_collapse to false for any pmd that can't be
> + * collapsed.
> + *
> + * Return the number of existing THPs in batch.
> + */
> +static int
> +__madvise_collapse_scan_pmd_batch(struct mm_struct *mm,
> +                                 struct vm_area_struct *vma,
> +                                 unsigned long batch_start,
> +                                 struct madvise_collapse_data *batch_data,
> +                                 int batch_size,
> +                                 struct collapse_control *cc)
> +{
> +       /* Implemented in later patch */
> +       return 0;
> +}
> +
> +/*
> + * Preallocate and charge huge page for each pmd in the batch, store the
> + * new page in batch_data[i]->hpage.
> + *
> + * Return the number of huge pages allocated.
> + */
> +static int
> +__madvise_collapse_prealloc_hpages_batch(struct mm_struct *mm,
> +                                        gfp_t gfp,
> +                                        int node,
> +                                        struct madvise_collapse_data *batch_data,
> +                                        int batch_size,
> +                                        struct collapse_control *cc)
> +{
> +       /* Implemented in later patch */
> +       return 0;
> +}
> +
> +/*
> + * Do swapin for all ranges in batch, returns true iff successful.
> + *
> + * Called with mmap_lock held in read, and returns with it held in read.
> + * Might drop the lock.
> + *
> + * Set batch_data[i]->continue_collapse to false for any pmd that can't be
> + * collapsed. Else, set batch_data[i]->pmd to the found pmd.
> + */
> +static bool
> +__madvise_collapse_swapin_pmd_batch(struct mm_struct *mm,
> +                                   int node,
> +                                   unsigned long batch_start,
> +                                   struct madvise_collapse_data *batch_data,
> +                                   int batch_size,
> +                                   struct collapse_control *cc)
> +
> +{
> +       /* Implemented in later patch */
> +       return true;
> +}
> +
> +/*
> + * Do the collapse operation. Return number of THPs collapsed successfully.
> + *
> + * Called with mmap_lock held in write, and returns with it held. Does not
> + * drop the lock.
> + */
> +static int
> +__madvise_collapse_pmd_batch(struct mm_struct *mm,
> +                            unsigned long batch_start,
> +                            int batch_size,
> +                            struct madvise_collapse_data *batch_data,
> +                            int node,
> +                            struct collapse_control *cc)
> +{
> +       /* Implemented in later patch */
> +       return 0;
> +}
> +
> +static bool continue_collapse(struct madvise_collapse_data *batch_data,
> +                             int batch_size)
> +{
> +       int i;
> +
> +       for (i = 0; i < batch_size; ++i)
> +               if (batch_data[i].continue_collapse)
> +                       return true;
> +       return false;
> +}
> +
> +static bool madvise_transparent_hugepage_enabled(struct vm_area_struct *vma)
> +{
> +       if (vma_is_anonymous(vma))
> +               /* madvise_collapse() ignores MADV_NOHUGEPAGE */
> +               return __transparent_hugepage_enabled(vma, vma->vm_flags &
> +                                                     ~VM_NOHUGEPAGE);
> +       /* TODO: Support file-backed memory */
> +       return false;
> +}
> +
> +#define MADVISE_COLLAPSE_BATCH_SIZE 8
> +
>  /*
>   * Returns 0 if successfully able to collapse range into THPs (or range already
>   * backed by THPs). Due to implementation detail, THPs collapsed here may be
> @@ -2532,8 +2669,138 @@ static int _madvise_collapse(struct mm_struct *mm,
>                              unsigned long end, gfp_t gfp,
>                              struct collapse_control *cc)
>  {
> -       /* Implemented in later patch */
> -       return -ENOSYS;
> +       unsigned long hstart, hend, batch_addr;
> +       int ret = -EINVAL, collapsed = 0, nr_hpages = 0, i;
> +       struct madvise_collapse_data batch_data[MADVISE_COLLAPSE_BATCH_SIZE];
> +
> +       mmap_assert_locked(mm);
> +       BUG_ON(vma->vm_start > start);
> +       BUG_ON(vma->vm_end < end);
> +       VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
> +
> +       hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
> +       hend = end & HPAGE_PMD_MASK;
> +       nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
> +       if (hstart >= hend)
> +               goto out;
> +
> +       if (!madvise_transparent_hugepage_enabled(vma))
> +               goto out;
> +
> +       /*
> +        * Request might cover multiple hugepages. Strategy is to batch
> +        * allocation and collapse operations so that we do more work while
> +        * mmap_lock is held exclusively.
> +        *
> +        * While processing batch, mmap_lock is locked/unlocked many times for
> +        * the supplied VMA. It's possible that the original VMA is split while
> +        * lock was dropped. If in the context of the (possibly new) VMA, THP
> +        * collapse is possible, we continue.
> +        */
> +       for (batch_addr = hstart;
> +            batch_addr < hend;
> +            batch_addr += HPAGE_PMD_SIZE * MADVISE_COLLAPSE_BATCH_SIZE) {
> +               int node, batch_size;
> +               int thps; /* Number of existing THPs in range */
> +
> +               batch_size = (hend - batch_addr) >> HPAGE_PMD_SHIFT;
> +               batch_size = min_t(int, batch_size,
> +                                  MADVISE_COLLAPSE_BATCH_SIZE);
> +
> +               BUG_ON(batch_size <= 0);
> +               memset(batch_data, 0, sizeof(batch_data));
> +               cond_resched();
> +               VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
> +
> +               /*
> +                * If first batch, we still hold mmap_lock from madvise
> +                * call and haven't dropped it since checking the VMA. Else,
> +                * we've dropped the lock and we need to revalidate.
> +                */
> +               if (batch_addr != hstart) {
> +                       mmap_read_lock(mm);
> +                       if (madvise_collapse_vma_revalidate_pmd_count(mm,
> +                                                                     batch_addr,
> +                                                                     batch_size,
> +                                                                     &vma))
> +                               goto loop_unlock_break;
> +               }
> +
> +               mmap_assert_locked(mm);
> +
> +               thps = __madvise_collapse_scan_pmd_batch(mm, vma, batch_addr,
> +                                                        batch_data, batch_size,
> +                                                        cc);
> +               mmap_read_unlock(mm);
> +
> +               /* Count existing THPs as-if we collapsed them */
> +               collapsed += thps;
> +               if (thps == batch_size || !continue_collapse(batch_data,
> +                                                            batch_size))
> +                       continue;
> +
> +               node = find_target_node(cc);
> +               if (!__madvise_collapse_prealloc_hpages_batch(mm, gfp, node,
> +                                                             batch_data,
> +                                                             batch_size, cc)) {
> +                       /* No more THPs available - so give up */
> +                       ret = -ENOMEM;
> +                       break;
> +               }
> +
> +               mmap_read_lock(mm);
> +               if (!__madvise_collapse_swapin_pmd_batch(mm, node, batch_addr,
> +                                                        batch_data, batch_size,
> +                                                        cc))
> +                       goto loop_unlock_break;
> +               mmap_read_unlock(mm);
> +               mmap_write_lock(mm);
> +               collapsed += __madvise_collapse_pmd_batch(mm,
> +                               batch_addr, batch_size, batch_data,
> +                               node, cc);
> +               mmap_write_unlock(mm);
> +
> +               for (i = 0; i < batch_size; ++i) {
> +                       struct page *page = batch_data[i].hpage;
> +
> +                       if (page && !IS_ERR(page)) {
> +                               list_add_tail(&page->lru,
> +                                             &cc->free_hpages[node]);
> +                               batch_data[i].hpage = NULL;
> +                       }
> +               }
> +               /* mmap_lock is unlocked here */
> +               continue;
> +loop_unlock_break:
> +               mmap_read_unlock(mm);
> +               break;
> +       }
> +       /* mmap_lock is unlocked here */
> +
> +       for (i = 0; i < MADVISE_COLLAPSE_BATCH_SIZE; ++i) {
> +               struct page *page = batch_data[i].hpage;
> +
> +               if (page && !IS_ERR(page)) {
> +                       mem_cgroup_uncharge(page_folio(page));
> +                       put_page(page);
> +               }
> +       }
> +       for (i = 0; i < MAX_NUMNODES; ++i) {
> +               struct page *page, *tmp;
> +
> +               list_for_each_entry_safe(page, tmp, cc->free_hpages + i, lru) {
> +                       list_del(&page->lru);
> +                       mem_cgroup_uncharge(page_folio(page));
> +                       put_page(page);
> +               }
> +       }
> +       ret = collapsed == nr_hpages ? 0 : -1;
> +       vma = NULL;             /* tell sys_madvise we dropped mmap_lock */
> +       mmap_read_lock(mm);     /* sys_madvise expects us to have mmap_lock */
> +out:
> +       *prev = vma;            /* we didn't drop mmap_lock, so this holds */
> +
> +       return ret;
>  }
>
>  int madvise_collapse(struct vm_area_struct *vma,
> --
> 2.35.1.616.g0bdcbb4464-goog
>


* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10  0:00     ` Zach O'Keefe
@ 2022-03-10  0:41       ` Yang Shi
  2022-03-10  1:09         ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-10  0:41 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > In madvise collapse context, we optionally want to be able to ignore
> > > advice from MADV_NOHUGEPAGE-marked regions.
> >
> > Could you please elaborate why this usecase is valid? Typically
> > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > for this area. So it doesn't make too much sense to ignore it IMHO.
> >
>
> Hey Yang, thanks for taking time to review and comment.
>
> Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> the user to say "I don't want hugepages here", so that the kernel
> knows not to do so when faulting memory, and khugepaged can stay away.
> However, in MADV_COLLAPSE, the user is explicitly requesting this be
> backed by hugepages - so presumably that is exactly what they want.
>
> IOW, if the user didn't want this memory to be backed by hugepages,
> they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> the user wanted collapsed, but that had some sub-areas marked
> MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> operations around the excluded regions.
>
> In terms of use cases, I don't have a concrete example, but a user
> could hypothetically choose to exclude regions from management from
> khugepaged, but still be able to collapse the memory themselves,
> when/if they deem appropriate.

I see. It seems you thought MADV_COLLAPSE actually unsets
VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
right away, right? To some degree, it makes some sense. If this is the
behavior you'd like to achieve, I'd suggest making it more explicit,
for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
than ignore or change vm flags silently. When using madvise mode, but
not having VM_HUGEPAGE set, the vma check should fail in the current
code (I didn't look hard if you already covered this or not).
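
To make the "multiple MADV_COLLAPSE operations around the excluded
regions" idea quoted above concrete, a minimal userspace sketch follows.
It assumes the MADV_COLLAPSE value proposed by this series (25, not in
any released uapi header) and simply has the caller skip the
MADV_NOHUGEPAGE sub-range rather than touching its vma flags:

/*
 * Illustrative sketch only (not part of the series): collapse around a
 * sub-range marked MADV_NOHUGEPAGE by issuing two separate MADV_COLLAPSE
 * requests. MADV_COLLAPSE (25) is the value proposed by this RFC.
 */
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
	size_t len = 6 * HPAGE_SIZE;
	char *raw, *p;

	/* Over-map by one hugepage so a 2M-aligned window can be used. */
	raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	memset(p, 1, len);	/* fault in native pages */

	/* Suppose the middle hugepage-sized region is marked NOHUGEPAGE. */
	madvise(p + 2 * HPAGE_SIZE, HPAGE_SIZE, MADV_NOHUGEPAGE);

	/* Collapse the memory on either side with two separate requests. */
	if (madvise(p, 2 * HPAGE_SIZE, MADV_COLLAPSE))
		perror("MADV_COLLAPSE (first range)");
	if (madvise(p + 3 * HPAGE_SIZE, 3 * HPAGE_SIZE, MADV_COLLAPSE))
		perror("MADV_COLLAPSE (second range)");
	return 0;
}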

>
> > >
> > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > which can be used to ignore vm flags used when considering thp
> > > eligibility.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  mm/khugepaged.c | 18 ++++++++++++------
> > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > >  #endif
> > >
> > >  /*
> > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > + * collapse context.
> > >   */
> > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > >                                              unsigned long address, int nr,
> > > +                                            unsigned long vm_flags_ignore,
> > >                                              struct vm_area_struct **vmap)
> > >  {
> > >         struct vm_area_struct *vma;
> > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > >                 return SCAN_ADDRESS_RANGE;
> > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > >                 return SCAN_VMA_CHECK;
> > >         /* Anon VMA expected */
> > >         if (!vma->anon_vma || vma->vm_ops)
> > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > >   */
> > >
> > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > +                                  unsigned long vm_flags_ignore,
> > >                                    struct vm_area_struct **vmap)
> > >  {
> > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > +                       vm_flags_ignore, vmap);
> > >  }
> > >
> > >  /*
> > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > >                 if (ret & VM_FAULT_RETRY) {
> > >                         mmap_read_lock(mm);
> > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > >                                 /* vma is no longer available, don't continue to swapin */
> > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > >                                 return false;
> > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > >
> > >         mmap_read_lock(mm);
> > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > >         if (result) {
> > >                 mmap_read_unlock(mm);
> > >                 goto out_nolock;
> > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >          */
> > >         mmap_write_lock(mm);
> > >
> > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > >         if (result)
> > >                 goto out_up_write;
> > >         /* check if the pmd is still valid */
> > > --
> > > 2.35.1.616.g0bdcbb4464-goog
> > >



* Re: [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  2022-03-09 23:40   ` Yang Shi
@ 2022-03-10  0:46     ` Zach O'Keefe
  2022-03-10  2:05       ` Yang Shi
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10  0:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 3:40 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > When scanning an anon pmd to see if it's eligible for collapse, return
> > SCAN_PAGE_COMPOUND if the pmd already maps a thp. This is consistent
> > with handling when scanning file-backed memory.
>
> I'm not quite keen that we have to keep anon consistent with file for
> this case. SCAN_PAGE_COMPOUND typically means the page is compound
> page, but PTE mapped.
>

Good point.

> And even SCAN_PMD_NULL is not returned every time when mm_find_pmd()
> returns NULL. In addition, SCAN_PMD_NULL seems ambiguous to me. The
> khugepaged actually sees non-present (migration) entry or trans huge
> entry, so may rename it to SCAN_PMD_NOT_SUITABLE?
>

Sorry, I'm not sure I understand the suggestion here. What this patch
would like to do, is to identify what pmds map thps. This will be
important later, since if a user requests a collapse of an
already-collapsed region, we want to return successfully (even if no
work to be done).

Maybe there should be a SCAN_PMD_MAPPED used here instead? Just to not
overload SCAN_PAGE_COMPOUND?

Though, note that when MADV_COLLAPSE supports file-backed memory, a
similar check for pmd-mapping will need to be made on the file-side of
things.
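
A toy model of that intent (not kernel code; SCAN_PMD_MAPPED is only the
name floated in this thread): the point is simply that the MADV_COLLAPSE
path would fold "already PMD-mapped" into success instead of reporting it
as a scan failure.

#include <stdio.h>

/* Toy scan results; the real enum lives in mm/khugepaged.c. */
enum scan_result {
	SCAN_SUCCEED,
	SCAN_PMD_NULL,
	SCAN_PMD_MAPPED,	/* pmd already maps a THP */
	SCAN_VMA_CHECK,
};

/* Would an MADV_COLLAPSE caller treat this scan result as "done"? */
static int result_is_success(enum scan_result r)
{
	return r == SCAN_SUCCEED || r == SCAN_PMD_MAPPED;
}

int main(void)
{
	printf("already collapsed -> %s\n",
	       result_is_success(SCAN_PMD_MAPPED) ? "success" : "error");
	return 0;
}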

> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++++++------
> >  1 file changed, 35 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ecbd3fc41c80..403578161a3b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1011,6 +1011,38 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >                         vm_flags_ignore, vmap);
> >  }
> >
> > +/*
> > + * If returning NULL (meaning the pmd isn't mapped, isn't present, or thp),
> > + * write the reason to *result.
> > + */
> > +static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
> > +                                     unsigned long address,
> > +                                     int *result)
> > +{
> > +       pmd_t *pmd = mm_find_pmd_raw(mm, address);
> > +       pmd_t pmde;
> > +
> > +       if (!pmd) {
> > +               *result = SCAN_PMD_NULL;
> > +               return NULL;
> > +       }
> > +
> > +       pmde = pmd_read_atomic(pmd);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > +       barrier();
> > +#endif
> > > +       if (!pmd_present(pmde) || pmd_none(pmde)) {
> > +               *result = SCAN_PMD_NULL;
> > +               return NULL;
> > +       } else if (pmd_trans_huge(pmde)) {
> > +               *result = SCAN_PAGE_COMPOUND;
> > +               return NULL;
> > +       }
> > +       return pmd;
> > +}
> > +
> >  /*
> >   * Bring missing pages in from swap, to complete THP collapse.
> >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > @@ -1212,9 +1244,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> >                 goto out_nolock;
> >         }
> >
> > -       pmd = mm_find_pmd(mm, address);
> > +       pmd = find_pmd_or_thp_or_none(mm, address, &result);
> >         if (!pmd) {
> > -               result = SCAN_PMD_NULL;
> >                 mmap_read_unlock(mm);
> >                 goto out_nolock;
> >         }
> > @@ -1287,11 +1318,9 @@ static void scan_pmd(struct mm_struct *mm,
> >         mmap_assert_locked(mm);
> >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > -       pmd = mm_find_pmd(mm, address);
> > -       if (!pmd) {
> > -               scan_result->result = SCAN_PMD_NULL;
> > +       pmd = find_pmd_or_thp_or_none(mm, address, &scan_result->result);
> > +       if (!pmd)
> >                 goto out;
> > -       }
> >
> >         memset(cc->node_load, 0, sizeof(cc->node_load));
> >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > --
> > 2.35.1.616.g0bdcbb4464-goog
> >



* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10  0:41       ` Yang Shi
@ 2022-03-10  1:09         ` Zach O'Keefe
  2022-03-10  2:16           ` Yang Shi
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10  1:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > In madvise collapse context, we optionally want to be able to ignore
> > > > advice from MADV_NOHUGEPAGE-marked regions.
> > >
> > > Could you please elaborate why this usecase is valid? Typically
> > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > >
> >
> > Hey Yang, thanks for taking time to review and comment.
> >
> > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > the user to say "I don't want hugepages here", so that the kernel
> > knows not to do so when faulting memory, and khugepaged can stay away.
> > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > backed by hugepages - so presumably that is exactly what they want.
> >
> > IOW, if the user didn't want this memory to be backed by hugepages,
> > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > the user wanted collapsed, but that had some sub-areas marked
> > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > operations around the excluded regions.
> >
> > In terms of use cases, I don't have a concrete example, but a user
> > could hypothetically choose to exclude regions from management from
> > khugepaged, but still be able to collapse the memory themselves,
> > when/if they deem appropriate.
>
> I see. It seems you thought MADV_COLLAPSE actually unsets
> VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> right away, right? To some degree, it makes some sense.

Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
ignores VM_NOHUGEPAGE, and so it's not really the same as
MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
sure it does) then we don't want to be altering the vma flags since we
don't want to touch khugepaged behavior.

> If this is the
> behavior you'd like to achieve, I'd suggest making it more explicit,
> for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> than ignore or change vm flags silently. When using madvise mode, but
> not having VM_HUGEPAGE set, the vma check should fail in the current
> code (I didn't look hard if you already covered this or not).
>

You're correct, this will fail, since it's following the same
semantics as the fault path. I see what you're saying though; that
perhaps this is inconsistent with my above reasoning that "the user
asked to collapse this memory, and so we should do it". If so, then
> perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
entirely for the purposes of eligibility, and only uses it for the
purposes of determining gfp flags for compaction/reclaim. Pushing that
further, compaction/reclaim could entirely be specified by the user
using a process_madvise(2) flag (later in the series, we do something
like this).


> >
> > > >
> > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > which can be used to ignore vm flags used when considering thp
> > > > eligibility.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > >  #endif
> > > >
> > > >  /*
> > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > + * collapse context.
> > > >   */
> > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > >                                              unsigned long address, int nr,
> > > > +                                            unsigned long vm_flags_ignore,
> > > >                                              struct vm_area_struct **vmap)
> > > >  {
> > > >         struct vm_area_struct *vma;
> > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > >                 return SCAN_ADDRESS_RANGE;
> > > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > >                 return SCAN_VMA_CHECK;
> > > >         /* Anon VMA expected */
> > > >         if (!vma->anon_vma || vma->vm_ops)
> > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > >   */
> > > >
> > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > +                                  unsigned long vm_flags_ignore,
> > > >                                    struct vm_area_struct **vmap)
> > > >  {
> > > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > +                       vm_flags_ignore, vmap);
> > > >  }
> > > >
> > > >  /*
> > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > >                 if (ret & VM_FAULT_RETRY) {
> > > >                         mmap_read_lock(mm);
> > > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > >                                 /* vma is no longer available, don't continue to swapin */
> > > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > >                                 return false;
> > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > >
> > > >         mmap_read_lock(mm);
> > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > >         if (result) {
> > > >                 mmap_read_unlock(mm);
> > > >                 goto out_nolock;
> > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >          */
> > > >         mmap_write_lock(mm);
> > > >
> > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > >         if (result)
> > > >                 goto out_up_write;
> > > >         /* check if the pmd is still valid */
> > > > --
> > > > 2.35.1.616.g0bdcbb4464-goog
> > > >



* Re: [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  2022-03-09 23:43   ` Yang Shi
@ 2022-03-10  1:11     ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10  1:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

Hey Yang. Ack this, as well as similar feedback from earlier in the
series. I was really just trying to avoid a giant monolithic patch that
would also make review harder. I can rework it though. Thank you

On Wed, Mar 9, 2022 at 3:44 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > The idea of hugepage collapse in process context was previously
> > introduced by David Rientjes to linux-mm[1].
> >
> > The idea is to introduce a new madvise mode, MADV_COLLAPSE, that allows
> > users to request a synchronous collapse of memory.
> >
> > The benefits of this approach are:
> >
> > * cpu is charged to the process that wants to spend the cycles for the
> >   THP
> > * avoid unpredictable timing of khugepaged collapse
> > * flexible separation of sync userspace and async khugepaged THP collapse
> >   policies
> >
> > Immediate users of this new functionality include:
> >
> > * malloc implementations that manage memory in hugepage-sized chunks,
> >   but sometimes subrelease memory back to the system in native-sized
> >   chunks via MADV_DONTNEED; zapping the pmd.  Later, when the memory
> >   is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
> >   memory by THP to regain TLB performance.
> > * immediately back executable text by hugepages.  Current support
> >   provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
> >   system.
> >
> > To keep patches digestible, introduce MADV_COLLAPSE in a few stages.
> >
> > Add plumbing to existing madvise infrastructure, as well as populate
> > uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
> > stubbed out.  Only privately-mapped anon memory is supported for now.
> >
> > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  include/linux/huge_mm.h                | 12 +++++++
> >  include/uapi/asm-generic/mman-common.h |  2 ++
> >  mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
> >  mm/madvise.c                           |  5 +++
> >  4 files changed, 65 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fd905b0b2c71..407b63ab4185 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >
> >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                      int advice);
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +                    struct vm_area_struct **prev,
> > +                    unsigned long start, unsigned long end);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                            unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >         BUG();
> >         return 0;
> >  }
> > +
> > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > +                                  struct vm_area_struct **prev,
> > +                                  unsigned long start, unsigned long end)
> > +{
> > +       BUG();
> > +       return 0;
> > +}
> > +
> >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >                                          unsigned long start,
> >                                          unsigned long end,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED   24      /* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE       0
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 12ae765c5c32..ca1e523086ed 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
> >                 set_recommended_min_free_kbytes();
> >         mutex_unlock(&khugepaged_mutex);
> >  }
> > +
> > +/*
> > + * Returns 0 if successfully able to collapse range into THPs (or range already
> > + * backed by THPs). Due to implementation detail, THPs collapsed here may be
> > + * split again before this function returns.
> > + */
> > +static int _madvise_collapse(struct mm_struct *mm,
> > +                            struct vm_area_struct *vma,
> > +                            struct vm_area_struct **prev,
> > +                            unsigned long start,
> > +                            unsigned long end, gfp_t gfp,
> > +                            struct collapse_control *cc)
> > +{
> > +       /* Implemented in later patch */
>
> Just like the earlier patches, as long as you introduce a new
> function, it is better to keep it with its users in the same patch.
> And typically we don't do the "implement in the later patch" thing, it
> makes review harder.
>
> > +       return -ENOSYS;
> > +}
> > +
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +                    struct vm_area_struct **prev, unsigned long start,
> > +                    unsigned long end)
> > +{
> > +       struct collapse_control cc;
> > +       gfp_t gfp;
> > +       int error;
> > +       struct mm_struct *mm = vma->vm_mm;
> > +
> > +       /* Requested to hold mmap_lock in read */
> > +       mmap_assert_locked(mm);
> > +
> > +       mmgrab(mm);
> > +       collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
> > +       gfp = vma_thp_gfp_mask(vma);
> > +       lru_add_drain(); /* lru_add_drain_all() too heavy here */
> > +       error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
> > +       mmap_assert_locked(mm);
> > +       mmdrop(mm);
> > +
> > +       /*
> > +        * madvise() returns EAGAIN if kernel resources are temporarily
> > +        * unavailable.
> > +        */
> > +       if (error == -ENOMEM)
> > +               error = -EAGAIN;
> > +
> > +       return error;
> > +}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 5b6d796e55de..292aa017c150 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_FREE:
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> > +       case MADV_COLLAPSE:
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> > @@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                 if (error)
> >                         goto out;
> >                 break;
> > +       case MADV_COLLAPSE:
> > +               return madvise_collapse(vma, prev, start, end);
> >         }
> >
> >         anon_name = anon_vma_name(vma);
> > @@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         case MADV_HUGEPAGE:
> >         case MADV_NOHUGEPAGE:
> > +       case MADV_COLLAPSE:
> >  #endif
> >         case MADV_DONTDUMP:
> >         case MADV_DODUMP:
> > @@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> >   *             transparent huge pages so the existing pages will not be
> >   *             coalesced into THP and new pages will not be allocated as THP.
> > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *             from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > --
> > 2.35.1.616.g0bdcbb4464-goog
> >
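
A minimal usage sketch of the interface introduced by the quoted patch
(not part of the series): MADV_COLLAPSE (25) is the value proposed above,
and, per the quoted comment, a transient allocation failure surfaces as
EAGAIN, which a caller may simply retry.

#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value proposed by this series */
#endif

/* Retry a few times on EAGAIN, which the patch uses for transient -ENOMEM. */
static int collapse_range(void *addr, size_t len)
{
	int tries;

	for (tries = 0; tries < 3; tries++) {
		if (!madvise(addr, len, MADV_COLLAPSE))
			return 0;
		if (errno != EAGAIN)
			break;
	}
	return -1;
}

int main(void)
{
	size_t len = 2UL << 20;	/* one pmd-sized range */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;	/* fault in at least one page */
	/* A real caller would also make sure the range is 2M-aligned. */
	if (collapse_range(p, len))
		perror("MADV_COLLAPSE");
	return 0;
}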



* Re: [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  2022-03-10  0:46     ` Zach O'Keefe
@ 2022-03-10  2:05       ` Yang Shi
  2022-03-10  8:37         ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-10  2:05 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 4:46 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Wed, Mar 9, 2022 at 3:40 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > SCAN_PAGE_COMPOUND if the pmd already maps a thp. This is consistent
> > > with handling when scanning file-backed memory.
> >
> > I'm not quite keen that we have to keep anon consistent with file for
> > this case. SCAN_PAGE_COMPOUND typically means the page is compound
> > page, but PTE mapped.
> >
>
> Good point.
>
> > And even SCAN_PMD_NULL is not returned every time when mm_find_pmd()
> > returns NULL. In addition, SCAN_PMD_NULL seems ambiguous to me. The
> > khugepaged actually sees non-present (migration) entry or trans huge
> > entry, so may rename it to SCAN_PMD_NOT_SUITABLE?
> >
>
> Sorry, I'm not sure I understand the suggestion here. What this patch
> would like to do, is to identify what pmds map thps. This will be
> important later, since if a user requests a collapse of an
> already-collapsed region, we want to return successfully (even if no
> work to be done).

Makes sense.

>
> Maybe there should be a SCAN_PMD_MAPPED used here instead? Just to not
> overload SCAN_PAGE_COMPOUND?

I see. SCAN_PMD_MAPPED sounds more self-explanatory and suitable IMHO.



>
> Though, note that when MADV_COLLAPSE supports file-backed memory, a
> similar check for pmd-mapping will need to be made on the file-side of
> things.
>
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 35 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index ecbd3fc41c80..403578161a3b 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1011,6 +1011,38 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >                         vm_flags_ignore, vmap);
> > >  }
> > >
> > > +/*
> > > + * If returning NULL (meaning the pmd isn't mapped, isn't present, or thp),
> > > + * write the reason to *result.
> > > + */
> > > +static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +                                     unsigned long address,
> > > +                                     int *result)
> > > +{
> > > +       pmd_t *pmd = mm_find_pmd_raw(mm, address);
> > > +       pmd_t pmde;
> > > +
> > > +       if (!pmd) {
> > > +               *result = SCAN_PMD_NULL;
> > > +               return NULL;
> > > +       }
> > > +
> > > +       pmde = pmd_read_atomic(pmd);
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +       barrier();
> > > +#endif
> > > > +       if (!pmd_present(pmde) || pmd_none(pmde)) {
> > > +               *result = SCAN_PMD_NULL;
> > > +               return NULL;
> > > +       } else if (pmd_trans_huge(pmde)) {
> > > +               *result = SCAN_PAGE_COMPOUND;
> > > +               return NULL;
> > > +       }
> > > +       return pmd;
> > > +}
> > > +
> > >  /*
> > >   * Bring missing pages in from swap, to complete THP collapse.
> > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > @@ -1212,9 +1244,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >                 goto out_nolock;
> > >         }
> > >
> > > -       pmd = mm_find_pmd(mm, address);
> > > +       pmd = find_pmd_or_thp_or_none(mm, address, &result);
> > >         if (!pmd) {
> > > -               result = SCAN_PMD_NULL;
> > >                 mmap_read_unlock(mm);
> > >                 goto out_nolock;
> > >         }
> > > @@ -1287,11 +1318,9 @@ static void scan_pmd(struct mm_struct *mm,
> > >         mmap_assert_locked(mm);
> > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > -       pmd = mm_find_pmd(mm, address);
> > > -       if (!pmd) {
> > > -               scan_result->result = SCAN_PMD_NULL;
> > > +       pmd = find_pmd_or_thp_or_none(mm, address, &scan_result->result);
> > > +       if (!pmd)
> > >                 goto out;
> > > -       }
> > >
> > >         memset(cc->node_load, 0, sizeof(cc->node_load));
> > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > --
> > > 2.35.1.616.g0bdcbb4464-goog
> > >



* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10  1:09         ` Zach O'Keefe
@ 2022-03-10  2:16           ` Yang Shi
  2022-03-10 15:50             ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-10  2:16 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > >
> > > > Could you please elaborate why this usecase is valid? Typically
> > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > >
> > >
> > > Hey Yang, thanks for taking time to review and comment.
> > >
> > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > the user to say "I don't want hugepages here", so that the kernel
> > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > backed by hugepages - so presumably that is exactly what they want.
> > >
> > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > the user wanted collapsed, but that had some sub-areas marked
> > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > operations around the excluded regions.
> > >
> > > In terms of use cases, I don't have a concrete example, but a user
> > > could hypothetically choose to exclude regions from management from
> > > khugepaged, but still be able to collapse the memory themselves,
> > > when/if they deem appropriate.
> >
> > I see. It seems you thought MADV_COLLAPSE actually unsets
> > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > right away, right? To some degree, it makes some sense.
>
> Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> ignores VM_NOHUGEPAGE, and so it's not really the same as
> MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> sure it does) then we don't want to be altering the vma flags since we
> don't want to touch khugepaged behavior.
>
> > If this is the
> > behavior you'd like to achieve, I'd suggest making it more explicit,
> > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > than ignore or change vm flags silently. When using madvise mode, but
> > not having VM_HUGEPAGE set, the vma check should fail in the current
> > code (I didn't look hard if you already covered this or not).
> >
>
> You're correct, this will fail, since it's following the same
> semantics as the fault path. I see what you're saying though; that
> perhaps this is inconsistent with my above reasoning that "the user
> asked to collapse this memory, and so we should do it". If so, then
> perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
> entirely for the purposes of eligibility, and only uses it for the
> purposes of determining gfp flags for compaction/reclaim. Pushing that
> further, compaction/reclaim could entirely be specified by the user
> using a process_madvise(2) flag (later in the series, we do something
> like this).

Anyway I think we could have two options for MADV_COLLAPSE:

1. Just treat it as a hint (nice to have, best effort). It should obey
all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
if madvise mode, etc.

2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
should set vma flags properly as I suggested.

Either is fine to me. But I don't prefer something in between personally.

>
>
> > >
> > > > >
> > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > which can be used to ignore vm flags used when considering thp
> > > > > eligibility.
> > > > >
> > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > ---
> > > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > >  #endif
> > > > >
> > > > >  /*
> > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > + * collapse context.
> > > > >   */
> > > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > >                                              unsigned long address, int nr,
> > > > > +                                            unsigned long vm_flags_ignore,
> > > > >                                              struct vm_area_struct **vmap)
> > > > >  {
> > > > >         struct vm_area_struct *vma;
> > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > >                 return SCAN_ADDRESS_RANGE;
> > > > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > >                 return SCAN_VMA_CHECK;
> > > > >         /* Anon VMA expected */
> > > > >         if (!vma->anon_vma || vma->vm_ops)
> > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > >   */
> > > > >
> > > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > +                                  unsigned long vm_flags_ignore,
> > > > >                                    struct vm_area_struct **vmap)
> > > > >  {
> > > > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > +                       vm_flags_ignore, vmap);
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > >                 if (ret & VM_FAULT_RETRY) {
> > > > >                         mmap_read_lock(mm);
> > > > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > >                                 /* vma is no longer available, don't continue to swapin */
> > > > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > > >                                 return false;
> > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > >
> > > > >         mmap_read_lock(mm);
> > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > >         if (result) {
> > > > >                 mmap_read_unlock(mm);
> > > > >                 goto out_nolock;
> > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > >          */
> > > > >         mmap_write_lock(mm);
> > > > >
> > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > >         if (result)
> > > > >                 goto out_up_write;
> > > > >         /* check if the pmd is still valid */
> > > > > --
> > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > >



* Re: [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  2022-03-10  2:05       ` Yang Shi
@ 2022-03-10  8:37         ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10  8:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 6:06 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 4:46 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 3:40 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > When scanning an anon pmd to see if it's eligible for collapse, return
> > > > SCAN_PAGE_COMPOUND if the pmd already maps a thp. This is consistent
> > > > with handling when scanning file-backed memory.
> > >
> > > I'm not quite keen that we have to keep anon consistent with file for
> > > this case. SCAN_PAGE_COMPOUND typically means the page is compound
> > > page, but PTE mapped.
> > >
> >
> > Good point.
> >
> > > And even SCAN_PMD_NULL is not returned every time when mm_find_pmd()
> > > returns NULL. In addition, SCAN_PMD_NULL seems ambiguous to me. The
> > > khugepaged actually sees non-present (migration) entry or trans huge
> > > entry, so may rename it to SCAN_PMD_NOT_SUITABLE?
> > >
> >
> > Sorry, I'm not sure I understand the suggestion here. What this patch
> > would like to do, is to identify what pmds map thps. This will be
> > important later, since if a user requests a collapse of an
> > already-collapsed region, we want to return successfully (even if no
> > work to be done).
>
> Makes sense.
>
> >
> > Maybe there should be a SCAN_PMD_MAPPED used here instead? Just to not
> > overload SCAN_PAGE_COMPOUND?
>
> I see. SCAN_PMD_MAPPED sounds more self-explanatory and suitable IMHO.
>
>

Makes sense not to conflate the two. Thanks for the feedback!

>
> >
> > Though, note that when MADV_COLLAPSE supports file-backed memory, a
> > similar check for pmd-mapping will need to be made on the file-side of
> > things.
> >
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++++++------
> > > >  1 file changed, 35 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index ecbd3fc41c80..403578161a3b 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1011,6 +1011,38 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > >                         vm_flags_ignore, vmap);
> > > >  }
> > > >
> > > > +/*
> > > > + * If returning NULL (meaning the pmd isn't mapped, isn't present, or thp),
> > > > + * write the reason to *result.
> > > > + */
> > > > +static pmd_t *find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > > +                                     unsigned long address,
> > > > +                                     int *result)
> > > > +{
> > > > +       pmd_t *pmd = mm_find_pmd_raw(mm, address);
> > > > +       pmd_t pmde;
> > > > +
> > > > +       if (!pmd) {
> > > > +               *result = SCAN_PMD_NULL;
> > > > +               return NULL;
> > > > +       }
> > > > +
> > > > +       pmde = pmd_read_atomic(pmd);
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > > +       barrier();
> > > > +#endif
> > > > > +       if (!pmd_present(pmde) || pmd_none(pmde)) {
> > > > +               *result = SCAN_PMD_NULL;
> > > > +               return NULL;
> > > > +       } else if (pmd_trans_huge(pmde)) {
> > > > +               *result = SCAN_PAGE_COMPOUND;
> > > > +               return NULL;
> > > > +       }
> > > > +       return pmd;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Bring missing pages in from swap, to complete THP collapse.
> > > >   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> > > > @@ -1212,9 +1244,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >                 goto out_nolock;
> > > >         }
> > > >
> > > > -       pmd = mm_find_pmd(mm, address);
> > > > +       pmd = find_pmd_or_thp_or_none(mm, address, &result);
> > > >         if (!pmd) {
> > > > -               result = SCAN_PMD_NULL;
> > > >                 mmap_read_unlock(mm);
> > > >                 goto out_nolock;
> > > >         }
> > > > @@ -1287,11 +1318,9 @@ static void scan_pmd(struct mm_struct *mm,
> > > >         mmap_assert_locked(mm);
> > > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > >
> > > > -       pmd = mm_find_pmd(mm, address);
> > > > -       if (!pmd) {
> > > > -               scan_result->result = SCAN_PMD_NULL;
> > > > +       pmd = find_pmd_or_thp_or_none(mm, address, &scan_result->result);
> > > > +       if (!pmd)
> > > >                 goto out;
> > > > -       }
> > > >
> > > >         memset(cc->node_load, 0, sizeof(cc->node_load));
> > > >         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > > --
> > > > 2.35.1.616.g0bdcbb4464-goog
> > > >



* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10  2:16           ` Yang Shi
@ 2022-03-10 15:50             ` Zach O'Keefe
  2022-03-10 18:17               ` Yang Shi
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10 15:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Wed, Mar 9, 2022 at 6:16 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > >
> > > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > > >
> > > > > Could you please elaborate why this usecase is valid? Typically
> > > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > > >
> > > >
> > > > Hey Yang, thanks for taking time to review and comment.
> > > >
> > > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > > the user to say "I don't want hugepages here", so that the kernel
> > > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > > backed by hugepages - so presumably that is exactly what they want.
> > > >
> > > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > > the user wanted collapsed, but that had some sub-areas marked
> > > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > > operations around the excluded regions.
> > > >
> > > > In terms of use cases, I don't have a concrete example, but a user
> > > > could hypothetically choose to exclude regions from management from
> > > > khugepaged, but still be able to collapse the memory themselves,
> > > > when/if they deem appropriate.
> > >
> > > I see. It seems you thought MADV_COLLAPSE actually unsets
> > > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > > right away, right? To some degree, it makes some sense.
> >
> > Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> > ignores VM_NOHUGEPAGE, and so it's not really the same as
> > MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> > to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> > sure it does) then we don't want to be altering the vma flags since we
> > don't want to touch khugepaged behavior.
> >
> > > If this is the
> > > behavior you'd like to achieve, I'd suggest making it more explicit,
> > > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > > than ignore or change vm flags silently. When using madvise mode, but
> > > not having VM_HUGEPAGE set, the vma check should fail in the current
> > > code (I didn't look hard if you already covered this or not).
> > >
> >
> > You're correct, this will fail, since it's following the same
> > semantics as the fault path. I see what you're saying though; that
> > perhaps this is inconsistent with my above reasoning that "the user
> > asked to collapse this memory, and so we should do it". If so, then
> > perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
> > entirely for the purposes of eligibility, and only uses it for the
> > purposes of determining gfp flags for compaction/reclaim. Pushing that
> > further, compaction/reclaim could entirely be specified by the user
> > using a process_madvise(2) flag (later in the series, we do something
> > like this).
>
> Anyway I think we could have two options for MADV_COLLAPSE:
>
> 1. Just treat it as a hint (nice to have, best effort). It should obey
> all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
> if madvise mode, etc.
>
> 2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
> should set vma flags properly as I suggested.
>
> Either is fine to me. But I don't prefer something in between personally.
>

Makes sense to be consistent. Of these, #1 seems the most
straightforward to use. Doing an MADV_COLLAPSE on a VM_NOHUGEPAGE vma
seems like a corner case. The more likely scenario is MADV_COLLAPSE on
an unflagged (neither VM_HUGEPAGE nor VM_NOHUGEPAGE) vma - in which
case it's less intrusive to not additionally set VM_HUGEPAGE (though
the user can always do so if they wish). It's a little more consistent
with "always" mode, where MADV_HUGEPAGE isn't necessary for
eligibility. It'll also reduce some code complexity.

I'll float one last option your way, however:

3. The collapsed region is always eligible, regardless of vma flags or
thp settings (except "never"?). For process_madvise(2), a flag will
explicitly specify defrag semantics.

This separates "async-hint" vs "sync-explicit" madvise requests.
MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
the kernel how to treat memory in the future. The kernel uses
VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
request, is free to define its own defrag semantics.

This would allow flexibility to separately define async vs sync thp
policies. For example, highly tuned userspace applications that are
sensitive to unexpected latency might want to manage their hugepages
utilization themselves, and ask khugepaged to stay away. There is no
way in "always" mode to do this without setting VM_NOHUGEPAGE.
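
For reference, the call shape such a process_madvise(2)-based interface
would take, sketched against the existing process_madvise(2) syscall.
Today's kernels reject MADV_COLLAPSE here (so this fails with EINVAL),
the per-call defrag/limits flags are only proposals in this thread (flags
stays 0), the headers must define SYS_process_madvise and SYS_pidfd_open,
and the pid, address and helper name below are placeholders:

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value proposed by this series */
#endif

/* Ask the kernel to collapse a range in another process via pidfd. */
static int collapse_in_target(pid_t pid, void *remote_addr, size_t len)
{
	struct iovec iov = { .iov_base = remote_addr, .iov_len = len };
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	long ret;

	if (pidfd < 0)
		return -1;
	/* flags == 0: the collapse-specific flags are not upstream. */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
	close(pidfd);
	return ret == (long)len ? 0 : -1;
}

int main(void)
{
	/* Hypothetical target; pid and address are placeholders only. */
	if (collapse_in_target(1234, (void *)0x7f0000000000UL, 2UL << 20))
		perror("process_madvise(MADV_COLLAPSE)");
	return 0;
}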

> >
> >
> > > >
> > > > > >
> > > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > > which can be used to ignore vm flags used when considering thp
> > > > > > eligibility.
> > > > > >
> > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > > ---
> > > > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > > >  #endif
> > > > > >
> > > > > >  /*
> > > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > > + * collapse context.
> > > > > >   */
> > > > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > >                                              unsigned long address, int nr,
> > > > > > +                                            unsigned long vm_flags_ignore,
> > > > > >                                              struct vm_area_struct **vmap)
> > > > > >  {
> > > > > >         struct vm_area_struct *vma;
> > > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > > >                 return SCAN_ADDRESS_RANGE;
> > > > > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > > >                 return SCAN_VMA_CHECK;
> > > > > >         /* Anon VMA expected */
> > > > > >         if (!vma->anon_vma || vma->vm_ops)
> > > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > >   */
> > > > > >
> > > > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > > +                                  unsigned long vm_flags_ignore,
> > > > > >                                    struct vm_area_struct **vmap)
> > > > > >  {
> > > > > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > > +                       vm_flags_ignore, vmap);
> > > > > >  }
> > > > > >
> > > > > >  /*
> > > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > > >                 if (ret & VM_FAULT_RETRY) {
> > > > > >                         mmap_read_lock(mm);
> > > > > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > > >                                 /* vma is no longer available, don't continue to swapin */
> > > > > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > > > >                                 return false;
> > > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > > >
> > > > > >         mmap_read_lock(mm);
> > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > >         if (result) {
> > > > > >                 mmap_read_unlock(mm);
> > > > > >                 goto out_nolock;
> > > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > >          */
> > > > > >         mmap_write_lock(mm);
> > > > > >
> > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > >         if (result)
> > > > > >                 goto out_up_write;
> > > > > >         /* check if the pmd is still valid */
> > > > > > --
> > > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > > >
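
The core of the hunk quoted above is just flag masking ahead of the
eligibility check; a tiny standalone sketch, using simplified stand-in
values rather than the kernel's definitions:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in flag values for illustration; not the kernel's definitions. */
#define VM_NONE         0x0UL
#define VM_NOHUGEPAGE   0x1UL

/* Toy stand-in for hugepage_vma_check(): refuse VM_NOHUGEPAGE regions. */
static bool eligible(unsigned long vm_flags)
{
        return !(vm_flags & VM_NOHUGEPAGE);
}

int main(void)
{
        unsigned long vm_flags = VM_NOHUGEPAGE; /* an MADV_NOHUGEPAGE'd vma */

        /* khugepaged path: nothing ignored */
        printf("khugepaged:    %d\n", eligible(vm_flags & ~VM_NONE));
        /* madvise collapse path in this patch: VM_NOHUGEPAGE masked out */
        printf("MADV_COLLAPSE: %d\n", eligible(vm_flags & ~VM_NOHUGEPAGE));
        return 0;
}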


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
  2022-03-09 23:17   ` Yang Shi
@ 2022-03-10 15:56   ` David Hildenbrand
  2022-03-10 18:39     ` Zach O'Keefe
  2022-03-10 18:54     ` David Rientjes
  1 sibling, 2 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-10 15:56 UTC (permalink / raw)
  To: Zach O'Keefe, Alex Shi, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi

On 08.03.22 22:34, Zach O'Keefe wrote:
> In madvise collapse context, we optionally want to be able to ignore
> advice from MADV_NOHUGEPAGE-marked regions.
> 
> Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> which can be used to ignore vm flags used when considering thp
> eligibility.

arch/s390/mm/gmap.c:thp_split_mm() sets VM_NOHUGEPAGE to make sure there
are *really* no thp. Being able to bypass that would break KVM horribly.

Ignoring MADV_NOHUGEPAGE/VM_NOHUGEPAGE feels like the wrong way to go.


What about a prctl instead, to disable any khugepaged activity and just
let that process control it manually?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 15:50             ` Zach O'Keefe
@ 2022-03-10 18:17               ` Yang Shi
  2022-03-10 18:46                 ` David Rientjes
  2022-03-10 18:53                 ` Zach O'Keefe
  0 siblings, 2 replies; 57+ messages in thread
From: Yang Shi @ 2022-03-10 18:17 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 7:51 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Wed, Mar 9, 2022 at 6:16 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > >
> > > > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > > >
> > > > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > > > >
> > > > > > Could you please elaborate why this usecase is valid? Typically
> > > > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > > > >
> > > > >
> > > > > Hey Yang, thanks for taking time to review and comment.
> > > > >
> > > > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > > > the user to say "I don't want hugepages here", so that the kernel
> > > > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > > > backed by hugepages - so presumably that is exactly what they want.
> > > > >
> > > > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > > > the user wanted collapsed, but that had some sub-areas marked
> > > > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > > > operations around the excluded regions.
> > > > >
> > > > > In terms of use cases, I don't have a concrete example, but a user
> > > > > could hypothetically choose to exclude regions from management from
> > > > > khugepaged, but still be able to collapse the memory themselves,
> > > > > when/if they deem appropriate.
> > > >
> > > > I see. It seems you thought MADV_COLLAPSE actually unsets
> > > > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > > > right away, right? To some degree, it makes some sense.
> > >
> > > Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> > > ignores VM_NOHUGEPAGE, and so it's not really the same as
> > > MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> > > to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> > > sure it does) then we don't want to be altering the vma flags since we
> > > don't want to touch khugepaged behavior.
> > >
> > > > If this is the
> > > > behavior you'd like to achieve, I'd suggest making it more explicit,
> > > > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > > > than ignore or change vm flags silently. When using madvise mode, but
> > > > not having VM_HUGEPAGE set, the vma check should fail in the current
> > > > code (I didn't look hard if you already covered this or not).
> > > >
> > >
> > > You're correct, this will fail, since it's following the same
> > > semantics as the fault path. I see what you're saying though; that
> > > perhaps this is inconsistent with my above reasoning that "the user
> > > asked to collapse this memory, and so we should do it". If so, then
> > > perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
> > > entirely for the purposes of eligibility, and only uses it for the
> > > purposes of determining gfp flags for compaction/reclaim. Pushing that
> > > further, compaction/reclaim could entirely be specified by the user
> > > using a process_madvise(2) flag (later in the series, we do something
> > > like this).
> >
> > Anyway I think we could have two options for MADV_COLLAPSE:
> >
> > 1. Just treat it as a hint (nice to have, best effort). It should obey
> > all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
> > if madvise mode, etc.
> >
> > 2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
> > should set vma flags properly as I suggested.
> >
> > Either is fine to me. But I don't prefer something in between personally.
> >
>
> Makes sense to be consistent. Of these, #1 seems the most
> straightforward to use. Doing an MADV_COLLAPSE on a VM_NOHUGEPAGE vma
> seems like a corner case. The more likely scenario is MADV_COLLAPSE on
> an unflagged (neither VM_HUGEPAGE or VM_NOHUGEPAGE) vma - in which
> case it's less intrusive to not additionally set VM_HUGEPAGE (though
> the user can always do so if they wish). It's a little more consistent
> with "always" mode, where MADV_HUGEPAGE isn't necessary for
> eligibility. It'll also reduce some code complexity.
>
> I'll float one last option your way, however:
>
> 3. The collapsed region is always eligible, regardless of vma flags or
> thp settings (except "never"?). For process_madvise(2), a flag will
> explicitly specify defrag semantics.

This is what I meant for #2 IIUC. Defrag could follow the system's
defrag setting rather than khugepaged's.

But it may break s390 as David pointed out.

>
> This separates "async-hint" vs "sync-explicit" madvise requests.
> MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> the kernel how to treat memory in the future. The kernel uses
> VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> request, is free to define its own defrag semantics.
>
> This would allow flexibility to separately define async vs sync thp
> policies. For example, highly tuned userspace applications that are
> sensitive to unexpected latency might want to manage their hugepages
> utilization themselves, and ask khugepaged to stay away. There is no
> way in "always" mode to do this without setting VM_NOHUGEPAGE.

I don't quite get why you would set THP to always but not want
khugepaged to do its job. It may be slow; I think this is why you
introduced MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
scan the same area, it just doesn't do any real work and wastes some
cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
being split, right? So khugepaged still plays a role to re-collapse
the area without calling MADV_COLLAPSE again and again.

>
> > >
> > >
> > > > >
> > > > > > >
> > > > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > > > which can be used to ignore vm flags used when considering thp
> > > > > > > eligibility.
> > > > > > >
> > > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > > > ---
> > > > > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > > > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > > > >  #endif
> > > > > > >
> > > > > > >  /*
> > > > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > > > + * collapse context.
> > > > > > >   */
> > > > > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > >                                              unsigned long address, int nr,
> > > > > > > +                                            unsigned long vm_flags_ignore,
> > > > > > >                                              struct vm_area_struct **vmap)
> > > > > > >  {
> > > > > > >         struct vm_area_struct *vma;
> > > > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > > > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > > > >                 return SCAN_ADDRESS_RANGE;
> > > > > > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > > > >                 return SCAN_VMA_CHECK;
> > > > > > >         /* Anon VMA expected */
> > > > > > >         if (!vma->anon_vma || vma->vm_ops)
> > > > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > >   */
> > > > > > >
> > > > > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > > > +                                  unsigned long vm_flags_ignore,
> > > > > > >                                    struct vm_area_struct **vmap)
> > > > > > >  {
> > > > > > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > > > +                       vm_flags_ignore, vmap);
> > > > > > >  }
> > > > > > >
> > > > > > >  /*
> > > > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > > > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > > > >                 if (ret & VM_FAULT_RETRY) {
> > > > > > >                         mmap_read_lock(mm);
> > > > > > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > > > >                                 /* vma is no longer available, don't continue to swapin */
> > > > > > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > > > > >                                 return false;
> > > > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > > > >
> > > > > > >         mmap_read_lock(mm);
> > > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > >         if (result) {
> > > > > > >                 mmap_read_unlock(mm);
> > > > > > >                 goto out_nolock;
> > > > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > >          */
> > > > > > >         mmap_write_lock(mm);
> > > > > > >
> > > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > >         if (result)
> > > > > > >                 goto out_up_write;
> > > > > > >         /* check if the pmd is still valid */
> > > > > > > --
> > > > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > > > >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 15:56   ` David Hildenbrand
@ 2022-03-10 18:39     ` Zach O'Keefe
  2022-03-10 18:54     ` David Rientjes
  1 sibling, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10 18:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Shi, David Rientjes, Michal Hocko, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi

On Thu, Mar 10, 2022 at 7:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.03.22 22:34, Zach O'Keefe wrote:
> > In madvise collapse context, we optionally want to be able to ignore
> > advice from MADV_NOHUGEPAGE-marked regions.
> >
> > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > which can be used to ignore vm flags used when considering thp
> > eligibility.
>
> arch/s390/mm/gmap.c:thp_split_mm() sets VM_NOHUGEPAGE to make sure there
> are *really* no thp. Being able to bypass that would break KVM horribly.
>
> Ignoring MADV_NOHUGEPAGE/VM_NOHUGEPAGE feels like the wrong way to go.
>
>
> What about a prctl instead, to disable any khugepaged activity and just
> let that process control it manually?
>
> --
> Thanks,
>
> David / dhildenb
>

Hey David - thanks for reviewing and commenting, and certainly thank
you for finding this bug.

This seems like good motivation for not ignoring VM_NOHUGEPAGE.
arch/powerpc also uses this flag for its own purposes (though it's
user-directed in that particular case).

prctl(2) sounds very reasonable to me - thanks for the suggestion!

Thanks,
Zach


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 18:17               ` Yang Shi
@ 2022-03-10 18:46                 ` David Rientjes
  2022-03-10 18:58                   ` Zach O'Keefe
  2022-03-10 19:54                   ` Yang Shi
  2022-03-10 18:53                 ` Zach O'Keefe
  1 sibling, 2 replies; 57+ messages in thread
From: David Rientjes @ 2022-03-10 18:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, 10 Mar 2022, Yang Shi wrote:

> > This separates "async-hint" vs "sync-explicit" madvise requests.
> > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > the kernel how to treat memory in the future. The kernel uses
> > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > request, is free to define its own defrag semantics.
> >
> > This would allow flexibility to separately define async vs sync thp
> > policies. For example, highly tuned userspace applications that are
> > sensitive to unexpected latency might want to manage their hugepages
> > utilization themselves, and ask khugepaged to stay away. There is no
> > way in "always" mode to do this without setting VM_NOHUGEPAGE.
> 
> I don't quite get why you set THP to always but don't want to
> khugepaged do its job. It may be slow, I think this is why you
> introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> scan the same area, it just doesn't do any real work and waste some
> cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> being split, right? So khugepaged still plays a role to re-collapse
> the area without calling MADV_COLLAPSE over again and again.
> 

My only real concern for MADV_COLLAPSE was when the span being collapsed 
includes a mixture of both VM_HUGEPAGE and VM_NOHUGEPAGE.  Does this 
collapse over the eligible memory or does it fail entirely?

I'd think it was the former, that we should respect VM_NOHUGEPAGE and only 
collapse eligible memory when doing MADV_COLLAPSE but now userspace 
struggles to know whether it was a partial collapse because of 
ineligibility or because we just couldn't allocate a hugepage.

It has the information to figure this out on its own, so given the use of 
VM_NOHUGEPAGE for non-MADV_NOHUGEPAGE purposes, I think it makes sense to 
simply ignore these vmas as part of the collapse request.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 18:17               ` Yang Shi
  2022-03-10 18:46                 ` David Rientjes
@ 2022-03-10 18:53                 ` Zach O'Keefe
  1 sibling, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10 18:53 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 10:17 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 7:51 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 6:16 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 5:10 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > On Wed, Mar 9, 2022 at 4:41 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Wed, Mar 9, 2022 at 4:01 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > >
> > > > > > > On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe <zokeefe@google.com> wrote:
> > > > > > > >
> > > > > > > > In madvise collapse context, we optionally want to be able to ignore
> > > > > > > > advice from MADV_NOHUGEPAGE-marked regions.
> > > > > > >
> > > > > > > Could you please elaborate why this usecase is valid? Typically
> > > > > > > MADV_NOHUGEPAGE is set when the users really don't want to have THP
> > > > > > > for this area. So it doesn't make too much sense to ignore it IMHO.
> > > > > > >
> > > > > >
> > > > > > Hey Yang, thanks for taking time to review and comment.
> > > > > >
> > > > > > Semantically, the way I see it, is that MADV_NOHUGEPAGE is a way for
> > > > > > the user to say "I don't want hugepages here", so that the kernel
> > > > > > knows not to do so when faulting memory, and khugepaged can stay away.
> > > > > > However, in MADV_COLLAPSE, the user is explicitly requesting this be
> > > > > > backed by hugepages - so presumably that is exactly what they want.
> > > > > >
> > > > > > IOW, if the user didn't want this memory to be backed by hugepages,
> > > > > > they wouldn't be MADV_COLLAPSE'ing it. If there was a range of memory
> > > > > > the user wanted collapsed, but that had some sub-areas marked
> > > > > > MADV_NOHUGEPAGE, they could always issue multiple MADV_COLLAPSE
> > > > > > operations around the excluded regions.
> > > > > >
> > > > > > In terms of use cases, I don't have a concrete example, but a user
> > > > > > could hypothetically choose to exclude regions from management from
> > > > > > khugepaged, but still be able to collapse the memory themselves,
> > > > > > when/if they deem appropriate.
> > > > >
> > > > > I see. It seems you thought MADV_COLLAPSE actually unsets
> > > > > VM_NOHUGEPAGE, and is kind of equal to MADV_HUGEPAGE + doing collapse
> > > > > right away, right? To some degree, it makes some sense.
> > > >
> > > > Currently, MADV_COLLAPSE doesn't alter the vma flags at all - it just
> > > > ignores VM_NOHUGEPAGE, and so it's not really the same as
> > > > MADV_HUGEPAGE + MADV_COLLAPSE (which would set VM_HUGEPAGE in addition
> > > > to clearing VM_NOHUGEPAGE). If my use case has any merit (and I'm not
> > > > sure it does) then we don't want to be altering the vma flags since we
> > > > don't want to touch khugepaged behavior.
> > > >
> > > > > If this is the
> > > > > behavior you'd like to achieve, I'd suggest making it more explicit,
> > > > > for example, setting VM_HUGEPAGE for the MADV_COLLAPSE area rather
> > > > > than ignore or change vm flags silently. When using madvise mode, but
> > > > > not having VM_HUGEPAGE set, the vma check should fail in the current
> > > > > code (I didn't look hard if you already covered this or not).
> > > > >
> > > >
> > > > You're correct, this will fail, since it's following the same
> > > > semantics as the fault path. I see what you're saying though; that
> > > > perhaps this is inconsistent with my above reasoning that "the user
> > > > asked to collapse this memory, and so we should do it". If so, then
> > > > perhaps MADV_COLLAPSE just ignores madvise mode and VM_[NO]HUGEPAGE
> > > > entirely for the purposes of eligibility, and only uses it for the
> > > > purposes of determining gfp flags for compaction/reclaim. Pushing that
> > > > further, compaction/reclaim could entirely be specified by the user
> > > > using a process_madvise(2) flag (later in the series, we do something
> > > > like this).
> > >
> > > Anyway I think we could have two options for MADV_COLLAPSE:
> > >
> > > 1. Just treat it as a hint (nice to have, best effort). It should obey
> > > all the settings. Skip VM_NOHUGEPAGE vmas or vmas without VM_HUGEPAGE
> > > if madvise mode, etc.
> > >
> > > 2. Much stronger. It equals MADV_HUGEPAGE + synchronous collapse. It
> > > should set vma flags properly as I suggested.
> > >
> > > Either is fine to me. But I don't prefer something in between personally.
> > >
> >
> > Makes sense to be consistent. Of these, #1 seems the most
> > straightforward to use. Doing an MADV_COLLAPSE on a VM_NOHUGEPAGE vma
> > seems like a corner case. The more likely scenario is MADV_COLLAPSE on
> > an unflagged (neither VM_HUGEPAGE or VM_NOHUGEPAGE) vma - in which
> > case it's less intrusive to not additionally set VM_HUGEPAGE (though
> > the user can always do so if they wish). It's a little more consistent
> > with "always" mode, where MADV_HUGEPAGE isn't necessary for
> > eligibility. It'll also reduce some code complexity.
> >
> > I'll float one last option your way, however:
> >
> > 3. The collapsed region is always eligible, regardless of vma flags or
> > thp settings (except "never"?). For process_madvise(2), a flag will
> > explicitly specify defrag semantics.
>
> This is what I meant for #2 IIUC. Defrag could follow the system's
> defrag setting rather than the khugepaged's.
>
> But it may break s390 as David pointed out.
>
> >
> > This separates "async-hint" vs "sync-explicit" madvise requests.
> > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > the kernel how to treat memory in the future. The kernel uses
> > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > request, is free to define its own defrag semantics.
> >
> > This would allow flexibility to separately define async vs sync thp
> > policies. For example, highly tuned userspace applications that are
> > sensitive to unexpected latency might want to manage their hugepages
> > utilization themselves, and ask khugepaged to stay away. There is no
> > way in "always" mode to do this without setting VM_NOHUGEPAGE.
>
> I don't quite get why you set THP to always but don't want to
> khugepaged do its job. It may be slow, I think this is why you
> introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> scan the same area, it just doesn't do any real work and waste some
> cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> being split, right? So khugepaged still plays a role to re-collapse
> the area without calling MADV_COLLAPSE over again and again.
>

Ya, I agree that the common case is that, if you are MADV_COLLAPSE'ing
memory, chances are you just want that memory backed by hugepages -
and so if that area were to be split, presumably we'd want khugepaged
to come and recollapse when possible.

I think the (possibly contrived) use case I was thinking about was a
program that (a) didn't have the ability to change thp settings
("always"), and (b) wanted to manage its own hugepages. If a concrete
use case like this did arise, I think David H.'s suggestion of using
prctl(2) would work.

> >
> > > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > > > > > > which can be used to ignore vm flags used when considering thp
> > > > > > > > eligibility.
> > > > > > > >
> > > > > > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > > > > > ---
> > > > > > > >  mm/khugepaged.c | 18 ++++++++++++------
> > > > > > > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 1d20be47bcea..ecbd3fc41c80 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -964,10 +964,14 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > >  /*
> > > > > > > > - * Revalidate a vma's eligibility to collapse nr hugepages.
> > > > > > > > + * Revalidate a vma's eligibility to collapse nr hugepages. vm_flags_ignore
> > > > > > > > + * can be used to ignore certain vma_flags that would otherwise be checked -
> > > > > > > > + * the principal example being VM_NOHUGEPAGE which is ignored in madvise
> > > > > > > > + * collapse context.
> > > > > > > >   */
> > > > > > > >  static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >                                              unsigned long address, int nr,
> > > > > > > > +                                            unsigned long vm_flags_ignore,
> > > > > > > >                                              struct vm_area_struct **vmap)
> > > > > > > >  {
> > > > > > > >         struct vm_area_struct *vma;
> > > > > > > > @@ -986,7 +990,7 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >         hend = vma->vm_end & HPAGE_PMD_MASK;
> > > > > > > >         if (address < hstart || (address + nr * HPAGE_PMD_SIZE) > hend)
> > > > > > > >                 return SCAN_ADDRESS_RANGE;
> > > > > > > > -       if (!hugepage_vma_check(vma, vma->vm_flags))
> > > > > > > > +       if (!hugepage_vma_check(vma, vma->vm_flags & ~vm_flags_ignore))
> > > > > > > >                 return SCAN_VMA_CHECK;
> > > > > > > >         /* Anon VMA expected */
> > > > > > > >         if (!vma->anon_vma || vma->vm_ops)
> > > > > > > > @@ -1000,9 +1004,11 @@ static int hugepage_vma_revalidate_pmd_count(struct mm_struct *mm,
> > > > > > > >   */
> > > > > > > >
> > > > > > > >  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > > > > > > +                                  unsigned long vm_flags_ignore,
> > > > > > > >                                    struct vm_area_struct **vmap)
> > > > > > > >  {
> > > > > > > > -       return hugepage_vma_revalidate_pmd_count(mm, address, 1, vmap);
> > > > > > > > +       return hugepage_vma_revalidate_pmd_count(mm, address, 1,
> > > > > > > > +                       vm_flags_ignore, vmap);
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  /*
> > > > > > > > @@ -1043,7 +1049,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > > > > > > >                 /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > > > > > > >                 if (ret & VM_FAULT_RETRY) {
> > > > > > > >                         mmap_read_lock(mm);
> > > > > > > > -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > > > > > > > +                       if (hugepage_vma_revalidate(mm, haddr, VM_NONE, &vma)) {
> > > > > > > >                                 /* vma is no longer available, don't continue to swapin */
> > > > > > > >                                 trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > > > > > >                                 return false;
> > > > > > > > @@ -1200,7 +1206,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > > >         count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
> > > > > > > >
> > > > > > > >         mmap_read_lock(mm);
> > > > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > > >         if (result) {
> > > > > > > >                 mmap_read_unlock(mm);
> > > > > > > >                 goto out_nolock;
> > > > > > > > @@ -1232,7 +1238,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > > > > > >          */
> > > > > > > >         mmap_write_lock(mm);
> > > > > > > >
> > > > > > > > -       result = hugepage_vma_revalidate(mm, address, &vma);
> > > > > > > > +       result = hugepage_vma_revalidate(mm, address, VM_NONE, &vma);
> > > > > > > >         if (result)
> > > > > > > >                 goto out_up_write;
> > > > > > > >         /* check if the pmd is still valid */
> > > > > > > > --
> > > > > > > > 2.35.1.616.g0bdcbb4464-goog
> > > > > > > >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 15:56   ` David Hildenbrand
  2022-03-10 18:39     ` Zach O'Keefe
@ 2022-03-10 18:54     ` David Rientjes
  2022-03-21 14:27       ` Michal Hocko
  1 sibling, 1 reply; 57+ messages in thread
From: David Rientjes @ 2022-03-10 18:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Zach O'Keefe, Alex Shi, Michal Hocko, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi

On Thu, 10 Mar 2022, David Hildenbrand wrote:

> On 08.03.22 22:34, Zach O'Keefe wrote:
> > In madvise collapse context, we optionally want to be able to ignore
> > advice from MADV_NOHUGEPAGE-marked regions.
> > 
> > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > which can be used to ignore vm flags used when considering thp
> > eligibility.
> 
> arch/s390/mm/gmap.c:thp_split_mm() sets VM_NOHUGEPAGE to make sure there
> are *really* no thp. Being able to bypass that would break KVM horribly.
> 
> Ignoring MADV_NOHUGEPAGE/VM_NOHUGEPAGE feels like the wrong way to go.
> 

Agreed, we'll have to remove this possibility.

> What about a prctl instead, to disable any khugepaged activity and just
> let that process control it manually?
> 

No objection to the prctl, although it's unfortunate that the existing 
PR_SET_THP_DISABLE simply disables thp for the process entirely for any 
non-zero value and that this wasn't implemented as a bitmask to specify 
future behavior where this new behavior could be defined :/

I'll note, however, that we'd have no immediate use case ourselves for the 
prctl, although others may.  Our approach will likely be to disable 
khugepaged entirely in favor of outsourcing hugepage policy decisions to 
userspace based on a number of different signals.  (In fact, we'd also set
thp enabled = madvise system wide.)
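
For reference, the existing knob is all-or-nothing: any non-zero arg2
disables THP for the whole process. A minimal usage sketch (the fallback
defines are only for older userspace headers):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE      41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE      42
#endif

int main(void)
{
        /* Any non-zero arg2 disables THP process-wide; there is no bitmask. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                perror("PR_SET_THP_DISABLE");

        printf("thp disabled: %d\n", prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
        return 0;
}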


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 18:46                 ` David Rientjes
@ 2022-03-10 18:58                   ` Zach O'Keefe
  2022-03-10 19:54                   ` Yang Shi
  1 sibling, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10 18:58 UTC (permalink / raw)
  To: David Rientjes, Yang Shi
  Cc: Alex Shi, David Hildenbrand, Michal Hocko, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, Linux MM,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 10:46 AM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 10 Mar 2022, Yang Shi wrote:
>
> > > This separates "async-hint" vs "sync-explicit" madvise requests.
> > > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > > the kernel how to treat memory in the future. The kernel uses
> > > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > > request, is free to define its own defrag semantics.
> > >
> > > This would allow flexibility to separately define async vs sync thp
> > > policies. For example, highly tuned userspace applications that are
> > > sensitive to unexpected latency might want to manage their hugepages
> > > utilization themselves, and ask khugepaged to stay away. There is no
> > > way in "always" mode to do this without setting VM_NOHUGEPAGE.
> >
> > I don't quite get why you set THP to always but don't want to
> > khugepaged do its job. It may be slow, I think this is why you
> > introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> > scan the same area, it just doesn't do any real work and waste some
> > cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> > being split, right? So khugepaged still plays a role to re-collapse
> > the area without calling MADV_COLLAPSE over again and again.
> >
>
> My only real concern for MADV_COLLAPSE was when the span being collapsed
> includes a mixture of both VM_HUGEPAGE and VM_NOHUGEPAGE.  Does this
> collapse over the eligible memory or does it fail entirely?
>
> I'd think it was the former, that we should respect VM_NOHUGEPAGE and only
> collapse eligible memory when doing MADV_COLLAPSE but now userspace
> struggles to know whether it was a partial collapse because of
> ineligibility or because we just couldn't allocate a hugepage.
>
> It has the information to figure this out on its own, so given the use of
> VM_NOHUGEPAGE for non-MADV_NOHUGEPAGE purposes, I think it makes sense to
> simply ignore these vmas as part of the collapse request.

Ignoring these vmas SGTM. Thanks All.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-10  0:06   ` Yang Shi
@ 2022-03-10 19:26     ` David Rientjes
  2022-03-10 20:16       ` Matthew Wilcox
  0 siblings, 1 reply; 57+ messages in thread
From: David Rientjes @ 2022-03-10 19:26 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer

On Wed, 9 Mar 2022, Yang Shi wrote:

> > Introduce the main madvise collapse batched logic, including the overall
> > locking strategy.  Individual batched actions, such as scanning pmds
> > in batch, are stubbed out for now and will be added later in the
> > series.
> >
> > Note the main benefit from doing all this work in a batched manner is
> > that __madvise__collapse_pmd_batch() (stubbed out) can be called inside
> > a single mmap_lock write.
> 
> I don't get why this is preferred. Isn't it preferable to minimize
> the scope of the write mmap_lock? Assuming you batch a large number of PMDs,
> MADV_COLLAPSE may hold the write mmap_lock for a long time; it doesn't
> seem like that could scale.
> 

One concern might be the queueing of read locks needed for page faults 
behind a collapser of a long range of memory that is otherwise looping 
and repeatedly taking the write lock.

For minimal impact on concurrent page faults, which I think we should 
be optimizing for, I don't know the answer without data.  Do you have any 
ideas, as a general rule of thumb, for what would be optimal here between 
collapsing one page at a time vs handling multiple collapses per mmap_lock 
write, so that readers aren't constantly getting queued?

The easiest answer would be to not do batching at all and leave the impact 
to readers up to the userspace doing the MADV_COLLAPSE :)  I was wondering 
if there was a better default behavior we could implement in the kernel, 
however.
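
For what it's worth, the two shapes being compared look roughly like this
(heavily simplified, with stubs standing in for the real locking and
collapse primitives; this is not the code from the series):

#include <stddef.h>

struct mm_struct;

/* Stubs for illustration only. */
static void mmap_write_lock(struct mm_struct *mm)   { (void)mm; }
static void mmap_write_unlock(struct mm_struct *mm) { (void)mm; }
static void collapse_one_pmd(struct mm_struct *mm, unsigned long addr)
{
        (void)mm; (void)addr;
}

#define HPAGE_PMD_SIZE  (2UL << 20)     /* assume 2 MiB PMDs */

/* (a) one collapse per write-lock hold: readers can slip in between PMDs */
static void collapse_unbatched(struct mm_struct *mm,
                               unsigned long start, unsigned long end)
{
        for (unsigned long addr = start; addr < end; addr += HPAGE_PMD_SIZE) {
                mmap_write_lock(mm);
                collapse_one_pmd(mm, addr);
                mmap_write_unlock(mm);
        }
}

/* (b) batched: fewer acquisitions, but each hold blocks readers for longer */
static void collapse_batched(struct mm_struct *mm, unsigned long start,
                             unsigned long end, unsigned int batch)
{
        unsigned long addr = start;

        while (addr < end) {
                mmap_write_lock(mm);
                for (unsigned int i = 0; i < batch && addr < end;
                     i++, addr += HPAGE_PMD_SIZE)
                        collapse_one_pmd(mm, addr);
                mmap_write_unlock(mm);
        }
}

int main(void)
{
        collapse_unbatched(NULL, 0, 4 * HPAGE_PMD_SIZE);
        collapse_batched(NULL, 0, 4 * HPAGE_PMD_SIZE, 2);
        return 0;
}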


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 18:46                 ` David Rientjes
  2022-03-10 18:58                   ` Zach O'Keefe
@ 2022-03-10 19:54                   ` Yang Shi
  2022-03-10 20:24                     ` Zach O'Keefe
  1 sibling, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-03-10 19:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 10:46 AM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 10 Mar 2022, Yang Shi wrote:
>
> > > This separates "async-hint" vs "sync-explicit" madvise requests.
> > > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > > the kernel how to treat memory in the future. The kernel uses
> > > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > > request, is free to define its own defrag semantics.
> > >
> > > This would allow flexibility to separately define async vs sync thp
> > > policies. For example, highly tuned userspace applications that are
> > > sensitive to unexpected latency might want to manage their hugepages
> > > utilization themselves, and ask khugepaged to stay away. There is no
> > > way in "always" mode to do this without setting VM_NOHUGEPAGE.
> >
> > I don't quite get why you set THP to always but don't want to
> > khugepaged do its job. It may be slow, I think this is why you
> > introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> > scan the same area, it just doesn't do any real work and waste some
> > cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> > being split, right? So khugepaged still plays a role to re-collapse
> > the area without calling MADV_COLLAPSE over again and again.
> >
>
> My only real concern for MADV_COLLAPSE was when the span being collapsed
> includes a mixture of both VM_HUGEPAGE and VM_NOHUGEPAGE.  Does this
> collapse over the eligible memory or does it fail entirely?
>
> I'd think it was the former, that we should respect VM_NOHUGEPAGE and only
> collapse eligible memory when doing MADV_COLLAPSE but now userspace
> struggles to know whether it was a partial collapse because of
> ineligibility or because we just couldn't allocate a hugepage.

Yes, I agree we should just try to collapse eligible vmas.

Since we are using madvise, we'd better follow its convention. We
could return different values for different failures, for example:
1. All vmas are collapsed successfully, return 0 (success)
2. Run into ineligible vma, return -EINVAL
3. Can't allocate hugepage, return -ENOMEM

Or just simply return 0 for success or a single error code for all
failure cases.

>
> It has the information to figure this out on its own, so given the use of
> VM_NOHUGEPAGE for non-MADV_NOHUGEPAGE purposes, I think it makes sense to
> simply ignore these vmas as part of the collapse request.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-10 19:26     ` David Rientjes
@ 2022-03-10 20:16       ` Matthew Wilcox
  2022-03-11  0:06         ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Matthew Wilcox @ 2022-03-10 20:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yang Shi, Zach O'Keefe, Alex Shi, David Hildenbrand,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Richard Henderson, Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> One concern might be the queueing of read locks needed for page faults 
> behind a collapser of a long range of memory that is otherwise looping 
> and repeatedly taking the write lock.

I would have thought that _not_ batching would improve this situation.
Unless our implementation of rwsems has changed since the last time I
looked, dropping-and-reacquiring a rwsem while there are pending readers
means you go to the end of the line and they all get to handle their
page faults.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 19:54                   ` Yang Shi
@ 2022-03-10 20:24                     ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-10 20:24 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Rientjes, Alex Shi, David Hildenbrand, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 11:54 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 10:46 AM David Rientjes <rientjes@google.com> wrote:
> >
> > On Thu, 10 Mar 2022, Yang Shi wrote:
> >
> > > > This separates "async-hint" vs "sync-explicit" madvise requests.
> > > > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > > > the kernel how to treat memory in the future. The kernel uses
> > > > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > > > request, is free to define its own defrag semantics.
> > > >
> > > > This would allow flexibility to separately define async vs sync thp
> > > > policies. For example, highly tuned userspace applications that are
> > > > sensitive to unexpected latency might want to manage their hugepages
> > > > utilization themselves, and ask khugepaged to stay away. There is no
> > > > way in "always" mode to do this without setting VM_NOHUGEPAGE.
> > >
> > > I don't quite get why you set THP to always but don't want to
> > > khugepaged do its job. It may be slow, I think this is why you
> > > introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> > > scan the same area, it just doesn't do any real work and waste some
> > > cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> > > being split, right? So khugepaged still plays a role to re-collapse
> > > the area without calling MADV_COLLAPSE over again and again.
> > >
> >
> > My only real concern for MADV_COLLAPSE was when the span being collapsed
> > includes a mixture of both VM_HUGEPAGE and VM_NOHUGEPAGE.  Does this
> > collapse over the eligible memory or does it fail entirely?
> >
> > I'd think it was the former, that we should respect VM_NOHUGEPAGE and only
> > collapse eligible memory when doing MADV_COLLAPSE but now userspace
> > struggles to know whether it was a partial collapse because of
> > ineligibility or because we just couldn't allocate a hugepage.
>
> Yes, I agree we should just try to collapse eligible vmas.
>
> Since we are using madvise, we'd better follow its convention. We
> could return different values for different failures, for example:
> 1. All vmas are collapsed successfully, return 0 (success)
> 2. Run into ineligible vma, return -EINVAL
> 3. Can't allocate hugepage, return -ENOMEM
>
> Or just simply return 0 for success or a single error code for all
> failure cases.
>

Different codes have a benefit (assuming -EINVAL takes precedence over
-EAGAIN, which AFAIK is the madvise convention for memory not being
available): a lazy user wouldn't need to read smaps on -EAGAIN; they
could just reissue the syscall over the same range at a later time.
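
A sketch of that lazy-caller pattern, assuming a convention (not settled
here) of -EINVAL for ineligible ranges and -EAGAIN when a hugepage
couldn't be allocated; the MADV_COLLAPSE value below is a placeholder
since the interface isn't merged:

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE   25      /* placeholder */
#endif

/* Retry on transient allocation failure; give up on ineligible ranges. */
static int collapse_with_retry(void *addr, size_t len, int max_tries)
{
        for (int i = 0; i < max_tries; i++) {
                if (!madvise(addr, len, MADV_COLLAPSE))
                        return 0;       /* collapsed */
                if (errno != EAGAIN)
                        return -1;      /* ineligible or unexpected failure */
                sleep(1);               /* -EAGAIN: try again later */
        }
        return -1;
}

int main(void)
{
        size_t len = 2UL << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        return collapse_with_retry(p, len, 5) ? 2 : 0;
}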

> >
> > It has the information to figure this out on its own, so given the use of
> > VM_NOHUGEPAGE for non-MADV_NOHUGEPAGE purposes, I think it makes sense to
> > simply ignore these vmas as part of the collapse request.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-10 20:16       ` Matthew Wilcox
@ 2022-03-11  0:06         ` Zach O'Keefe
  2022-03-25 16:51           ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-11  0:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Yang Shi, Alex Shi, David Hildenbrand,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Richard Henderson, Thomas Bogendoerfer

On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > One concern might be the queueing of read locks needed for page faults
> > behind a collapser of a long range of memory that is otherwise looping
> > and repeatedly taking the write lock.
>
> I would have thought that _not_ batching would improve this situation.
> Unless our implementation of rwsems has changed since the last time I
> looked, dropping-and-reacquiring a rwsem while there are pending readers
> means you go to the end of the line and they all get to handle their
> page faults.
>

Hey Matthew, thanks for the review / feedback.

I don't have great intuition here, so I'll try to put together a
simple synthetic test to get some data. Though the code would be
different, I can functionally approximate a non-batched approach with
a batch size of 1, and compare that against N.

My file-backed patches likewise weren't able to take advantage of
batching outside mmap lock contention, so the data should equally
apply there.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()
  2022-03-10 18:54     ` David Rientjes
@ 2022-03-21 14:27       ` Michal Hocko
  0 siblings, 0 replies; 57+ messages in thread
From: Michal Hocko @ 2022-03-21 14:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Zach O'Keefe, Alex Shi, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

[Dropped Richard Henderson from the CC list as the delivery fails for
 him]

On Thu 10-03-22 10:54:03, David Rientjes wrote:
> On Thu, 10 Mar 2022, David Hildenbrand wrote:
> 
> > On 08.03.22 22:34, Zach O'Keefe wrote:
> > > In madvise collapse context, we optionally want to be able to ignore
> > > advice from MADV_NOHUGEPAGE-marked regions.
> > > 
> > > Add a vm_flags_ignore argument to hugepage_vma_revalidate_pmd_count()
> > > which can be used to ignore vm flags used when considering thp
> > > eligibility.
> > 
> > arch/s390/mm/gmap.c:thp_split_mm() sets VM_NOHUGEPAGE to make sure there
> > are *really* no thp. Being able to bypass that would break KVM horribly.
> > 
> > Ignoring MADV_NOHUGEPAGE/VM_NOHUGEPAGE feels like the wrong way to go.
> > 
> 
> Agreed, we'll have to remove this possibility.

yeah, this sounds like a bug to me.

> > What about a prctl instead, to disable any khugepaged activity and just
> > let that process control it manually?
> > 
> 
> No objection to the prctl, although it's unfortunate that the existing 
> PR_SET_THP_DISABLE simply disables thp for the process entirely for any 
> non-zero value and that this wasn't implemented as a bitmask to specify 
> future behavior where this new behavior could be defined :/

I do not think PR_SET_THP_DISABLE is any different from VM_NOHUGEPAGE.
The process (owner) has opted out of THPs for different reasons.
Those might be unknown to whoever calls the madvise call (including
itself).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (13 preceding siblings ...)
  2022-03-08 21:34 ` [RFC PATCH 14/14] mm/madvise: add process_madvise(MADV_COLLAPSE) Zach O'Keefe
@ 2022-03-21 14:32 ` Zi Yan
  2022-03-21 14:51   ` Zach O'Keefe
  2022-03-21 14:37 ` Michal Hocko
  2022-03-22  6:40 ` Zach O'Keefe
  16 siblings, 1 reply; 57+ messages in thread
From: Zi Yan @ 2022-03-21 14:32 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	linux-mm, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer, Yang Shi


On 8 Mar 2022, at 16:34, Zach O'Keefe wrote:

> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and

Can we have a better name instead of MADV_COLLAPSE? It sounds like it is
destroying a huge page but is in fact doing the opposite. Something like
MADV_CREATE_HUGE_PAGE? I know the kernel functions use "collapse" everywhere,
but it might be better not to confuse the user.

Thanks.

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (14 preceding siblings ...)
  2022-03-21 14:32 ` [RFC PATCH 00/14] mm: userspace hugepage collapse Zi Yan
@ 2022-03-21 14:37 ` Michal Hocko
  2022-03-21 15:46   ` Zach O'Keefe
  2022-03-22  6:40 ` Zach O'Keefe
  16 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2022-03-21 14:37 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

[ Removed  Richard Henderson from the CC list as the delivery fails for
  his address]
On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> Introduction
> --------------------------------
> 
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
> 
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
> 
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> 
> Interface
> --------------------------------
> 
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
> 
> (*) process_madvise(2)
> 
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered.

Could you expand on reasoning here? The default allocation mode for #PF
is rather light. Madvised will try harder. The reasoning is that we want
to make stalls due to #PF as small as possible and only try harder for
madvised areas (also a subject of configuration). Wouldn't it make more
sense to try harder for an explicit call like madvise?

>	  When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.

I am not sure I follow here. Let's have a memory range spanning two
vmas, one with MADV_HUGEPAGE.

>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:

This part is discussed later in the thread.

> 
>         MADV_F_COLLAPSE_LIMITS
> 
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
> 
>         MADV_F_COLLAPSE_DEFRAG
> 
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.

Why do we need this?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-21 14:32 ` [RFC PATCH 00/14] mm: userspace hugepage collapse Zi Yan
@ 2022-03-21 14:51   ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-21 14:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka,
	linux-mm, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

On Mon, Mar 21, 2022 at 7:32 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Mar 2022, at 16:34, Zach O'Keefe wrote:
>
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
>
> Can we have a better name instead of MADV_COLLAPSE? It sounds like it is
> destroying a huge page but is in fact doing the opposite. Something like
> MADV_CREATE_HUGE_PAGE? I know the kernel functions use "collapse" everywhere,
> but it might be better not to confuse the user.
>

Hey Zi, thanks for reviewing / commenting. I briefly thought about
"coalesce", but "collapse" isn't just used within the kernel; it's
already part of existing user APIs such as the thp sysfs interface
(/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed),
vmstat (e.g. /proc/vmstat:thp_collapse_alloc[_failed]), per-memcg stats
(memory.stat:thp_collapse_alloc) and tracepoints (e.g.
mm_collapse_huge_page). I'm not married to it, though.


> Thanks.
>
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-21 14:37 ` Michal Hocko
@ 2022-03-21 15:46   ` Zach O'Keefe
  2022-03-22 12:11     ` Michal Hocko
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-21 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

Hey Michal, thanks for taking the time to review / comment.

On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [ Removed  Richard Henderson from the CC list as the delivery fails for
>   his address]

Thank you :)

> On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > leverages the new process_madvise(2) call.
> >
> > (*) process_madvise(2)
> >
> >         Performs a synchronous collapse of the native pages mapped by
> >         the list of iovecs into transparent hugepages. The default gfp
> >         flags used will be the same as those used at-fault for the VMA
> >         region(s) covered.
>
> Could you expand on reasoning here? The default allocation mode for #PF
> is rather light. Madvised will try harder. The reasoning is that we want
> to make stalls due to #PF as small as possible and only try harder for
> madvised areas (also a subject of configuration). Wouldn't it make more
> sense to try harder for an explicit call like madvise?
>

The reasoning is that the user has presumably configured system/vmas
to tell the kernel how badly they want thps, and so this call aligns
with current expectations. I.e. a user who goes to the trouble of
trying to fault-in a thp at a given memory address likely wants a thp
"as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
thp.

If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
used to explicitly request the kernel to try harder, as you mention.

> >         When multiple VMA regions are spanned, if
> >         faulting-in memory from any VMA would permit synchronous
> >         compaction and reclaim, then all hugepage allocations required
> >         to satisfy the request may enter compaction and reclaim.
>
> I am not sure I follow here. Let's have a memory range spanning two
> vmas, one with MADV_HUGEPAGE.

I think you are rightly confused here, since the code doesn't
currently match this description - thanks for pointing it out.

The idea was that, in the case you provided, the gfp flags used for
all thp allocations would match those used for a MADV_HUGEPAGE vma,
under current system settings. IOW, we treat the semantics of the
collapse for the entire range uniformly (aside from MADV_NOHUGEPAGE,
as per earlier discussions).

So, for example, if transparent_hugepage/enabled was set to "always"
and transparent_hugepage/defrag was set to "madvise", then all
allocations could enter direct reclaim. The reasoning for this is, #1
the user has already told us that entering direct reclaim is tolerable
for this syscall, and they can wait. #2 is that MADV_COLLAPSE might
yield confusing results otherwise; some ranges might get backed by
thps, while others may not. Also, a single MADV_HUGEPAGE vma early in
the range might permit enough reclaim/compaction that allows
successive non-MADV_HUGEPAGE allocations to succeed where they
otherwise may not have.

However, the code and this description disagree, since madvise
decomposes the call over multiple vmas into iterative
madvise_vma_behavior() over a single vma, with no state shared between
calls. If the motivation above is sufficient, then this could be
added.

>
> >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> >         by default, as the user is explicitly requesting this action.
> >         Define two flags to control collapse semantics, passed through
> >         process_madvise(2)’s optional flags parameter:
>
> This part is discussed later in the thread.
>
> >
> >         MADV_F_COLLAPSE_LIMITS
> >
> >         If supplied, collapse respects pte collapse limits set via
> >         sysfs:
> >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> >         Required if calling on behalf of another process and not
> >         CAP_SYS_ADMIN.
> >
> >         MADV_F_COLLAPSE_DEFRAG
> >
> >         If supplied, permit synchronous compaction and reclaim,
> >         regardless of VMA flags.
>
> Why do we need this?

Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?

* MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
inter-process protection for collapsing memory in another process'
address space (which a malevolent program could exploit to cause oom
conditions in another memcg hierarchy, for example), but we want
privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
utilization as they wish.

* MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
to explicitly tell the kernel to try harder to back this by thps,
regardless of the current system/vma configuration.

Note that when used together, these flags can be used to implement the
exact behavior of khugepaged, through MADV_COLLAPSE.
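
As a concrete illustration, here is a rough, hypothetical sketch of how a
userspace agent could combine the two proposed flags to get khugepaged-like
behavior for a range in another process. MADV_COLLAPSE and the
MADV_F_COLLAPSE_* names are only what this RFC proposes; the numeric values
below (and the SYS_process_madvise fallback) are placeholders/assumptions,
not anything from released kernel headers.

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise    440        /* asm-generic syscall number */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE          25         /* placeholder value */
#endif
#define MADV_F_COLLAPSE_LIMITS (1U << 0)  /* placeholder value */
#define MADV_F_COLLAPSE_DEFRAG (1U << 1)  /* placeholder value */

/*
 * Collapse [addr, addr + len) in the address space referred to by pidfd,
 * honoring khugepaged's max_ptes_* sysfs limits and permitting synchronous
 * compaction/reclaim -- i.e. roughly what khugepaged would do, but driven
 * from userspace at a time and place of the agent's choosing.
 */
static long collapse_range(int pidfd, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };

        return syscall(SYS_process_madvise, pidfd, &iov, 1UL, MADV_COLLAPSE,
                       MADV_F_COLLAPSE_LIMITS | MADV_F_COLLAPSE_DEFRAG);
}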

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
                   ` (15 preceding siblings ...)
  2022-03-21 14:37 ` Michal Hocko
@ 2022-03-22  6:40 ` Zach O'Keefe
  2022-03-22 12:05   ` Michal Hocko
  16 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-22  6:40 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	linux-mm, Yang Shi, Matthew Wilcox
  Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matt Turner, Max Filippov, Miaohe Lin,
	Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer

On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
>
> (*) process_madvise(2)
>
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered. When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.
>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:
>
>         MADV_F_COLLAPSE_LIMITS
>
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
>
>         MADV_F_COLLAPSE_DEFRAG
>
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.
>
> (*) madvise(2)
>
>         Equivalent to process_madvise(2) on self, with no flags
>         passed; pte collapse limits are ignored, and the gfp flags will
>         be the same as those used at-fault for the VMA region(s)
>         covered. Note that, users wanting different collapse semantics
>         can always use process_madvise(2) on themselves.
>
> Discussion
> --------------------------------
>
> The mechanism is fully compatible with khugepaged, allowing userspace to
> separately define synchronous and asynchronous hugepage policies, as
> priority dictates. It also naturally permits a DAMON scheme,
> DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
> system by backing the most frequently accessed memory by hugepages[2].
> Though not required to justify this series, hugepage management could be
> offloaded entirely to a sufficiently informed userspace agent,
> supplanting the need for khugepaged in the kernel.
>
> Along with the interface, this series proposes a batched implementation
> to collapse a range of memory. The motivation for this is to limit
> contention on mmap_lock, doing multiple page table modifications while
> the lock is held exclusively.
>
> Only private anonymous memory is supported by this series. File-backed
> memory support will be added later.
>
> Multiple hugepages support (such as 1 GiB gigantic hugepages) were not
> considered at this time, but could be supported by the flags parameter
> in the future.
>
> kselftests were omitted from this series for brevity, but would be
> included in an eventual patch submission.
>
> [2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
>
> Sequence of Patches
> --------------------------------
>
> Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
> introducing the notion of a collapse context and isolating logic that
> can be reused later in the series for the madvise collapse context.
>
> Patches 11-14 introduce logic for the proposed madvise collapse
> mechanism. Patch 11 adds madvise and header file plumbing. Patch 12 and
> 13, separately, add the core collapse logic, with the former introducing
> the overall batched approach and locking strategy, and the latter
> fills-in batch action details. This separation was purely to keep patch
> size down. Patch 14 adds process_madvise support.
>
> Applies against next-20220308.
>
> Zach O'Keefe (14):
>   mm/rmap: add mm_find_pmd_raw helper
>   mm/khugepaged: add struct collapse_control
>   mm/khugepaged: add __do_collapse_huge_page() helper
>   mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
>   mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
>   mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
>   mm/khugepaged: add vm_flags_ignore to
>     hugepage_vma_revalidate_pmd_count()
>   mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
>   mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
>   mm/khugepaged: rename khugepaged-specific/not functions
>   mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
>   mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
>   mm/madvise: add __madvise_collapse_*_batch() actions.
>   mm/madvise: add process_madvise(MADV_COLLAPSE)
>
>  fs/io_uring.c                          |   3 +-
>  include/linux/huge_mm.h                |  27 +-
>  include/linux/mm.h                     |   3 +-
>  include/uapi/asm-generic/mman-common.h |  10 +
>  mm/huge_memory.c                       |   2 +-
>  mm/internal.h                          |   1 +
>  mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
>  mm/madvise.c                           |  45 +-
>  mm/memory.c                            |   6 +-
>  mm/rmap.c                              |  15 +-
>  10 files changed, 842 insertions(+), 207 deletions(-)
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks to the many people who took the time to review and provide
feedback on this series.

In preparation of a V1 PATCH series which will incorporate the
feedback received here, one item I'd specifically like feedback from
the community on is whether support for privately-mapped anonymous
memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
with file-backed support coming later. I have local patches to support
file-backed memory, but my thought was to keep the series no longer
than necessary, for the consideration of reviewers. No substantial
infrastructure changes are required to support file-backed memory; it
naturally builds on top of the existing series (as it was developed
with file-backed support fleshed out).

Thanks,
Zach


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-22  6:40 ` Zach O'Keefe
@ 2022-03-22 12:05   ` Michal Hocko
  2022-03-23 13:30     ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2022-03-22 12:05 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Yang Shi, Matthew Wilcox, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Thomas Bogendoerfer

On Mon 21-03-22 23:40:39, Zach O'Keefe wrote:
[...]
> In preparation of a V1 PATCH series which will incorporate the
> feedback received here, one item I'd specifically like feedback from
> the community on is whether support for privately-mapped anonymous
> memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
> with file-backed support coming later.

Yes I think this should be sufficient for the initial implementation.

> I have local patches to support
> file-backed memory, but my thought was to keep the series no longer
> than necessary, for the consideration of reviewers.

Agreed! I think we should focus on the semantics of the anonymous memory
first.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-21 15:46   ` Zach O'Keefe
@ 2022-03-22 12:11     ` Michal Hocko
  2022-03-22 15:53       ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2022-03-22 12:11 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> Hey Michal, thanks for taking the time to review / comment.
> 
> On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [ Removed  Richard Henderson from the CC list as the delivery fails for
> >   his address]
> 
> Thank you :)
> 
> > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > Introduction
> > > --------------------------------
> > >
> > > This series provides a mechanism for userspace to induce a collapse of
> > > eligible ranges of memory into transparent hugepages in process context,
> > > thus permitting users to more tightly control their own hugepage
> > > utilization policy at their own expense.
> > >
> > > This idea was previously introduced by David Rientjes, and thanks to
> > > everyone for your patience while I prepared these patches resulting from
> > > that discussion[1].
> > >
> > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > >
> > > Interface
> > > --------------------------------
> > >
> > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > leverages the new process_madvise(2) call.
> > >
> > > (*) process_madvise(2)
> > >
> > >         Performs a synchronous collapse of the native pages mapped by
> > >         the list of iovecs into transparent hugepages. The default gfp
> > >         flags used will be the same as those used at-fault for the VMA
> > >         region(s) covered.
> >
> > Could you expand on reasoning here? The default allocation mode for #PF
> > is rather light. Madvised will try harder. The reasoning is that we want
> > to make stalls due to #PF as small as possible and only try harder for
> > madvised areas (also a subject of configuration). Wouldn't it make more
> > sense to try harder for an explicit call like madvise?
> >
> 
> The reasoning is that the user has presumably configured system/vmas
> to tell the kernel how badly they want thps, and so this call aligns
> with current expectations. I.e. a user who goes to the trouble of
> trying to fault-in a thp at a given memory address likely wants a thp
> "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> thp.

If the syscall tries only as hard as the #PF doesn't that limit the
functionality? I mean a non #PF can consume more resources to allocate
and collapse a THP as it won't inflict any measurable latency on the
target process (except for potential CPU contention). From that
perspective madvise is much more similar to khugepaged. I would even
argue that it could try even harder because madvise is focused on a very
specific memory range and the execution is not shared among all
processes that are scanned by khugepaged.

> If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> used to explicitly request the kernel to try harder, as you mention.

Do we really need that? How many do_harder levels do we want to support?

What would be typical usecases for #PF based and DEFRAG usages?

[...]

> > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > >         by default, as the user is explicitly requesting this action.
> > >         Define two flags to control collapse semantics, passed through
> > >         process_madvise(2)’s optional flags parameter:
> >
> > This part is discussed later in the thread.
> >
> > >
> > >         MADV_F_COLLAPSE_LIMITS
> > >
> > >         If supplied, collapse respects pte collapse limits set via
> > >         sysfs:
> > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > >         Required if calling on behalf of another process and not
> > >         CAP_SYS_ADMIN.
> > >
> > >         MADV_F_COLLAPSE_DEFRAG
> > >
> > >         If supplied, permit synchronous compaction and reclaim,
> > >         regardless of VMA flags.
> >
> > Why do we need this?
> 
> Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> 
> * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> inter-process protection for collapsing memory in another process'
> address space (which a malevolent program could exploit to cause oom
> conditions in another memcg hierarchy, for example), but we want
> privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> utilization as they wish.

Could you expand some more please? How is this any different from
khugepaged (well, except that you can trigger the collapsing explicitly
rather than rely on khugepaged to find that mm)?

> * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> to explicitly tell the kernel to try harder to back this by thps,
> regardless of the current system/vma configuration.
> 
> Note that when used together, these flags can be used to implement the
> exact behavior of khugepaged, through MADV_COLLAPSE.

IMHO this is stretching the interface and this can backfire in the
future. The interface should be really trivial. I want to collapse a
memory area. Let the kernel do the right thing and do not bother with
all the implementation details. I would use the same allocation strategy
as khugepaged, as this seems to be the closest from the latency and
application-awareness POV. In a way you can look at the madvise call as
a way to trigger khugepaged functionality on the particular memory range.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-22 12:11     ` Michal Hocko
@ 2022-03-22 15:53       ` Zach O'Keefe
  2022-03-29 12:24         ` Michal Hocko
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-22 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > Hey Michal, thanks for taking the time to review / comment.
> >
> > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [ Removed  Richard Henderson from the CC list as the delivery fails for
> > >   his address]
> >
> > Thank you :)
> >
> > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > Introduction
> > > > --------------------------------
> > > >
> > > > This series provides a mechanism for userspace to induce a collapse of
> > > > eligible ranges of memory into transparent hugepages in process context,
> > > > thus permitting users to more tightly control their own hugepage
> > > > utilization policy at their own expense.
> > > >
> > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > everyone for your patience while I prepared these patches resulting from
> > > > that discussion[1].
> > > >
> > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > > >
> > > > Interface
> > > > --------------------------------
> > > >
> > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > leverages the new process_madvise(2) call.
> > > >
> > > > (*) process_madvise(2)
> > > >
> > > >         Performs a synchronous collapse of the native pages mapped by
> > > >         the list of iovecs into transparent hugepages. The default gfp
> > > >         flags used will be the same as those used at-fault for the VMA
> > > >         region(s) covered.
> > >
> > > Could you expand on reasoning here? The default allocation mode for #PF
> > > is rather light. Madvised will try harder. The reasoning is that we want
> > > to make stalls due to #PF as small as possible and only try harder for
> > > madvised areas (also a subject of configuration). Wouldn't it make more
> > > sense to try harder for an explicit call like madvise?
> > >
> >
> > The reasoning is that the user has presumably configured system/vmas
> > to tell the kernel how badly they want thps, and so this call aligns
> > with current expectations. I.e. a user who goes to the trouble of
> > trying to fault-in a thp at a given memory address likely wants a thp
> > "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> > thp.
>
> If the syscall tries only as hard as the #PF doesn't that limit the
> functionality?

I'd argue that the various allocation semantics possible through
existing thp knobs / vma flags, in addition to the proposed
MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
work with. Relatively speaking, in what way would we be lacking
functionality?

> I mean a non #PF can consume more resources to allocate
> and collapse a THP as it won't inflict any measurable latency on the
> target process (except for potential CPU contention).

Sorry, I'm not sure I understand this. What latency are we discussing
at this point? Do you mean to say that since MADV_COLLAPSE isn't in
the fault path, it doesn't necessarily need to be fast, and that direct
reclaim wouldn't be noticed?

> From that
> perspective madvise is much more similar to khugepaged. I would even
> argue that it could try even harder because madvise is focused on a very
> specific memory range and the execution is not shared among all
> processes that are scanned by khugepaged.
>

Good point. Covered at the end.

> > If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> > used to explicitly request the kernel to try harder, as you mention.
>
> Do we really need that? How many do_harder levels do we want to support?
>
> What would be typical usecases for #PF based and DEFRAG usages?
>

Thanks for challenging this. Covered at the end.

> [...]
>
> > > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > > >         by default, as the user is explicitly requesting this action.
> > > >         Define two flags to control collapse semantics, passed through
> > > >         process_madvise(2)’s optional flags parameter:
> > >
> > > This part is discussed later in the thread.
> > >
> > > >
> > > >         MADV_F_COLLAPSE_LIMITS
> > > >
> > > >         If supplied, collapse respects pte collapse limits set via
> > > >         sysfs:
> > > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > > >         Required if calling on behalf of another process and not
> > > >         CAP_SYS_ADMIN.
> > > >
> > > >         MADV_F_COLLAPSE_DEFRAG
> > > >
> > > >         If supplied, permit synchronous compaction and reclaim,
> > > >         regardless of VMA flags.
> > >
> > > Why do we need this?
> >
> > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> >
> > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > inter-process protection for collapsing memory in another process'
> > address space (which a malevolent program could exploit to cause oom
> > conditions in another memcg hierarchy, for example), but we want
> > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > utilization as they wish.
>
> Could you expand some more please? How is this any different from
> khugepaged (well, except that you can trigger the collapsing explicitly
> rather than rely on khugepaged to find that mm)?
>

MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
extend khugepaged in userspace, where the benefit is precisely that we
can choose that mm/vma more intelligently.

> > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > to explicitly tell the kernel to try harder to back this by thps,
> > regardless of the current system/vma configuration.
> >
> > Note that when used together, these flags can be used to implement the
> > exact behavior of khugepaged, through MADV_COLLAPSE.
>
> IMHO this is stretching the interface and this can backfire in the
> future. The interface should be really trivial. I want to collapse a
> memory area. Let the kernel do the right thing and do not bother with
> all the implementation details. I would use the same allocation strategy
> as khugepaged, as this seems to be the closest from the latency and
> application-awareness POV. In a way you can look at the madvise call as
> a way to trigger khugepaged functionality on the particular memory range.

Trying to summarize a few earlier comments centering around
MADV_F_COLLAPSE_DEFRAG and allocation semantics.

This series presupposes the existence of an informed userspace agent
that is aware of what processes/memory ranges would benefit most from
thps. Such an agent might either be:
(1) A system-level daemon optimizing thp utilization system-wide
(2) A highly tuned process / malloc implementation optimizing its
own thp usage

The different types of agents reflect the divide between #PF and
DEFRAG semantics.

For (1), we want to view this exactly like triggering khugepaged
functionality from userspace, and likely want DEFRAG semantics.

For (2), I was viewing this as the "live" symmetric counterpart to
at-fault thp allocation where the process has decided, at runtime,
that this memory could benefit from thp backing, and so #PF semantics
seemed like sane default. I'd worry that using DEFRAG semantics by
default might deter adoption by users who might not be willing to wait
an unbounded amount of time for direct reclaim.
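
For case (2), the call site would be as simple as the sketch below: a
process (or its allocator) deciding at runtime that a region it owns has
become hot and asking for hugepage backing with plain madvise(2), i.e.
with the lighter at-fault allocation semantics. Again, MADV_COLLAPSE is
only the mode proposed by this RFC and the value is a placeholder.

#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25   /* placeholder value */
#endif

/* Hint that a hot, hugepage-aligned arena we own should be THP-backed. */
static int collapse_own_arena(void *arena, size_t len)
{
        return madvise(arena, len, MADV_COLLAPSE);
}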


> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-22 12:05   ` Michal Hocko
@ 2022-03-23 13:30     ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-23 13:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Yang Shi, Matthew Wilcox, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel,
	Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Thomas Bogendoerfer

On Tue, Mar 22, 2022 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-03-22 23:40:39, Zach O'Keefe wrote:
> [...]
> > In preparation of a V1 PATCH series which will incorporate the
> > feedback received here, one item I'd specifically like feedback from
> > the community on is whether support for privately-mapped anonymous
> > memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
> > with file-backed support coming later.
>
> Yes I think this should be sufficient for the initial implementation.
>
> > I have local patches to support
> > file-backed memory, but my thought was to keep the series no longer
> > than necessary, for the consideration of reviewers.
>
> Agreed! I think we should focus on the semantics of the anonymous memory
> first.

Great! Sounds good to me and thanks again for the review & feedback.

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-11  0:06         ` Zach O'Keefe
@ 2022-03-25 16:51           ` Zach O'Keefe
  2022-03-25 19:54             ` Yang Shi
  0 siblings, 1 reply; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-25 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Yang Shi, Alex Shi, David Hildenbrand,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Thomas Bogendoerfer

Hey All,

Sorry for the delay. So, I ran some synthetic tests on a dual socket
Skylake with configured batch sizes of 1, 8, 32, and 64. Basic setup
was: 1 thread continuously madvise(MADV_COLLAPSE)'ing memory, 20
threads continuously faulting-in pages, and some basic synchronization
so that all threads follow a "only do work when all other threads have
work to do" model (i.e. so we don't measure faults in the absence of
simultaneous collapses, or vice versa). I used bpftrace attached to
tracepoint:mmap_lock to measure r/w mmap_lock contention over 20
minutes.
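
For reference, the synchronization followed roughly the shape sketched
below. This is only an illustration of the "only do work when all other
threads have work to do" idea, not the actual harness; the
barrier-per-round structure and naming are assumptions.

#include <pthread.h>

#define NR_FAULT_THREADS 20

/* Initialized in main() with NR_FAULT_THREADS + 1 parties. */
static pthread_barrier_t round_barrier;

static void *fault_worker(void *arg)
{
        for (;;) {
                /* Wait until the collapse thread also has work queued. */
                pthread_barrier_wait(&round_barrier);
                /* ... touch pages in the test region to fault them in ... */
        }
        return NULL;
}

static void *collapse_worker(void *arg)
{
        for (;;) {
                /* Wait until all fault threads are ready to fault. */
                pthread_barrier_wait(&round_barrier);
                /* ... madvise(MADV_COLLAPSE) the next chunk of memory ... */
        }
        return NULL;
}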

Assuming we want to optimize for fault-path readers, the results are
pretty clear: BATCH-1 outperforms BATCH-8, BATCH-32, and BATCH-64 by
254%, 381%, and 425% respectively, in terms of mean time for
fault-threads to acquire mmap_lock in read, while also having less
tail latency (didn't calculate, just looked at bpftrace histograms).
If we cared at all about madvise(MADV_COLLAPSE) performance, then
BATCH-1 is 83-86% as fast as the others and holds mmap_lock in write
for about the same amount of time in aggregate (~0 +/- 2%).

I've included the bpftrace histograms for fault-threads acquiring
mmap_lock in read at the end for posterity, and can provide more data
/ info if folks are interested.

In light of these results, I'll rework the code to iteratively operate
on single hugepages, which should have the added benefit of
considerably simplifying the code for an imminent V1 series.

Thanks,
Zach

bpftrace data:

/*****************************************************************************/
batch size: 1

@mmap_lock_r_acquire[fault-thread]:
[128, 256)          1254 |                                                    |
[256, 512)       2691261 |@@@@@@@@@@@@@@@@@                                   |
[512, 1K)        2969500 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)         1794738 |@@@@@@@@@@@                                         |
[2K, 4K)         1590984 |@@@@@@@@@@                                          |
[4K, 8K)         3273349 |@@@@@@@@@@@@@@@@@@@@@                               |
[8K, 16K)         851467 |@@@@@                                               |
[16K, 32K)        460653 |@@                                                  |
[32K, 64K)          7274 |                                                    |
[64K, 128K)           25 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)       8085437 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)          381735 |@@                                                  |
[2M, 4M)              28 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 22107705, average
326480, total 7217743234867

/*****************************************************************************/
batch size: 8

@mmap_lock_r_acquire[fault-thread]:
[128, 256)            55 |                                                    |
[256, 512)        247028 |@@@@@@                                              |
[512, 1K)         239083 |@@@@@@                                              |
[1K, 2K)          142296 |@@@                                                 |
[2K, 4K)          153149 |@@@@                                                |
[4K, 8K)         1899396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)        1780734 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[16K, 32K)         95645 |@@                                                  |
[32K, 64K)          1933 |                                                    |
[64K, 128K)            3 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)         1132899 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[8M, 16M)           3953 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 5696174, average
1156055, total 6585091744973

/*****************************************************************************/
batch size: 32

@mmap_lock_r_acquire[fault-thread]:
[128, 256)            35 |                                                    |
[256, 512)         63413 |@                                                   |
[512, 1K)          78130 |@                                                   |
[1K, 2K)           39548 |                                                    |
[2K, 4K)           44331 |                                                    |
[4K, 8K)         2398751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)        1316932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[16K, 32K)         54798 |@                                                   |
[32K, 64K)           771 |                                                    |
[64K, 128K)            2 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)        280791 |@@@@@@                                              |
[32M, 64M)           809 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 4278311, average
1571585, total 6723733081824

/*****************************************************************************/
batch size: 64

@mmap_lock_r_acquire[fault-thread]:
[256, 512)         30303 |                                                    |
[512, 1K)          42366 |@                                                   |
[1K, 2K)           23679 |                                                    |
[2K, 4K)           22781 |                                                    |
[4K, 8K)         1637566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[8K, 16K)        1955773 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         41832 |@                                                   |
[32K, 64K)           563 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)        140723 |@@@                                                 |
[64M, 128M)           77 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 3895663, average
1715797, total 6684170171691

On Thu, Mar 10, 2022 at 4:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > > One concern might be the queueing of read locks needed for page faults
> > > behind a collapser of a long range of memory that is otherwise looping
> > > and repeatedly taking the write lock.
> >
> > I would have thought that _not_ batching would improve this situation.
> > Unless our implementation of rwsems has changed since the last time I
> > looked, dropping-and-reacquiring a rwsem while there are pending readers
> > means you go to the end of the line and they all get to handle their
> > page faults.
> >
>
> Hey Matthew, thanks for the review / feedback.
>
> I don't have great intuition here, so I'll try to put together a
> simple synthetic test to get some data. Though the code would be
> different, I can functionally approximate a non-batched approach with
> a batch size of 1, and compare that against N.
>
> My file-backed patches likewise weren't able to take advantage of
> batching outside mmap lock contention, so the data should equally
> apply there.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  2022-03-25 16:51           ` Zach O'Keefe
@ 2022-03-25 19:54             ` Yang Shi
  0 siblings, 0 replies; 57+ messages in thread
From: Yang Shi @ 2022-03-25 19:54 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Matthew Wilcox, David Rientjes, Alex Shi, David Hildenbrand,
	Michal Hocko, Pasha Tatashin, SeongJae Park, Song Liu,
	Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli,
	Andrew Morton, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
	Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
	James E.J. Bottomley, Jens Axboe, Kirill A. Shutemov,
	Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
	Pavel Begunkov, Peter Xu, Thomas Bogendoerfer

On Fri, Mar 25, 2022 at 9:51 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Hey All,
>
> Sorry for the delay. So, I ran some synthetic tests on a dual socket
> Skylake with configured batch sizes of 1, 8, 32, and 64. Basic setup
> was: 1 thread continuously madvise(MADV_COLLAPSE)'ing memory, 20
> threads continuously faulting-in pages, and some basic synchronization
> so that all threads follow a "only do work when all other threads have
> work to do" model (i.e. so we don't measure faults in the absence of
> simultaneous collapses, or vice versa). I used bpftrace attached to
> tracepoint:mmap_lock to measure r/w mmap_lock contention over 20
> minutes.
>
> Assuming we want to optimize for fault-path readers, the results are
> pretty clear: BATCH-1 outperforms BATCH-8, BATCH-32, and BATCH-64 by
> 254%, 381%, and 425% respectively, in terms of mean time for
> fault-threads to acquire mmap_lock in read, while also having less
> tail latency (didn't calculate, just looked at bpftrace histograms).
> If we cared at all about madvise(MADV_COLLAPSE) performance, then
> BATCH-1 is 83-86% as fast as the others and holds mmap_lock in write
> for about the same amount of time in aggregate (~0 +/- 2%).
>
> I've included the bpftrace histograms for fault-threads acquiring
> mmap_lock in read at the end for posterity, and can provide more data
> / info if folks are interested.
>
> In light of these results, I'll rework the code to iteratively operate
> on single hugepages, which should have the added benefit of
> considerably simplifying the code for an imminent V1 series.

Thanks for the data. Yeah, I agree this is the best tradeoff.

>
> Thanks,
> Zach
>
> bpftrace data:
>
> /*****************************************************************************/
> batch size: 1
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256)          1254 |                                                    |
> [256, 512)       2691261 |@@@@@@@@@@@@@@@@@                                   |
> [512, 1K)        2969500 |@@@@@@@@@@@@@@@@@@@                                 |
> [1K, 2K)         1794738 |@@@@@@@@@@@                                         |
> [2K, 4K)         1590984 |@@@@@@@@@@                                          |
> [4K, 8K)         3273349 |@@@@@@@@@@@@@@@@@@@@@                               |
> [8K, 16K)         851467 |@@@@@                                               |
> [16K, 32K)        460653 |@@                                                  |
> [32K, 64K)          7274 |                                                    |
> [64K, 128K)           25 |                                                    |
> [128K, 256K)           0 |                                                    |
> [256K, 512K)           0 |                                                    |
> [512K, 1M)       8085437 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1M, 2M)          381735 |@@                                                  |
> [2M, 4M)              28 |                                                    |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 22107705, average
> 326480, total 7217743234867
>
> /*****************************************************************************/
> batch size: 8
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256)            55 |                                                    |
> [256, 512)        247028 |@@@@@@                                              |
> [512, 1K)         239083 |@@@@@@                                              |
> [1K, 2K)          142296 |@@@                                                 |
> [2K, 4K)          153149 |@@@@                                                |
> [4K, 8K)         1899396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)        1780734 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
> [16K, 32K)         95645 |@@                                                  |
> [32K, 64K)          1933 |                                                    |
> [64K, 128K)            3 |                                                    |
> [128K, 256K)           0 |                                                    |
> [256K, 512K)           0 |                                                    |
> [512K, 1M)             0 |                                                    |
> [1M, 2M)               0 |                                                    |
> [2M, 4M)               0 |                                                    |
> [4M, 8M)         1132899 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
> [8M, 16M)           3953 |                                                    |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 5696174, average
> 1156055, total 6585091744973
>
> /*****************************************************************************/
> batch size: 32
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256)            35 |                                                    |
> [256, 512)         63413 |@                                                   |
> [512, 1K)          78130 |@                                                   |
> [1K, 2K)           39548 |                                                    |
> [2K, 4K)           44331 |                                                    |
> [4K, 8K)         2398751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)        1316932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
> [16K, 32K)         54798 |@                                                   |
> [32K, 64K)           771 |                                                    |
> [64K, 128K)            2 |                                                    |
> [128K, 256K)           0 |                                                    |
> [256K, 512K)           0 |                                                    |
> [512K, 1M)             0 |                                                    |
> [1M, 2M)               0 |                                                    |
> [2M, 4M)               0 |                                                    |
> [4M, 8M)               0 |                                                    |
> [8M, 16M)              0 |                                                    |
> [16M, 32M)        280791 |@@@@@@                                              |
> [32M, 64M)           809 |                                                    |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 4278311, average
> 1571585, total 6723733081824
>
> /*****************************************************************************/
> batch size: 64
>
> @mmap_lock_r_acquire[fault-thread]:
> [256, 512)         30303 |                                                    |
> [512, 1K)          42366 |@                                                   |
> [1K, 2K)           23679 |                                                    |
> [2K, 4K)           22781 |                                                    |
> [4K, 8K)         1637566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> [8K, 16K)        1955773 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)         41832 |@                                                   |
> [32K, 64K)           563 |                                                    |
> [64K, 128K)            0 |                                                    |
> [128K, 256K)           0 |                                                    |
> [256K, 512K)           0 |                                                    |
> [512K, 1M)             0 |                                                    |
> [1M, 2M)               0 |                                                    |
> [2M, 4M)               0 |                                                    |
> [4M, 8M)               0 |                                                    |
> [8M, 16M)              0 |                                                    |
> [16M, 32M)             0 |                                                    |
> [32M, 64M)        140723 |@@@                                                 |
> [64M, 128M)           77 |                                                    |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 3895663, average
> 1715797, total 6684170171691
>
> On Thu, Mar 10, 2022 at 4:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > > > One concern might be the queueing of read locks needed for page faults
> > > > behind a collapser of a long range of memory that is otherwise looping
> > > > and repeatedly taking the write lock.
> > >
> > > I would have thought that _not_ batching would improve this situation.
> > > Unless our implementation of rwsems has changed since the last time I
> > > looked, dropping-and-reacquiring a rwsem while there are pending readers
> > > means you go to the end of the line and they all get to handle their
> > > page faults.
> > >
> >
> > Hey Matthew, thanks for the review / feedback.
> >
> > I don't have great intuition here, so I'll try to put together a
> > simple synthetic test to get some data. Though the code would be
> > different, I can functionally approximate a non-batched approach with
> > a batch size of 1, and compare that against N.
> >
> > My file-backed patches likewise weren't able to take advantage of
> > batching beyond reducing mmap lock contention, so the data should
> > apply equally there.
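For reference, a minimal sketch (in kernel-style C, not the actual code
from this series) of the batched locking pattern being measured above;
collapse_one_pmd() is a hypothetical placeholder, and a batch size of 1
approximates the non-batched approach:

/*
 * Sketch only: the mmap write lock is taken once per batch of PMDs and
 * dropped between batches so that queued page-fault readers can drain.
 */
static void collapse_range_batched(struct mm_struct *mm,
                                   unsigned long start, unsigned long end,
                                   unsigned int batch)
{
        unsigned long addr = start;

        while (addr < end) {
                unsigned int i;

                mmap_write_lock(mm);
                for (i = 0; i < batch && addr < end;
                     i++, addr += HPAGE_PMD_SIZE)
                        collapse_one_pmd(mm, addr); /* hypothetical helper */
                mmap_write_unlock(mm);

                cond_resched(); /* let pending readers take the lock */
        }
}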


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-22 15:53       ` Zach O'Keefe
@ 2022-03-29 12:24         ` Michal Hocko
  2022-03-30  0:36           ` Zach O'Keefe
  0 siblings, 1 reply; 57+ messages in thread
From: Michal Hocko @ 2022-03-29 12:24 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

On Tue 22-03-22 08:53:35, Zach O'Keefe wrote:
> On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > > Hey Michal, thanks for taking the time to review / comment.
> > >
> > > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > [ Removed  Richard Henderson from the CC list as the delivery fails for
> > > >   his address]
> > >
> > > Thank you :)
> > >
> > > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > > Introduction
> > > > > --------------------------------
> > > > >
> > > > > This series provides a mechanism for userspace to induce a collapse of
> > > > > eligible ranges of memory into transparent hugepages in process context,
> > > > > thus permitting users to more tightly control their own hugepage
> > > > > utilization policy at their own expense.
> > > > >
> > > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > > everyone for your patience while I prepared these patches resulting from
> > > > > that discussion[1].
> > > > >
> > > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > > > >
> > > > > Interface
> > > > > --------------------------------
> > > > >
> > > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > > leverages the new process_madvise(2) call.
> > > > >
> > > > > (*) process_madvise(2)
> > > > >
> > > > >         Performs a synchronous collapse of the native pages mapped by
> > > > >         the list of iovecs into transparent hugepages. The default gfp
> > > > >         flags used will be the same as those used at-fault for the VMA
> > > > >         region(s) covered.
> > > >
> > > > Could you expand on reasoning here? The default allocation mode for #PF
> > > > is rather light. Madvised will try harder. The reasoning is that we want
> > > > to make stalls due to #PF as small as possible and only try harder for
> > > > madvised areas (also a subject of configuration). Wouldn't it make more
> > > > sense to try harder for an explicit call like madvise?
> > > >
> > >
> > > The reasoning is that the user has presumably configured system/vmas
> > > to tell the kernel how badly they want thps, and so this call aligns
> > > with current expectations. I.e. a user who goes to the trouble of
> > > trying to fault-in a thp at a given memory address likely wants a thp
> > > "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> > > thp.
> >
> > If the syscall tries only as hard as the #PF doesn't that limit the
> > functionality?
> 
> I'd argue that the various allocation semantics possible through
> existing thp knobs / vma flags, in addition to the proposed
> MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
> work with. Relatively speaking, in what way would we be lacking
> functionality?

Flexibility is definitely a plus but look at our existing configuration
space and try to wrap your head around that.

> > I mean a non-#PF collapse can consume more resources to allocate
> > and collapse a THP, as it won't inflict any measurable latency on the
> > targeted process (except for potential CPU contention).
> 
> Sorry, I'm not sure I understand this. What latency are we discussing
> here? Do you mean to say that since MADV_COLLAPSE isn't in
> the fault path, it doesn't necessarily need to be fast / direct
> reclaim wouldn't be noticed?

Exactly. Same as khugepaged. I would even argue that khugepaged and
madvise had better behave consistently, because in both cases it is a
remote operation to create THPs: one triggered automatically, the other
explicitly requested by userspace. Having a third mode (for madvise)
would add more to the configuration space, and thus more complexity.
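To make the comparison concrete, the khugepaged-consistent choice would
look roughly like the following (a sketch only: khugepaged_defrag() and
the GFP_TRANSHUGE* masks are the existing kernel symbols, while the
wrapper name is illustrative, not from this series):

/*
 * Sketch: khugepaged-style gfp selection for a collapse, as opposed to
 * reusing the lighter at-fault mask. khugepaged_defrag() reflects
 * /sys/kernel/mm/transparent_hugepage/khugepaged/defrag; only when it
 * is set does the mask include __GFP_DIRECT_RECLAIM, i.e. permit
 * direct reclaim/compaction.
 */
static gfp_t collapse_gfp_mask(void)
{
        return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
}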
[...]
> > > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> > >
> > > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > > inter-process protection for collapsing memory in another process'
> > > address space (which a malevolent program could exploit to cause oom
> > > conditions in another memcg hierarchy, for example), but we want
> > > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > > utilization as they wish.
> >
> > Could you expand some more please? How is this any different from
> > khugepaged (well, except that you can trigger the collapsing explicitly
> > rather than rely on khugepaged to find that mm)?
> >
> 
> MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
> extend khugepaged in userspace, where the benefit is precisely that we
> can choose that mm/vma more intelligently.

Could you elaborate some more?

> > > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > > to explicitly tell the kernel to try harder to back this by thps,
> > > regardless of the current system/vma configuration.
> > >
> > > Note that when used together, these flags can be used to implement the
> > > exact behavior of khugepaged, through MADV_COLLAPSE.
> >
> > IMHO this is stretching the interface and this can backfire in the
> > future. The interface should be really trivial. I want to collapse a
> > memory area. Let the kernel do the right thing and do not bother with
> > all the implementation details. I would use the same allocation strategy
> > as khugepaged, as this seems to be closest from the latency and
> > application awareness POV. In a way you can look at the madvise call as
> > a way to trigger khugepaged functionality on the particular memory range.
> 
> Trying to summarize a few earlier comments centering around
> MADV_F_COLLAPSE_DEFRAG and allocation semantics.
> 
> This series presupposes the existence of an informed userspace agent
> that is aware of what processes/memory ranges would benefit most from
> thps. Such an agent might either be:
> (1) A system-level daemon optimizing thp utilization system-wide
> (2) A highly tuned process / malloc implementation optimizing their
> own thp usage
> 
> > The different types of agents reflect the divide between #PF and
> DEFRAG semantics.
> 
> For (1), we want to view this exactly like triggering khugepaged
> functionality from userspace, and likely want DEFRAG semantics.
> 
> For (2), I was viewing this as the "live" symmetric counterpart to
> at-fault thp allocation where the process has decided, at runtime,
> that this memory could benefit from thp backing, and so #PF semantics
> > seemed like a sane default. I'd worry that using DEFRAG semantics by
> default might deter adoption by users who might not be willing to wait
> an unbounded amount of time for direct reclaim.

This time is not really unbounded. Even in defrag mode, THP allocation
doesn't try as hard as e.g. hugetlb allocations do.

For your 2) category I am not really sure I see the point. Why would
you want to rely on madvise in a lightweight allocation mode when this
has already been done at #PF time? If an application really knows it
wants to use THPs, then madvise(MADV_HUGEPAGE) would be the first thing
to do. This would already tell the #PF path to try a bit harder in some
configurations, and would let khugepaged know that collapsing this
memory makes sense.

That being said, I would be really careful about providing an extended
interface to control how hard to try to allocate a THP. This has a high
risk of externalizing internal implementation details of how compaction
works. Unless we have a strong real-life usecase I would go with the
khugepaged semantics initially. Maybe we will learn about future
usecases where a very lightweight allocation mode is required, but that
can be added later on. The simpler the interface is initially, the
better.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
  2022-03-29 12:24         ` Michal Hocko
@ 2022-03-30  0:36           ` Zach O'Keefe
  0 siblings, 0 replies; 57+ messages in thread
From: Zach O'Keefe @ 2022-03-30  0:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, James E.J. Bottomley, Jens Axboe,
	Kirill A. Shutemov, Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi

Hey Michal,

Thanks again for taking the time to discuss and align on this last point.

On Tue, Mar 29, 2022 at 5:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-03-22 08:53:35, Zach O'Keefe wrote:
> > On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > > > Hey Michal, thanks for taking the time to review / comment.
> > > >
> > > > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > [ Removed  Richard Henderson from the CC list as the delivery fails for
> > > > >   his address]
> > > >
> > > > Thank you :)
> > > >
> > > > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > > > Introduction
> > > > > > --------------------------------
> > > > > >
> > > > > > This series provides a mechanism for userspace to induce a collapse of
> > > > > > eligible ranges of memory into transparent hugepages in process context,
> > > > > > thus permitting users to more tightly control their own hugepage
> > > > > > utilization policy at their own expense.
> > > > > >
> > > > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > > > everyone for your patience while I prepared these patches resulting from
> > > > > > that discussion[1].
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > > > > >
> > > > > > Interface
> > > > > > --------------------------------
> > > > > >
> > > > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > > > leverages the new process_madvise(2) call.
> > > > > >
> > > > > > (*) process_madvise(2)
> > > > > >
> > > > > >         Performs a synchronous collapse of the native pages mapped by
> > > > > >         the list of iovecs into transparent hugepages. The default gfp
> > > > > >         flags used will be the same as those used at-fault for the VMA
> > > > > >         region(s) covered.
> > > > >
> > > > > Could you expand on reasoning here? The default allocation mode for #PF
> > > > > is rather light. Madvised will try harder. The reasoning is that we want
> > > > > to make stalls due to #PF as small as possible and only try harder for
> > > > > madvised areas (also a subject of configuration). Wouldn't it make more
> > > > > sense to try harder for an explicit call like madvise?
> > > > >
> > > >
> > > > The reasoning is that the user has presumably configured system/vmas
> > > > to tell the kernel how badly they want thps, and so this call aligns
> > > > with current expectations. I.e. a user who goes to the trouble of
> > > > trying to fault-in a thp at a given memory address likely wants a thp
> > > > "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> > > > thp.
> > >
> > > If the syscall tries only as hard as the #PF doesn't that limit the
> > > functionality?
> >
> > I'd argue that the various allocation semantics possible through
> > existing thp knobs / vma flags, in addition to the proposed
> > MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
> > work with. Relatively speaking, in what way would we be lacking
> > functionality?
>
> Flexibility is definitely a plus but look at our existing configuration
> space and try to wrap your head around that.
>

:)

> > > I mean a non-#PF collapse can consume more resources to allocate
> > > and collapse a THP, as it won't inflict any measurable latency on the
> > > targeted process (except for potential CPU contention).
> >
> > Sorry, I'm not sure I understand this. What latency are we discussing
> > here? Do you mean to say that since MADV_COLLAPSE isn't in
> > the fault path, it doesn't necessarily need to be fast / direct
> > reclaim wouldn't be noticed?
>
> Exactly. Same as khugepaged. I would even argue that khugepaged and
> madvise had better behave consistently, because in both cases it is a
> remote operation to create THPs: one triggered automatically, the other
> explicitly requested by userspace. Having a third mode (for madvise)
> would add more to the configuration space, and thus more complexity.
> [...]

Got it. I combined this with the answer at the end.

> > > > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> > > >
> > > > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > > > inter-process protection for collapsing memory in another process'
> > > > address space (which a malevolent program could exploit to cause oom
> > > > conditions in another memcg hierarchy, for example), but we want
> > > > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > > > utilization as they wish.
> > >
> > > Could you expand some more please? How is this any different from
> > > khugepaged (well, except that you can trigger the collapsing explicitly
> > > rather than rely on khugepaged to find that mm)?
> > >
> >
> > MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
> > extend khugepaged in userspace, where the benefit is precisely that we
> > can choose that mm/vma more intelligently.
>
> Could you elaborate some more?
>

One idea from the original RFC was moving khugepaged to userspace[1].
Eventually, uhugepaged could be further augmented/informed with task
prioritization or runtime metrics to optimize THP utilization
system-wide by making the best use of the available THPs on the system
at any given point. This flag was partially motivated by wanting to
allow a first step (1) where khugepaged is replicated as-is, in
userspace.

The other motivation is just to provide users a choice w.r.t. how hard
to try for a THP. Abiding by khugepaged-like semantics would be the
default, but an informed user might have good reason to back memory
that is currently 90% swapped-out with THPs.

Perhaps for the initial series, we can forgo this flag for simplicity,
assume the user is informed, and ignore pte limits. We can revisit
this in the future if the need arises.
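For reference, the limits in question are applied per PMD range; a
condensed sketch follows (not khugepaged's actual code - the real scan
also enforces khugepaged_max_ptes_shared and several other checks; the
tunable names correspond to the max_ptes_{none,swap} sysfs knobs):

/*
 * Sketch: reject a PMD range for collapse if too many of its
 * HPAGE_PMD_NR ptes are empty or swapped out, mirroring
 * /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap].
 */
static bool within_pte_limits(pte_t *ptep)
{
        int none_or_zero = 0, swapped = 0, i;

        for (i = 0; i < HPAGE_PMD_NR; i++, ptep++) {
                pte_t pte = *ptep;

                if (pte_none(pte))
                        none_or_zero++;
                else if (!pte_present(pte))
                        swapped++;
        }
        return none_or_zero <= khugepaged_max_ptes_none &&
               swapped <= khugepaged_max_ptes_swap;
}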

> > > > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > > > to explicitly tell the kernel to try harder to back this by thps,
> > > > regardless of the current system/vma configuration.
> > > >
> > > > Note that when used together, these flags can be used to implement the
> > > > exact behavior of khugepaged, through MADV_COLLAPSE.
> > >
> > > IMHO this is stretching the interface and this can backfire in the
> > > future. The interface should be really trivial. I want to collapse a
> > > memory area. Let the kernel do the right thing and do not bother with
> > > all the implementation details. I would use the same allocation strategy
> > > as khugepaged, as this seems to be closest from the latency and
> > > application awareness POV. In a way you can look at the madvise call as
> > > a way to trigger khugepaged functionality on the particular memory range.
> >
> > Trying to summarize a few earlier comments centering around
> > MADV_F_COLLAPSE_DEFRAG and allocation semantics.
> >
> > This series presupposes the existence of an informed userspace agent
> > that is aware of what processes/memory ranges would benefit most from
> > thps. Such an agent might either be:
> > (1) A system-level daemon optimizing thp utilization system-wide
> > (2) A highly tuned process / malloc implementation optimizing their
> > own thp usage
> >
> > > The different types of agents reflect the divide between #PF and
> > DEFRAG semantics.
> >
> > For (1), we want to view this exactly like triggering khugepaged
> > functionality from userspace, and likely want DEFRAG semantics.
> >
> > For (2), I was viewing this as the "live" symmetric counterpart to
> > at-fault thp allocation where the process has decided, at runtime,
> > that this memory could benefit from thp backing, and so #PF semantics
> > > seemed like a sane default. I'd worry that using DEFRAG semantics by
> > default might deter adoption by users who might not be willing to wait
> > an unbounded amount of time for direct reclaim.
>
> This time is not really unbounded. Even in defrag mode, THP allocation
> doesn't try as hard as e.g. hugetlb allocations do.
>
> For your 2) category I am not really sure I see the point. Why would
> you want to rely on madvise in a lightweight allocation mode when this
> has already been done at #PF time? If an application really knows it
> wants to use THPs, then madvise(MADV_HUGEPAGE) would be the first thing
> to do. This would already tell the #PF path to try a bit harder in some
> configurations, and would let khugepaged know that collapsing this
> memory makes sense.
>

The primary motivation here is that at some point well after fault
time, the process may determine it would like the memory backed by
hugepages - but still prefer a lightweight allocation. A system
allocator is the canonical example: it might free memory via
MADV_DONTNEED, but at some later point want to MADV_COLLAPSE the
region again once it becomes heavily used; however, it wouldn't be
willing to tolerate reclaim to do so.
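To illustrate, a hedged userspace sketch of that allocator flow
(MADV_COLLAPSE is the mode proposed by this series and is not yet in
the uapi headers, so the value below is only a placeholder for
illustration):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* placeholder value, not finalized */
#endif

/* Give an idle arena back to the kernel; later faults repopulate it
 * with base pages. */
static void arena_trim(void *arena, size_t len)
{
        madvise(arena, len, MADV_DONTNEED);
}

/* Much later, once the arena is hot again, ask for THP backing without
 * being willing to pay for reclaim/compaction at fault time. */
static int arena_make_huge(void *arena, size_t len)
{
        return madvise(arena, len, MADV_COLLAPSE);
}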

> That being said, I would be really careful about providing an extended
> interface to control how hard to try to allocate a THP. This has a high
> risk of externalizing internal implementation details of how compaction
> works. Unless we have a strong real-life usecase I would go with the
> khugepaged semantics initially. Maybe we will learn about future
> usecases where a very lightweight allocation mode is required, but that
> can be added later on. The simpler the interface is initially, the
> better.
>
I understand and respect your thoughts here. I won't pretend to know
what the best* option is, but presumably having control over when to
allow reclaim was important enough to motivate our current, extensive
configuration space.

Without the option, we have no control from userspace. With it, we
may* have too much. Initially, I'll propose a simple interface that
defaults to whatever is in
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and we can
incrementally expand it if/when necessary.
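For completeness, the knob that default would follow can be inspected
trivially from userspace, e.g.:

#include <stdio.h>

int main(void)
{
        char buf[8] = "";
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/khugepaged/defrag",
                        "r");

        if (!f)
                return 1;
        if (fgets(buf, sizeof(buf), f))
                /* "1" means khugepaged (and the proposed MADV_COLLAPSE
                 * default) may enter direct reclaim/compaction. */
                printf("khugepaged defrag = %s", buf);
        fclose(f);
        return 0;
}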

> Thanks!
> --
> Michal Hocko
> SUSE Labs

Again, thanks for taking the time to read and discuss,

Zach

[1] https://lore.kernel.org/all/5127b9c-a147-8ef5-c942-ae8c755413d0@google.com/


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-03-30  0:37 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-08 21:34 [RFC PATCH 00/14] mm: userspace hugepage collapse Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 01/14] mm/rmap: add mm_find_pmd_raw helper Zach O'Keefe
2022-03-09 22:48   ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 02/14] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-03-09 22:53   ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 03/14] mm/khugepaged: add __do_collapse_huge_page() helper Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 04/14] mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 05/14] mm/khugepaged: add mmap_assert_locked() checks to scan_pmd() Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 06/14] mm/khugepaged: add hugepage_vma_revalidate_pmd_count() Zach O'Keefe
2022-03-09 23:15   ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count() Zach O'Keefe
2022-03-09 23:17   ` Yang Shi
2022-03-10  0:00     ` Zach O'Keefe
2022-03-10  0:41       ` Yang Shi
2022-03-10  1:09         ` Zach O'Keefe
2022-03-10  2:16           ` Yang Shi
2022-03-10 15:50             ` Zach O'Keefe
2022-03-10 18:17               ` Yang Shi
2022-03-10 18:46                 ` David Rientjes
2022-03-10 18:58                   ` Zach O'Keefe
2022-03-10 19:54                   ` Yang Shi
2022-03-10 20:24                     ` Zach O'Keefe
2022-03-10 18:53                 ` Zach O'Keefe
2022-03-10 15:56   ` David Hildenbrand
2022-03-10 18:39     ` Zach O'Keefe
2022-03-10 18:54     ` David Rientjes
2022-03-21 14:27       ` Michal Hocko
2022-03-08 21:34 ` [RFC PATCH 08/14] mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled() Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 09/14] mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP Zach O'Keefe
2022-03-09 23:40   ` Yang Shi
2022-03-10  0:46     ` Zach O'Keefe
2022-03-10  2:05       ` Yang Shi
2022-03-10  8:37         ` Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 10/14] mm/khugepaged: rename khugepaged-specific/not functions Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-03-09 23:43   ` Yang Shi
2022-03-10  1:11     ` Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 12/14] mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse Zach O'Keefe
2022-03-10  0:06   ` Yang Shi
2022-03-10 19:26     ` David Rientjes
2022-03-10 20:16       ` Matthew Wilcox
2022-03-11  0:06         ` Zach O'Keefe
2022-03-25 16:51           ` Zach O'Keefe
2022-03-25 19:54             ` Yang Shi
2022-03-08 21:34 ` [RFC PATCH 13/14] mm/madvise: add __madvise_collapse_*_batch() actions Zach O'Keefe
2022-03-08 21:34 ` [RFC PATCH 14/14] mm/madvise: add process_madvise(MADV_COLLAPSE) Zach O'Keefe
2022-03-21 14:32 ` [RFC PATCH 00/14] mm: userspace hugepage collapse Zi Yan
2022-03-21 14:51   ` Zach O'Keefe
2022-03-21 14:37 ` Michal Hocko
2022-03-21 15:46   ` Zach O'Keefe
2022-03-22 12:11     ` Michal Hocko
2022-03-22 15:53       ` Zach O'Keefe
2022-03-29 12:24         ` Michal Hocko
2022-03-30  0:36           ` Zach O'Keefe
2022-03-22  6:40 ` Zach O'Keefe
2022-03-22 12:05   ` Michal Hocko
2022-03-23 13:30     ` Zach O'Keefe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).