* [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE
@ 2022-09-07 14:45 Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

v3 Foreword

This version cleans up a few small issues in v2, expands selftest
coverage, rebases on some recent khugepaged changes and adds more details
to commit descriptions to help with review.

The three main cleanups made are:

(1)	Patch 2: In hpage_collapse_scan_file() and collapse_file(),
	don't use the xa_state.xa_index to determine if the
	HPAGE_PMD_ORDER THP is properly aligned.  Instead, check
	the compound_head(page)->index.  Not only is it better to not
	rely on internal data in struct xa_state (as the comments
	above the struct definition ask), but it is slightly more
	accurate / future-proof in case we encounter an unaligned
	compound page of order HPAGE_PMD_ORDER (AFAIK not possible today).
	Moreover, especially for hpage_collapse_scan_file() where the RCU
	lock might be dropped as we traverse the XArray, we want to
	be checking the compound_head(), since otherwise we might
	erroneously be looking at a tail page if a collapse happened from
	under us.

(2)	Patch 2: When hpage_collapse_scan_file() returns
	SCAN_PTE_MAPPED_HUGEPAGE in the khugepaged path, check that the pmd
	maps a pte table before adding the mm/address to the deferred
	collapse array.  The reason is that we will grab mmap_lock in write
	mode every time we attempt collapse_pte_mapped_thp(), so we should
	try to avoid this if possible.  This also prevents khugepaged
	from repeatedly adding the same mm/address pair to the deferred
	collapse array after the page cache has already been updated with
	the new hugepage, but before the memory has been refaulted.

(3)	Patch 3: In find_pmd_or_thp_or_none(), check pmd_none() instead of
	!pmd_present() when detecting pmds that have been cleared.  The
	reason this check exists is that MADV_COLLAPSE might be
	operating on memory which was already collapsed by khugepaged,
	but before the memory had been refaulted.  In this case, khugepaged
	cleared the pmd, and so the correct pmd entry to look for is the
	"none" pmd.
--------------------------------

v2 Foreword

Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from
kernel test robot.
--------------------------------

This series builds on top of the previous "mm: userspace hugepage collapse"
series which introduced the MADV_COLLAPSE madvise mode and added support
for private, anonymous mappings[1], by adding support for file and shmem
backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.

File and shmem support have been added with an effort to align with existing
MADV_COLLAPSE semantics and policy decisions[2].  Collapse of shmem-backed
memory ignores kernel-guiding directives and heuristics including all
sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
options (shmem always supports large folios).  Like anonymous mappings, on
successful return of MADV_COLLAPSE on file/shmem memory, the contents of
memory mapped by the addresses provided will be synchronously pmd-mapped
THPs.

This functionality unlocks two important uses:

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which might keep services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, both of which increase steady-state
	memory footprint.  Now, we can have the best of both worlds: Peak
	upfront performance and lower RAM footprints (a minimal usage
	sketch follows this list).

(2)	userfaultfd-based live migration of virtual machines satisfies UFFD
	faults by fetching native-sized pages over the network (to avoid the
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.
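
As a concrete illustration of use (1), a minimal userspace sketch.  The
file path and mapping length below are hypothetical, and MADV_COLLAPSE is
defined locally only as a fallback for installed uapi headers that
predate it:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* fallback; normally from <sys/mman.h> uapi */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;	/* hypothetical: one PMD-sized text extent */
		int fd = open("/usr/bin/some-service", O_RDONLY);	/* hypothetical path */
		void *text;

		if (fd < 0)
			return 1;
		text = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
		if (text == MAP_FAILED)
			return 1;

		/* Synchronously back the mapped text by a pmd-mapped THP. */
		if (madvise(text, len, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");

		munmap(text, len);
		close(fd);
		return 0;
	}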

khugepaged has received a small improvement by association and can now
detect and collapse pte-mapped THPs.  However, there is still work to be
done along the file collapse path.  Compound pages of arbitrary order still
need to be supported and THP collapse needs to be converted to using
folios in general.  Eventually, we'd like to move away from the read-only
and executable-mapped constraints currently imposed on eligible files and
support any inode claiming huge folio support.  That said, I think the
series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem
memory.

Patches 1-3	Implement the guts of the series.
Patch 4 	Adds a tracepoint for debugging.
Patches 5-9 	Refactor existing khugepaged selftests to work with new
		memory types + new collapse tests.
Patch 10 	Adds a userfaultfd selftest mode to mimic a functional test
		of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
		(v3 note: "userfaultfd shmem" selftest is failing as of
		Sep 5 mm-unstable)

Applies against mm-unstable.

[1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/

Previous versions:
v1: https://lore.kernel.org/linux-mm/20220812012843.3948330-1-zokeefe@google.com/
v2: https://lore.kernel.org/linux-mm/20220826220329.1495407-1-zokeefe@google.com/

v2 -> v3:
- The 3 changes mentioned in the v3 Foreword
- Drop redundant PageTransCompound() check in collapse_pte_mapped_thp() in
  "mm/madvise: add file and shmem support to MADV_COLLAPSE" (it is covered
  by PageHead() and hugepage_vma_check() for !HugeTLB).
- In "selftests/vm: add thp collapse file and tmpfs testing", don't assume
  the path used for file collapse testing will be on /dev/sda - instead, use
  the major/minor device numbers returned from stat(2) to traverse sysfs and
  find the correct block device (see the sketch after this changelog).  Also
  only do the stat()/statfs() checks on the user-supplied test directory once
  (instead of every time we create a test file).
- Added "selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared
  pmd" which tests a common case of MADV_COLLAPSE applied to file/shmem
  memory that has been "collapsed" (in the page cache) by khugepaged, but
  not yet refaulted by the process.
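
A hedged sketch of the device-lookup approach mentioned above; the
function name and error handling are illustrative, not the selftest's
actual code:

	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/sysmacros.h>

	/* Resolve the sysfs directory of the block device backing 'path'. */
	static int find_sysfs_blockdev(const char *path, char *out, size_t len)
	{
		struct stat st;

		if (stat(path, &st))
			return -1;
		/* /sys/dev/block/<major>:<minor> links to the device's sysfs dir. */
		snprintf(out, len, "/sys/dev/block/%u:%u",
			 major(st.st_dev), minor(st.st_dev));
		return 0;
	}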

v1 -> v2:
- Add missing definition for khugepaged_add_pte_mapped_thp() in
  !CONFIG_SHMEM builds, in "mm/khugepaged: attempt to map
  file/shmem-backed pte-mapped THPs by pmds"
- Minor bugfixes in "mm/madvise: add file and shmem support to
  MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some
  compiler settings.
- Rebased on latest mm-unstable

Zach O'Keefe (10):
  mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
  mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by
    pmds
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  selftests/vm: dedup THP helpers
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory

 include/linux/khugepaged.h                    |  13 +-
 include/linux/shmem_fs.h                      |  10 +-
 include/trace/events/huge_memory.h            |  36 +
 kernel/events/uprobes.c                       |   2 +-
 mm/huge_memory.c                              |   2 +-
 mm/khugepaged.c                               | 304 ++++--
 mm/shmem.c                                    |  18 +-
 tools/testing/selftests/vm/Makefile           |   2 +
 tools/testing/selftests/vm/khugepaged.c       | 904 +++++++++++++-----
 tools/testing/selftests/vm/soft-dirty.c       |   2 +-
 .../selftests/vm/split_huge_page_test.c       |  12 +-
 tools/testing/selftests/vm/userfaultfd.c      | 171 +++-
 tools/testing/selftests/vm/vm_util.c          |  36 +-
 tools/testing/selftests/vm/vm_util.h          |   5 +-
 14 files changed, 1143 insertions(+), 374 deletions(-)

-- 
2.37.2.789.g6183377224-goog




* [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-16 17:46   ` Yang Shi
  2022-09-07 14:45 ` [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Extend 'mm/thp: add flag to enforce sysfs THP in
hugepage_vma_check()' to shmem, allowing callers to ignore
/sys/kernel/mm/transparent_hugepage/shmem_enabled and the tmpfs huge=
mount option.

This is intended to be used by MADV_COLLAPSE, and the rationale is
analogous to the anon/file case: MADV_COLLAPSE is not coupled to
directives that advise the kernel's decisions on when THPs should be
considered eligible. shmem/tmpfs always claims large folio support,
regardless of sysfs or mount options.
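
A minimal userspace illustration of this semantic, assuming a
shmem-backed mapping via memfd; MADV_COLLAPSE is defined locally only as
a fallback for older installed uapi headers:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* fallback; normally from <sys/mman.h> uapi */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;			/* one PMD-sized extent */
		int fd = memfd_create("thp-demo", 0);	/* shmem-backed file */
		char *p;

		if (fd < 0 || ftruncate(fd, len))
			return 1;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, len);	/* populate the small pages */

		/* Collapses even with shmem_enabled=never and no huge= mount option. */
		if (madvise(p, len, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");
		return 0;
	}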

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/linux/shmem_fs.h | 10 ++++++----
 mm/huge_memory.c         |  2 +-
 mm/shmem.c               | 18 +++++++++---------
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f24071e3c826..d500ea967dc7 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -92,11 +92,13 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 int shmem_unuse(unsigned int type);
 
-extern bool shmem_is_huge(struct vm_area_struct *vma,
-			  struct inode *inode, pgoff_t index);
-static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
+extern bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
+			  pgoff_t index, bool shmem_huge_force);
+static inline bool shmem_huge_enabled(struct vm_area_struct *vma,
+				      bool shmem_huge_force)
 {
-	return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
+	return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff,
+			     shmem_huge_force);
 }
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7fa74b9749a6..53d170dac332 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -119,7 +119,7 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 	 * own flags.
 	 */
 	if (!in_pf && shmem_file(vma->vm_file))
-		return shmem_huge_enabled(vma);
+		return shmem_huge_enabled(vma, !enforce_sysfs);
 
 	/* Enforce sysfs THP requirements as necessary */
 	if (enforce_sysfs &&
diff --git a/mm/shmem.c b/mm/shmem.c
index 99b7341bd0bf..47c42c566fd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -461,20 +461,20 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-bool shmem_is_huge(struct vm_area_struct *vma,
-		   struct inode *inode, pgoff_t index)
+bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
+		   pgoff_t index, bool shmem_huge_force)
 {
 	loff_t i_size;
 
 	if (!S_ISREG(inode->i_mode))
 		return false;
-	if (shmem_huge == SHMEM_HUGE_DENY)
-		return false;
 	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
+	if (shmem_huge == SHMEM_HUGE_FORCE || shmem_huge_force)
 		return true;
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return false;
 
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
@@ -669,8 +669,8 @@ static long shmem_unused_huge_count(struct super_block *sb,
 
 #define shmem_huge SHMEM_HUGE_DENY
 
-bool shmem_is_huge(struct vm_area_struct *vma,
-		   struct inode *inode, pgoff_t index)
+bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
+		   pgoff_t index, bool shmem_huge_force)
 {
 	return false;
 }
@@ -1056,7 +1056,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
 			STATX_ATTR_NODUMP);
 	generic_fillattr(&init_user_ns, inode, stat);
 
-	if (shmem_is_huge(NULL, inode, 0))
+	if (shmem_is_huge(NULL, inode, 0, false))
 		stat->blksize = HPAGE_PMD_SIZE;
 
 	if (request_mask & STATX_BTIME) {
@@ -1888,7 +1888,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (!shmem_is_huge(vma, inode, index))
+	if (!shmem_is_huge(vma, inode, index, false))
 		goto alloc_nohuge;
 
 	huge_gfp = vma_thp_gfp_mask(vma);
-- 
2.37.2.789.g6183377224-goog




* [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-16 18:26   ` Yang Shi
  2022-09-07 14:45 ` [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

The main benefit of THPs is that they can be mapped at the pmd level,
increasing the likelihood of TLB hits and spending fewer cycles in page
table walks.  pte-mapped hugepages - that is, hugepage-aligned compound
pages of order HPAGE_PMD_ORDER mapped by ptes - although contiguous in
physical memory, don't have this advantage.  In fact, one
could argue they are detrimental to system performance overall since
they occupy a precious hugepage-aligned/sized region of physical memory
that could otherwise be used more effectively.  Additionally, pte-mapped
hugepages can be the cheapest memory to collapse for khugepaged since no
new hugepage allocation or copying of memory contents is necessary - we
only need to update the mapping page tables.

In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
effort when compound pages (of any order) are encountered.

Identify pte-mapped hugepages in the file/shmem collapse path.  The
final step of this identification makes a racy check of the pmd value to
ensure it maps a pte table.  This should be fine, since races that
result in a false positive (i.e. attempting collapse even though we
shouldn't) will fail later in collapse_pte_mapped_thp() once we actually
lock mmap_lock and reinspect the pmd value.  Races that result in false
negatives (i.e. where we decide not to attempt collapse, but should
have) shouldn't be an issue, since in the worst case, we do nothing -
which is what we've done up to this point.  We make a similar check in
retract_page_tables().  If we do think we've found a pte-mapped hugepage
in khugepaged context, attempt to update the page tables mapping this
hugepage.

Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple processes'
address spaces, the counter could be incremented once for each page table
update.  Since we
increment the counter when a pte-mapped hugepage is successfully added to
the list of to-collapse pte-mapped THPs, it's possible that we never
actually update the page table either.  This is different from how
file/shmem pages_collapsed accounting works today where only a successful
page cache update is counted (it's also possible here that no page tables
are actually changed).  Though it incurs some slop, this is preferred to
either not accounting for the event at all, or plumbing through data in
struct mm_slot on whether to account for the collapse or not.

Also note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 55392bf30a03..fbbb25494d60 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -17,6 +17,7 @@
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
 	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
+	EM( SCAN_PTE_MAPPED_HUGEPAGE,	"pte_mapped_hugepage")		\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 55c8625ed950..31ccf49cf279 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -35,6 +35,7 @@ enum scan_result {
 	SCAN_EXCEED_SHARED_PTE,
 	SCAN_PTE_NON_PRESENT,
 	SCAN_PTE_UFFD_WP,
+	SCAN_PTE_MAPPED_HUGEPAGE,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
  * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
  * khugepaged should try to collapse the page table.
  */
-static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 					  unsigned long addr)
 {
 	struct khugepaged_mm_slot *mm_slot;
 	struct mm_slot *slot;
+	bool ret = false;
 
 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
 
 	spin_lock(&khugepaged_mm_lock);
 	slot = mm_slot_lookup(mm_slots_hash, mm);
 	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
+	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
 		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
+		ret = true;
+	}
 	spin_unlock(&khugepaged_mm_lock);
+	return ret;
 }
 
 static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	pte_t *start_pte, *pte;
 	pmd_t *pmd;
 	spinlock_t *ptl;
-	int count = 0;
+	int count = 0, result = SCAN_FAIL;
 	int i;
 
+	mmap_assert_write_locked(mm);
+
+	/* Fast check before locking page if already PMD-mapped  */
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
+	if (result != SCAN_SUCCEED)
+		return;
+
 	if (!vma || !vma->vm_file ||
 	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
 		return;
@@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
 		/*
 		 * If file was truncated then extended, or hole-punched, before
 		 * we locked the first page, then a THP might be there already.
+		 * This will be discovered on the first iteration.
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			struct page *head = compound_head(page);
+
+			result = compound_order(head) == HPAGE_PMD_ORDER &&
+					head->index == start
+					/* Maybe PMD-mapped */
+					? SCAN_PTE_MAPPED_HUGEPAGE
+					: SCAN_PAGE_COMPOUND;
 			goto out_unlock;
 		}
 
@@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 		 * into a PMD sized page
 		 */
 		if (PageTransCompound(page)) {
-			result = SCAN_PAGE_COMPOUND;
+			struct page *head = compound_head(page);
+
+			result = compound_order(head) == HPAGE_PMD_ORDER &&
+					head->index == start
+					/* Maybe PMD-mapped */
+					? SCAN_PTE_MAPPED_HUGEPAGE
+					: SCAN_PAGE_COMPOUND;
+			/*
+			 * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
+			 * by the caller won't touch the page cache, and so
+			 * it's safe to skip LRU and refcount checks before
+			 * returning.
+			 */
 			break;
 		}
 
@@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
 {
 }
+
+static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+					  unsigned long addr)
+{
+	return false;
+}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 								  &mmap_locked,
 								  cc);
 			}
-			if (*result == SCAN_SUCCEED)
+			switch (*result) {
+			case SCAN_PTE_MAPPED_HUGEPAGE: {
+				pmd_t *pmd;
+
+				*result = find_pmd_or_thp_or_none(mm,
+								  khugepaged_scan.address,
+								  &pmd);
+				if (*result != SCAN_SUCCEED)
+					break;
+				if (!khugepaged_add_pte_mapped_thp(mm,
+								   khugepaged_scan.address))
+					break;
+			} fallthrough;
+			case SCAN_SUCCEED:
 				++khugepaged_pages_collapsed;
+				break;
+			default:
+				break;
+			}
+
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
 			progress += HPAGE_PMD_NR;
-- 
2.37.2.789.g6183377224-goog




* [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-16 20:38   ` Yang Shi
  2022-09-07 14:45 ` [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).

On success, the backing memory will be a hugepage.  For the memory range
and process provided, the page tables will synchronously have a huge pmd
installed, mapping the THP.  Other mappings of the file extent mapped by
the memory range may be added to a set of entries that khugepaged will
later process and attempt to update their page tables to map the THP by
a pmd.

This functionality unlocks two important uses:

(1)	Immediately back executable text by THPs.  Current support provided
	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
	system, which might keep services from serving at their full rated
	load after (re)starting.  Tricks like mremap(2)'ing text onto
	anonymous memory to immediately realize iTLB performance prevent
	page sharing and demand paging, both of which increase steady-state
	memory footprint.  Now, we can have the best of both worlds: Peak
	upfront performance and lower RAM footprints.

(2)	userfaultfd-based live migration of virtual machines satisfies UFFD
	faults by fetching native-sized pages over the network (to avoid the
	latency of transferring an entire hugepage).  However, after guest
	memory has been fully copied to the new host, MADV_COLLAPSE can
	be used to immediately increase guest performance.

Since khugepaged is single threaded, this change now introduces the
possibility of collapse contexts racing in the file collapse path.
There are a few important places to consider:

(1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
	We could have the memory collapsed out from under us, but
	the next xas_for_each() iteration will correctly pick up the
	hugepage.  The hugepage might not be up to date (insofar as
	copying of small page contents might not have completed - the
	page still may be locked), but regardless what small page index
	we were iterating over, we'll find the hugepage and identify it
	as a suitably aligned compound page of order HPAGE_PMD_ORDER.

	In the khugepaged path, we locklessly check the value of the pmd,
	and only add it to the deferred collapse array if we find the pmd
	mapping a pte table.  This is fine, since other values that could
	have raced in right afterwards denote failure, or that the
	memory was successfully collapsed, so we don't need further
	processing.

	In the madvise path, we'll take mmap_lock in write mode to serialize
	against page table updates and will know what to do based on the
	true value of the pmd: recheck all ptes if we point to a pte table,
	directly install the pmd if the pmd has been cleared but the memory
	not yet refaulted, or do nothing at all if we find a huge pmd.

	It's worth putting emphasis here on how we treat the none pmd
	here.  If khugepaged has processed this mm's page tables
	already, it will have left the pmd cleared (ready for refault by
	the process).  Depending on the VMA flags, sysfs settings, amount
	of RAM on the machine, and the current load, this could be a
	relatively common occurrence - and as such it is one we'd like to
	handle successfully in MADV_COLLAPSE.  When we see the none pmd
	in collapse_pte_mapped_thp(), we've locked mmap_lock in write
	and checked (a) hugepage_vma_check() to see if the backing
	memory is appropriate still, along with VMA sizing and
	appropriate hugepage alignment within the file, and (b) we've
	found a hugepage head of order HPAGE_PMD_ORDER at the offset
	in the file mapped by our hugepage-aligned virtual address.
	Even though the common case is likely a race with khugepaged,
	given these checks (regardless how we got here - we could be
	operating on a completely different file than originally checked
	in hpage_collapse_scan_file() for all we know) it should be safe
	to directly make the pmd a huge pmd pointing to this hugepage.

(2)	collapse_file() is mostly serialized on the same file extent by
	lock sequence:

		|	lock hugepage
		|		lock mapping->i_pages
		|			lock 1st page
		|		unlock mapping->i_pages
		|				<page checks>
		|		lock mapping->i_pages
		|				page_ref_freeze(3)
		|				xas_store(hugepage)
		|		unlock mapping->i_pages
		|				page_ref_unfreeze(1)
		|			unlock 1st page
		V	unlock hugepage

	Once a context (which already has its fresh hugepage locked)
	locks mapping->i_pages exclusively, it will hold said lock
	until it locks the first page, and it will hold that lock until
	after the hugepage has been added to the page cache (and
	will unlock the hugepage after page table update, though that
	isn't important here).

	A racing context that loses the race for mapping->i_pages will
	then lose the race to locking the first page.  Here - depending
	on how far the other racing context has gotten - we might find
	the new hugepage (in which case we'll exit cleanly when we
	check PageTransCompound()), or we'll find the "old" 1st small
	page (in which case we'll exit cleanly when we discover an
	unexpected refcount of 2 after isolate_lru_page()).  This assumes
	we are able to successfully lock the page we find - in the shmem
	path, we could just fail the trylock and exit cleanly anyway.

	The failure path in collapse_file() is similar: once we hold the
	lock on the 1st small page, we are serialized against other
	collapse contexts.  Before the 1st small page is unlocked, we add
	it back to the page cache and unfreeze the refcount appropriately.
	Contexts that lost the race to the 1st small page will then find
	the same 1st small page with the correct refcount and will be
	able to proceed.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/linux/khugepaged.h         |  13 +-
 include/trace/events/huge_memory.h |   1 +
 kernel/events/uprobes.c            |   2 +-
 mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
 4 files changed, 194 insertions(+), 60 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 384f034ae947..70162d707caf 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
 				 unsigned long vm_flags);
 extern void khugepaged_min_free_kbytes_update(void);
 #ifdef CONFIG_SHMEM
-extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
+extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
+				   bool install_pmd);
 #else
-static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
-					   unsigned long addr)
+static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
+					  unsigned long addr, bool install_pmd)
 {
+	return 0;
 }
 #endif
 
@@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
 					unsigned long vm_flags)
 {
 }
-static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
-					   unsigned long addr)
+static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
+					  unsigned long addr, bool install_pmd)
 {
+	return 0;
 }
 
 static inline void khugepaged_min_free_kbytes_update(void)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index fbbb25494d60..df33453b70fc 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -11,6 +11,7 @@
 	EM( SCAN_FAIL,			"failed")			\
 	EM( SCAN_SUCCEED,		"succeeded")			\
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
+	EM( SCAN_PMD_NONE,		"pmd_none")			\
 	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e0a9b945e7bc..d9e357b7e17c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	/* try collapse pmd for compound page */
 	if (!ret && orig_page_huge)
-		collapse_pte_mapped_thp(mm, vaddr);
+		collapse_pte_mapped_thp(mm, vaddr, false);
 
 	return ret;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 31ccf49cf279..66457a06b4e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
 	SCAN_FAIL,
 	SCAN_SUCCEED,
 	SCAN_PMD_NULL,
+	SCAN_PMD_NONE,
 	SCAN_PMD_MAPPED,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
@@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
 				cc->is_khugepaged))
 		return SCAN_VMA_CHECK;
+	return SCAN_SUCCEED;
+}
+
+static int hugepage_vma_revalidate_anon(struct mm_struct *mm,
+					unsigned long address,
+					struct vm_area_struct **vmap,
+					struct collapse_control *cc)
+{
+	int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
+
+	if (ret != SCAN_SUCCEED)
+		return ret;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
@@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	 * hugepage_vma_check may return true for qualified file
 	 * vmas.
 	 */
-	if (!vma->anon_vma || !vma_is_anonymous(vma))
-		return SCAN_VMA_CHECK;
+	if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
+		return SCAN_PAGE_ANON;
 	return SCAN_SUCCEED;
 }
 
@@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
 	barrier();
 #endif
-	if (!pmd_present(pmde))
-		return SCAN_PMD_NULL;
+	if (pmd_none(pmde))
+		return SCAN_PMD_NONE;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
 	if (pmd_bad(pmde))
@@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma, cc);
+	result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, &vma, cc);
+	result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	slot = mm_slot_lookup(mm_slots_hash, mm);
 	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
 	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
+		int i;
+		/*
+		 * Multiple callers may be adding entries here.  Do a quick
+		 * check to see the entry hasn't already been added by someone
+		 * else.
+		 */
+		for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
+			if (mm_slot->pte_mapped_thp[i] == addr)
+				goto out;
 		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
 		ret = true;
 	}
+out:
 	spin_unlock(&khugepaged_mm_lock);
 	return ret;
 }
 
+/* hpage must be locked, and mmap_lock must be held in write */
+static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
+			pmd_t *pmdp, struct page *hpage)
+{
+	struct vm_fault vmf = {
+		.vma = vma,
+		.address = addr,
+		.flags = 0,
+		.pmd = pmdp,
+	};
+
+	VM_BUG_ON(!PageTransHuge(hpage));
+	mmap_assert_write_locked(vma->vm_mm);
+
+	if (do_set_pmd(&vmf, hpage))
+		return SCAN_FAIL;
+
+	get_page(hpage);
+	return SCAN_SUCCEED;
+}
+
 static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t *pmdp)
 {
@@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
  *
  * @mm: process address space where collapse happens
  * @addr: THP collapse address
+ * @install_pmd: If a huge PMD should be installed
  *
  * This function checks whether all the PTEs in the PMD are pointing to the
  * right THP. If so, retract the page table so the THP can refault in with
- * as pmd-mapped.
+ * as pmd-mapped. Possibly install a huge PMD mapping the THP.
  */
-void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
+int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
+			    bool install_pmd)
 {
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
@@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 
 	/* Fast check before locking page if already PMD-mapped  */
 	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
-	if (result != SCAN_SUCCEED)
-		return;
+	if (result == SCAN_PMD_MAPPED)
+		return result;
 
 	if (!vma || !vma->vm_file ||
 	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-		return;
+		return SCAN_VMA_CHECK;
 
 	/*
 	 * If we are here, we've succeeded in replacing all the native pages
@@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	 * analogously elide sysfs THP settings here.
 	 */
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
-		return;
+		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
 	if (userfaultfd_wp(vma))
-		return;
+		return SCAN_PTE_UFFD_WP;
 
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
-		return;
+		return SCAN_PAGE_NULL;
 
-	if (!PageHead(hpage))
+	if (!PageHead(hpage)) {
+		result = SCAN_FAIL;
 		goto drop_hpage;
+	}
 
-	if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
+	switch (result) {
+	case SCAN_SUCCEED:
+		break;
+	case SCAN_PMD_NONE:
+		/*
+		 * In MADV_COLLAPSE path, possible race with khugepaged where
+		 * all pte entries have been removed and pmd cleared.  If so,
+		 * skip all the pte checks and just update the pmd mapping.
+		 */
+		goto maybe_install_pmd;
+	default:
 		goto drop_hpage;
+	}
 
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	result = SCAN_FAIL;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 			continue;
 
 		/* page swapped out, abort */
-		if (!pte_present(*pte))
+		if (!pte_present(*pte)) {
+			result = SCAN_PTE_NON_PRESENT;
 			goto abort;
+		}
 
 		page = vm_normal_page(vma, addr, *pte);
 		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
@@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
 	}
 
-	/* step 4: collapse pmd */
+	/* step 4: remove pte entries */
 	collapse_and_free_pmd(mm, vma, haddr, pmd);
+
+maybe_install_pmd:
+	/* step 5: install pmd entry */
+	result = install_pmd
+			? set_huge_pmd(vma, haddr, pmd, hpage)
+			: SCAN_SUCCEED;
+
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
-	return;
+	return result;
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
@@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 		goto out;
 
 	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
+		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
 
 out:
 	mm_slot->nr_pte_mapped_thp = 0;
 	mmap_write_unlock(mm);
 }
 
-static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
+static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
+			       struct mm_struct *target_mm,
+			       unsigned long target_addr, struct page *hpage,
+			       struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long addr;
-	pmd_t *pmd;
+	int target_result = SCAN_FAIL;
 
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		int result = SCAN_FAIL;
+		struct mm_struct *mm = NULL;
+		unsigned long addr = 0;
+		pmd_t *pmd;
+		bool is_target = false;
+
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
 		 * got written to. These VMAs are likely not worth investing
@@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * ptl. It has higher chance to recover THP for the VMA, but
 		 * has higher cost too.
 		 */
-		if (vma->anon_vma)
-			continue;
+		if (vma->anon_vma) {
+			result = SCAN_PAGE_ANON;
+			goto next;
+		}
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		if (addr & ~HPAGE_PMD_MASK)
-			continue;
-		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
-			continue;
+		if (addr & ~HPAGE_PMD_MASK ||
+		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
+			result = SCAN_VMA_CHECK;
+			goto next;
+		}
 		mm = vma->vm_mm;
-		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
-			continue;
+		is_target = mm == target_mm && addr == target_addr;
+		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
+		if (result != SCAN_SUCCEED)
+			goto next;
 		/*
 		 * We need exclusive mmap_lock to retract page table.
 		 *
 		 * We use trylock due to lock inversion: we need to acquire
 		 * mmap_lock while holding page lock. Fault path does it in
 		 * reverse order. Trylock is a way to avoid deadlock.
+		 *
+		 * Also, it's not MADV_COLLAPSE's job to collapse other
+		 * mappings - let khugepaged take care of them later.
 		 */
-		if (mmap_write_trylock(mm)) {
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
+		if ((cc->is_khugepaged || is_target) &&
+		    mmap_write_trylock(mm)) {
 			/*
 			 * When a vma is registered with uffd-wp, we can't
 			 * recycle the pmd pgtable because there can be pte
@@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			 * it'll always mapped in small page size for uffd-wp
 			 * registered ranges.
 			 */
-			if (!hpage_collapse_test_exit(mm) &&
-			    !userfaultfd_wp(vma))
-				collapse_and_free_pmd(mm, vma, addr, pmd);
+			if (hpage_collapse_test_exit(mm)) {
+				result = SCAN_ANY_PROCESS;
+				goto unlock_next;
+			}
+			if (userfaultfd_wp(vma)) {
+				result = SCAN_PTE_UFFD_WP;
+				goto unlock_next;
+			}
+			collapse_and_free_pmd(mm, vma, addr, pmd);
+			if (!cc->is_khugepaged && is_target)
+				result = set_huge_pmd(vma, addr, pmd, hpage);
+			else
+				result = SCAN_SUCCEED;
+
+unlock_next:
 			mmap_write_unlock(mm);
-		} else {
-			/* Try again later */
+			goto next;
+		}
+		/*
+		 * Calling context will handle target mm/addr. Otherwise, let
+		 * khugepaged try again later.
+		 */
+		if (!is_target) {
 			khugepaged_add_pte_mapped_thp(mm, addr);
+			continue;
 		}
+next:
+		if (is_target)
+			target_result = result;
 	}
 	i_mmap_unlock_write(mapping);
+	return target_result;
 }
 
 /**
  * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
  *
  * @mm: process address space where collapse happens
+ * @addr: virtual collapse start address
  * @file: file that collapse on
  * @start: collapse start address
  * @cc: collapse context and scratchpad
@@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
  *    + restore gaps in the page cache;
  *    + unlock and free huge page;
  */
-static int collapse_file(struct mm_struct *mm, struct file *file,
-			 pgoff_t start, struct collapse_control *cc)
+static int collapse_file(struct mm_struct *mm, unsigned long addr,
+			 struct file *file, pgoff_t start,
+			 struct collapse_control *cc)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *hpage;
@@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
 		/*
 		 * Remove pte page tables, so we can re-fault the page as huge.
 		 */
-		retract_page_tables(mapping, start);
+		result = retract_page_tables(mapping, start, mm, addr, hpage,
+					     cc);
 		unlock_page(hpage);
 		hpage = NULL;
 	} else {
@@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
 	return result;
 }
 
-static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				pgoff_t start, struct collapse_control *cc)
+static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+				    struct file *file, pgoff_t start,
+				    struct collapse_control *cc)
 {
 	struct page *page = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-			result = collapse_file(mm, file, start, cc);
+			result = collapse_file(mm, addr, file, start, cc);
 		}
 	}
 
@@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
 	return result;
 }
 #else
-static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
-				pgoff_t start, struct collapse_control *cc)
+static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+				    struct file *file, pgoff_t start,
+				    struct collapse_control *cc)
 {
 	BUILD_BUG();
 }
@@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				*result = khugepaged_scan_file(mm, file, pgoff,
-							       cc);
+				*result = hpage_collapse_scan_file(mm,
+								   khugepaged_scan.address,
+								   file, pgoff, cc);
 				mmap_locked = false;
 				fput(file);
 			} else {
@@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	*prev = vma;
 
-	/* TODO: Support file/shmem */
-	if (!vma->anon_vma || !vma_is_anonymous(vma))
-		return -EINVAL;
-
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
 		return -EINVAL;
 
@@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		}
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
-		result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
-						 cc);
+		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
+			struct file *file = get_file(vma->vm_file);
+			pgoff_t pgoff = linear_page_index(vma, addr);
+
+			mmap_read_unlock(mm);
+			mmap_locked = false;
+			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
+							  cc);
+			fput(file);
+		} else {
+			result = hpage_collapse_scan_pmd(mm, vma, addr,
+							 &mmap_locked, cc);
+		}
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
+handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
+		case SCAN_PTE_MAPPED_HUGEPAGE:
+			BUG_ON(mmap_locked);
+			BUG_ON(*prev);
+			mmap_write_lock(mm);
+			result = collapse_pte_mapped_thp(mm, addr, true);
+			mmap_write_unlock(mm);
+			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
-- 
2.37.2.789.g6183377224-goog




* [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (2 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-16 20:41   ` Yang Shi
  2022-09-07 14:45 ` [PATCH mm-unstable v3 05/10] selftests/vm: dedup THP helpers Zach O'Keefe
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
While this change is targeted at debugging the MADV_COLLAPSE pathway,
the "mm_khugepaged" prefix is retained for symmetry with
huge_memory:trace_mm_khugepaged_scan_pmd, which keeps its legacy name
to avoid changing the kernel ABI as much as possible.
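
To exercise the new tracepoint, it can be enabled through tracefs; a
minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (on
some systems it lives at /sys/kernel/debug/tracing instead):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *enable =
			"/sys/kernel/tracing/events/huge_memory/mm_khugepaged_scan_file/enable";
		int fd = open(enable, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, "1", 1) != 1)
			perror("write");
		close(fd);
		/* Emitted events then show up in /sys/kernel/tracing/trace. */
		return 0;
	}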

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/trace/events/huge_memory.h | 34 ++++++++++++++++++++++++++++++
 mm/khugepaged.c                    |  3 ++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index df33453b70fc..935af4947917 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -169,5 +169,39 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->ret)
 );
 
+TRACE_EVENT(mm_khugepaged_scan_file,
+
+	TP_PROTO(struct mm_struct *mm, struct page *page, const char *filename,
+		 int present, int swap, int result),
+
+	TP_ARGS(mm, page, filename, present, swap, result),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, pfn)
+		__string(filename, filename)
+		__field(int, present)
+		__field(int, swap)
+		__field(int, result)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->pfn = page ? page_to_pfn(page) : -1;
+		__assign_str(filename, filename);
+		__entry->present = present;
+		__entry->swap = swap;
+		__entry->result = result;
+	),
+
+	TP_printk("mm=%p, scan_pfn=0x%lx, filename=%s, present=%d, swap=%d, result=%s",
+		__entry->mm,
+		__entry->pfn,
+		__get_str(filename),
+		__entry->present,
+		__entry->swap,
+		__print_symbolic(__entry->result, SCAN_STATUS))
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 66457a06b4e7..9325aec25abc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2152,7 +2152,8 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 	}
 
-	/* TODO: tracepoints */
+	trace_mm_khugepaged_scan_file(mm, page, file->f_path.dentry->d_iname,
+				      present, swap, result);
 	return result;
 }
 #else
-- 
2.37.2.789.g6183377224-goog




* [PATCH mm-unstable v3 05/10] selftests/vm: dedup THP helpers
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (3 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 06/10] selftests/vm: modularize thp collapse memory operations Zach O'Keefe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

These files:

tools/testing/selftests/vm/vm_util.c
tools/testing/selftests/vm/khugepaged.c

Both contain logic to:

1) Determine hugepage size on current system
2) Read /proc/self/smaps to determine number of THPs at an address

Refactor selftests/vm/khugepaged.c to use the vm_util common helpers
and add it as a build dependency.

Since selftests/vm/khugepaged.c is the largest user of check_huge(),
change the signature of check_huge() to match
selftests/vm/khugepaged.c's useage: take an expected number of
hugepages, and return a bool indicating if the correct number of
hugepages were found.  Add a wrapper, check_huge_anon(), in anticipation
of checking smaps for file and shmem hugepages.

Update existing callsites to use the new pattern / function.

Likewise, check_for_pattern() was duplicated, and it's a general enough
helper to include in vm_util helpers as well.
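
For illustration, a minimal sketch of how a test might use the shared
helpers after this refactor; the allocation details below are
illustrative and not taken from the selftests themselves:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#include "vm_util.h"

	int main(void)
	{
		uint64_t hpage_size = read_pmd_pagesize();
		char *p;

		/* Illustrative: a fresh anonymous VMA that THP may back. */
		p = mmap(NULL, hpage_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, hpage_size, MADV_HUGEPAGE);
		memset(p, 1, hpage_size);	/* fault the region in */

		/* True iff smaps reports exactly one AnonHugePages THP for this VMA. */
		printf("THP backed: %d\n", check_huge_anon(p, 1, hpage_size));
		munmap(p, hpage_size);
		return 0;
	}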

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/Makefile           |  1 +
 tools/testing/selftests/vm/khugepaged.c       | 64 ++-----------------
 tools/testing/selftests/vm/soft-dirty.c       |  2 +-
 .../selftests/vm/split_huge_page_test.c       | 12 ++--
 tools/testing/selftests/vm/vm_util.c          | 26 +++++---
 tools/testing/selftests/vm/vm_util.h          |  3 +-
 6 files changed, 32 insertions(+), 76 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 4ae879f70f4c..c9c0996c122b 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -95,6 +95,7 @@ TEST_FILES += va_128TBswitch.sh
 
 include ../lib.mk
 
+$(OUTPUT)/khugepaged: vm_util.c
 $(OUTPUT)/madv_populate: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index b77b1e28cdb3..e5c602f7a18b 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -11,6 +11,8 @@
 #include <sys/mman.h>
 #include <sys/wait.h>
 
+#include "vm_util.h"
+
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
@@ -351,64 +353,12 @@ static void save_settings(void)
 	signal(SIGQUIT, restore_settings);
 }
 
-#define MAX_LINE_LENGTH 500
-
-static bool check_for_pattern(FILE *fp, char *pattern, char *buf)
-{
-	while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
-		if (!strncmp(buf, pattern, strlen(pattern)))
-			return true;
-	}
-	return false;
-}
-
 static bool check_huge(void *addr, int nr_hpages)
 {
-	bool thp = false;
-	int ret;
-	FILE *fp;
-	char buffer[MAX_LINE_LENGTH];
-	char addr_pattern[MAX_LINE_LENGTH];
-
-	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-",
-		       (unsigned long) addr);
-	if (ret >= MAX_LINE_LENGTH) {
-		printf("%s: Pattern is too long\n", __func__);
-		exit(EXIT_FAILURE);
-	}
-
-
-	fp = fopen(PID_SMAPS, "r");
-	if (!fp) {
-		printf("%s: Failed to open file %s\n", __func__, PID_SMAPS);
-		exit(EXIT_FAILURE);
-	}
-	if (!check_for_pattern(fp, addr_pattern, buffer))
-		goto err_out;
-
-	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "AnonHugePages:%10ld kB",
-		       nr_hpages * (hpage_pmd_size >> 10));
-	if (ret >= MAX_LINE_LENGTH) {
-		printf("%s: Pattern is too long\n", __func__);
-		exit(EXIT_FAILURE);
-	}
-	/*
-	 * Fetch the AnonHugePages: in the same block and check whether it got
-	 * the expected number of hugeepages next.
-	 */
-	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
-		goto err_out;
-
-	if (strncmp(buffer, addr_pattern, strlen(addr_pattern)))
-		goto err_out;
-
-	thp = true;
-err_out:
-	fclose(fp);
-	return thp;
+	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
 }
 
-
+#define MAX_LINE_LENGTH 500
 static bool check_swap(void *addr, unsigned long size)
 {
 	bool swap = false;
@@ -430,7 +380,7 @@ static bool check_swap(void *addr, unsigned long size)
 		printf("%s: Failed to open file %s\n", __func__, PID_SMAPS);
 		exit(EXIT_FAILURE);
 	}
-	if (!check_for_pattern(fp, addr_pattern, buffer))
+	if (!check_for_pattern(fp, addr_pattern, buffer, sizeof(buffer)))
 		goto err_out;
 
 	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "Swap:%19ld kB",
@@ -443,7 +393,7 @@ static bool check_swap(void *addr, unsigned long size)
 	 * Fetch the Swap: in the same block and check whether it got
 	 * the expected number of hugeepages next.
 	 */
-	if (!check_for_pattern(fp, "Swap:", buffer))
+	if (!check_for_pattern(fp, "Swap:", buffer, sizeof(buffer)))
 		goto err_out;
 
 	if (strncmp(buffer, addr_pattern, strlen(addr_pattern)))
@@ -1045,7 +995,7 @@ int main(int argc, const char **argv)
 	setbuf(stdout, NULL);
 
 	page_size = getpagesize();
-	hpage_pmd_size = read_num("hpage_pmd_size");
+	hpage_pmd_size = read_pmd_pagesize();
 	hpage_pmd_nr = hpage_pmd_size / page_size;
 
 	default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
diff --git a/tools/testing/selftests/vm/soft-dirty.c b/tools/testing/selftests/vm/soft-dirty.c
index e3a43f5d4fa2..21d8830c5f24 100644
--- a/tools/testing/selftests/vm/soft-dirty.c
+++ b/tools/testing/selftests/vm/soft-dirty.c
@@ -91,7 +91,7 @@ static void test_hugepage(int pagemap_fd, int pagesize)
 	for (i = 0; i < hpage_len; i++)
 		map[i] = (char)i;
 
-	if (check_huge(map)) {
+	if (check_huge_anon(map, 1, hpage_len)) {
 		ksft_test_result_pass("Test %s huge page allocation\n", __func__);
 
 		clear_softdirty();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
index 6aa2b8253aed..76e1c36dd9e5 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -92,7 +92,6 @@ void split_pmd_thp(void)
 {
 	char *one_page;
 	size_t len = 4 * pmd_pagesize;
-	uint64_t thp_size;
 	size_t i;
 
 	one_page = memalign(pmd_pagesize, len);
@@ -107,8 +106,7 @@ void split_pmd_thp(void)
 	for (i = 0; i < len; i++)
 		one_page[i] = (char)i;
 
-	thp_size = check_huge(one_page);
-	if (!thp_size) {
+	if (!check_huge_anon(one_page, 1, pmd_pagesize)) {
 		printf("No THP is allocated\n");
 		exit(EXIT_FAILURE);
 	}
@@ -124,9 +122,8 @@ void split_pmd_thp(void)
 		}
 
 
-	thp_size = check_huge(one_page);
-	if (thp_size) {
-		printf("Still %ld kB AnonHugePages not split\n", thp_size);
+	if (check_huge_anon(one_page, 0, pmd_pagesize)) {
+		printf("Still AnonHugePages not split\n");
 		exit(EXIT_FAILURE);
 	}
 
@@ -172,8 +169,7 @@ void split_pte_mapped_thp(void)
 	for (i = 0; i < len; i++)
 		one_page[i] = (char)i;
 
-	thp_size = check_huge(one_page);
-	if (!thp_size) {
+	if (!check_huge_anon(one_page, 1, pmd_pagesize)) {
 		printf("No THP is allocated\n");
 		exit(EXIT_FAILURE);
 	}
diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c
index b58ab11a7a30..9dae51b8219f 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -42,9 +42,9 @@ void clear_softdirty(void)
 		ksft_exit_fail_msg("writing clear_refs failed\n");
 }
 
-static bool check_for_pattern(FILE *fp, const char *pattern, char *buf)
+bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len)
 {
-	while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
+	while (fgets(buf, len, fp)) {
 		if (!strncmp(buf, pattern, strlen(pattern)))
 			return true;
 	}
@@ -72,9 +72,10 @@ uint64_t read_pmd_pagesize(void)
 	return strtoul(buf, NULL, 10);
 }
 
-uint64_t check_huge(void *addr)
+bool __check_huge(void *addr, char *pattern, int nr_hpages,
+		  uint64_t hpage_size)
 {
-	uint64_t thp = 0;
+	uint64_t thp = -1;
 	int ret;
 	FILE *fp;
 	char buffer[MAX_LINE_LENGTH];
@@ -89,20 +90,27 @@ uint64_t check_huge(void *addr)
 	if (!fp)
 		ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, SMAP_FILE_PATH);
 
-	if (!check_for_pattern(fp, addr_pattern, buffer))
+	if (!check_for_pattern(fp, addr_pattern, buffer, sizeof(buffer)))
 		goto err_out;
 
 	/*
-	 * Fetch the AnonHugePages: in the same block and check the number of
+	 * Fetch the pattern in the same block and check the number of
 	 * hugepages.
 	 */
-	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
+	if (!check_for_pattern(fp, pattern, buffer, sizeof(buffer)))
 		goto err_out;
 
-	if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1)
+	snprintf(addr_pattern, MAX_LINE_LENGTH, "%s%%9ld kB", pattern);
+
+	if (sscanf(buffer, addr_pattern, &thp) != 1)
 		ksft_exit_fail_msg("Reading smap error\n");
 
 err_out:
 	fclose(fp);
-	return thp;
+	return thp == (nr_hpages * (hpage_size >> 10));
+}
+
+bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size)
+{
+	return __check_huge(addr, "AnonHugePages: ", nr_hpages, hpage_size);
 }
diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h
index 2e512bd57ae1..8434ea0c95cd 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -5,5 +5,6 @@
 uint64_t pagemap_get_entry(int fd, char *start);
 bool pagemap_is_softdirty(int fd, char *start);
 void clear_softdirty(void);
+bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
 uint64_t read_pmd_pagesize(void);
-uint64_t check_huge(void *addr);
+bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH mm-unstable v3 06/10] selftests/vm: modularize thp collapse memory operations
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (4 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 05/10] selftests/vm: dedup THP helpers Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 07/10] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Modularize the operations to set up, clean up, fault, and check for
huge pages, for a given memory type.  This allows reusing existing
tests with additional memory types by defining new memory operations.
The following patches will add file and shmem memory types.
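
A rough sketch (not part of this patch; the "example" names are
hypothetical, and the helpers referenced are the ones already present
in khugepaged.c) of how a new memory type plugs in: supply the four
operations and hand the struct to the existing tests.

	/* Illustration only: behaves like the anonymous backend below. */
	static void *example_setup_area(int nr_hpages)
	{
		return alloc_mapping(nr_hpages);
	}

	static void example_cleanup_area(void *p, unsigned long size)
	{
		munmap(p, size);
	}

	static void example_fault(void *p, unsigned long start,
				  unsigned long end)
	{
		fill_memory(p, start, end);
	}

	static bool example_check_huge(void *addr, int nr_hpages)
	{
		return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
	}

	static struct mem_ops example_ops = {
		.setup_area = &example_setup_area,
		.cleanup_area = &example_cleanup_area,
		.fault = &example_fault,
		.check_huge = &example_check_huge,
	};

	/* Any existing test can then be run against it, e.g.: */
	collapse_full(&c, &example_ops);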

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 366 +++++++++++++-----------
 1 file changed, 200 insertions(+), 166 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index e5c602f7a18b..b4b1709507a5 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -28,8 +28,16 @@ static int hpage_pmd_nr;
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
 
+struct mem_ops {
+	void *(*setup_area)(int nr_hpages);
+	void (*cleanup_area)(void *p, unsigned long size);
+	void (*fault)(void *p, unsigned long start, unsigned long end);
+	bool (*check_huge)(void *addr, int nr_hpages);
+};
+
 struct collapse_context {
-	void (*collapse)(const char *msg, char *p, int nr_hpages, bool expect);
+	void (*collapse)(const char *msg, char *p, int nr_hpages,
+			 struct mem_ops *ops, bool expect);
 	bool enforce_pte_scan_limits;
 };
 
@@ -353,11 +361,6 @@ static void save_settings(void)
 	signal(SIGQUIT, restore_settings);
 }
 
-static bool check_huge(void *addr, int nr_hpages)
-{
-	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
-}
-
 #define MAX_LINE_LENGTH 500
 static bool check_swap(void *addr, unsigned long size)
 {
@@ -431,18 +434,25 @@ static void fill_memory(int *p, unsigned long start, unsigned long end)
  * Returns pmd-mapped hugepage in VMA marked VM_HUGEPAGE, filled with
  * validate_memory()'able contents.
  */
-static void *alloc_hpage(void)
+static void *alloc_hpage(struct mem_ops *ops)
 {
-	void *p;
+	void *p = ops->setup_area(1);
 
-	p = alloc_mapping(1);
+	ops->fault(p, 0, hpage_pmd_size);
+	if (madvise(p, hpage_pmd_size, MADV_HUGEPAGE)) {
+		perror("madvise(MADV_HUGEPAGE)");
+		exit(EXIT_FAILURE);
+	}
 	printf("Allocate huge page...");
-	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p, 1))
-		success("OK");
-	else
-		fail("Fail");
+	if (madvise(p, hpage_pmd_size, MADV_COLLAPSE)) {
+		perror("madvise(MADV_COLLAPSE)");
+		exit(EXIT_FAILURE);
+	}
+	if (!ops->check_huge(p, 1)) {
+		perror("madvise(MADV_COLLAPSE)");
+		exit(EXIT_FAILURE);
+	}
+	success("OK");
 	return p;
 }
 
@@ -459,18 +469,40 @@ static void validate_memory(int *p, unsigned long start, unsigned long end)
 	}
 }
 
-static void madvise_collapse(const char *msg, char *p, int nr_hpages,
-			     bool expect)
+static void *anon_setup_area(int nr_hpages)
+{
+	return alloc_mapping(nr_hpages);
+}
+
+static void anon_cleanup_area(void *p, unsigned long size)
+{
+	munmap(p, size);
+}
+
+static void anon_fault(void *p, unsigned long start, unsigned long end)
+{
+	fill_memory(p, start, end);
+}
+
+static bool anon_check_huge(void *addr, int nr_hpages)
+{
+	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
+}
+
+static struct mem_ops anon_ops = {
+	.setup_area = &anon_setup_area,
+	.cleanup_area = &anon_cleanup_area,
+	.fault = &anon_fault,
+	.check_huge = &anon_check_huge,
+};
+
+static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
+			       struct mem_ops *ops, bool expect)
 {
 	int ret;
 	struct settings settings = *current_settings();
 
 	printf("%s...", msg);
-	/* Sanity check */
-	if (!check_huge(p, 0)) {
-		printf("Unexpected huge page\n");
-		exit(EXIT_FAILURE);
-	}
 
 	/*
 	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
@@ -482,9 +514,10 @@ static void madvise_collapse(const char *msg, char *p, int nr_hpages,
 	/* Clear VM_NOHUGEPAGE */
 	madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE);
 	ret = madvise(p, nr_hpages * hpage_pmd_size, MADV_COLLAPSE);
+
 	if (((bool)ret) == expect)
 		fail("Fail: Bad return value");
-	else if (check_huge(p, nr_hpages) != expect)
+	else if (!ops->check_huge(p, expect ? nr_hpages : 0))
 		fail("Fail: check_huge()");
 	else
 		success("OK");
@@ -492,14 +525,26 @@ static void madvise_collapse(const char *msg, char *p, int nr_hpages,
 	pop_settings();
 }
 
+static void madvise_collapse(const char *msg, char *p, int nr_hpages,
+			     struct mem_ops *ops, bool expect)
+{
+	/* Sanity check */
+	if (!ops->check_huge(p, 0)) {
+		printf("Unexpected huge page\n");
+		exit(EXIT_FAILURE);
+	}
+	__madvise_collapse(msg, p, nr_hpages, ops, expect);
+}
+
 #define TICK 500000
-static bool wait_for_scan(const char *msg, char *p, int nr_hpages)
+static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
+			  struct mem_ops *ops)
 {
 	int full_scans;
 	int timeout = 6; /* 3 seconds */
 
 	/* Sanity check */
-	if (!check_huge(p, 0)) {
+	if (!ops->check_huge(p, 0)) {
 		printf("Unexpected huge page\n");
 		exit(EXIT_FAILURE);
 	}
@@ -511,7 +556,7 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages)
 
 	printf("%s...", msg);
 	while (timeout--) {
-		if (check_huge(p, nr_hpages))
+		if (ops->check_huge(p, nr_hpages))
 			break;
 		if (read_num("khugepaged/full_scans") >= full_scans)
 			break;
@@ -525,19 +570,20 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages)
 }
 
 static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
-				bool expect)
+				struct mem_ops *ops, bool expect)
 {
-	if (wait_for_scan(msg, p, nr_hpages)) {
+	if (wait_for_scan(msg, p, nr_hpages, ops)) {
 		if (expect)
 			fail("Timeout");
 		else
 			success("OK");
 		return;
-	} else if (check_huge(p, nr_hpages) == expect) {
+	}
+
+	if (ops->check_huge(p, expect ? nr_hpages : 0))
 		success("OK");
-	} else {
+	else
 		fail("Fail");
-	}
 }
 
 static void alloc_at_fault(void)
@@ -551,7 +597,7 @@ static void alloc_at_fault(void)
 	p = alloc_mapping(1);
 	*p = 1;
 	printf("Allocate huge page on fault...");
-	if (check_huge(p, 1))
+	if (check_huge_anon(p, 1, hpage_pmd_size))
 		success("OK");
 	else
 		fail("Fail");
@@ -560,49 +606,48 @@ static void alloc_at_fault(void)
 
 	madvise(p, page_size, MADV_DONTNEED);
 	printf("Split huge PMD on MADV_DONTNEED...");
-	if (check_huge(p, 0))
+	if (check_huge_anon(p, 0, hpage_pmd_size))
 		success("OK");
 	else
 		fail("Fail");
 	munmap(p, hpage_pmd_size);
 }
 
-static void collapse_full(struct collapse_context *c)
+static void collapse_full(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 	int nr_hpages = 4;
 	unsigned long size = nr_hpages * hpage_pmd_size;
 
-	p = alloc_mapping(nr_hpages);
-	fill_memory(p, 0, size);
+	p = ops->setup_area(nr_hpages);
+	ops->fault(p, 0, size);
 	c->collapse("Collapse multiple fully populated PTE table", p, nr_hpages,
-		    true);
+		    ops, true);
 	validate_memory(p, 0, size);
-	munmap(p, size);
+	ops->cleanup_area(p, size);
 }
 
-static void collapse_empty(struct collapse_context *c)
+static void collapse_empty(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 
-	p = alloc_mapping(1);
-	c->collapse("Do not collapse empty PTE table", p, 1, false);
-	munmap(p, hpage_pmd_size);
+	p = ops->setup_area(1);
+	c->collapse("Do not collapse empty PTE table", p, 1, ops, false);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry(struct collapse_context *c)
+static void collapse_single_pte_entry(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 
-	p = alloc_mapping(1);
-	fill_memory(p, 0, page_size);
+	p = ops->setup_area(1);
+	ops->fault(p, 0, page_size);
 	c->collapse("Collapse PTE table with single PTE entry present", p,
-		    1, true);
-	validate_memory(p, 0, page_size);
-	munmap(p, hpage_pmd_size);
+		    1, ops, true);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_none(struct collapse_context *c)
+static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *ops)
 {
 	int max_ptes_none = hpage_pmd_nr / 2;
 	struct settings settings = *current_settings();
@@ -611,30 +656,30 @@ static void collapse_max_ptes_none(struct collapse_context *c)
 	settings.khugepaged.max_ptes_none = max_ptes_none;
 	push_settings(&settings);
 
-	p = alloc_mapping(1);
+	p = ops->setup_area(1);
 
-	fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
+	ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 	c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1,
-		    !c->enforce_pte_scan_limits);
+		    ops, !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 
 	if (c->enforce_pte_scan_limits) {
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
-		c->collapse("Collapse with max_ptes_none PTEs empty", p, 1,
+		ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+		c->collapse("Collapse with max_ptes_none PTEs empty", p, 1, ops,
 			    true);
 		validate_memory(p, 0,
 				(hpage_pmd_nr - max_ptes_none) * page_size);
 	}
-
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 	pop_settings();
 }
 
-static void collapse_swapin_single_pte(struct collapse_context *c)
+static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
-	p = alloc_mapping(1);
-	fill_memory(p, 0, hpage_pmd_size);
+
+	p = ops->setup_area(1);
+	ops->fault(p, 0, hpage_pmd_size);
 
 	printf("Swapout one page...");
 	if (madvise(p, page_size, MADV_PAGEOUT)) {
@@ -648,20 +693,21 @@ static void collapse_swapin_single_pte(struct collapse_context *c)
 		goto out;
 	}
 
-	c->collapse("Collapse with swapping in single PTE entry", p, 1, true);
+	c->collapse("Collapse with swapping in single PTE entry", p, 1, ops,
+		    true);
 	validate_memory(p, 0, hpage_pmd_size);
 out:
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_swap(struct collapse_context *c)
+static void collapse_max_ptes_swap(struct collapse_context *c, struct mem_ops *ops)
 {
 	int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
 	void *p;
 
-	p = alloc_mapping(1);
+	p = ops->setup_area(1);
+	ops->fault(p, 0, hpage_pmd_size);
 
-	fill_memory(p, 0, hpage_pmd_size);
 	printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr);
 	if (madvise(p, (max_ptes_swap + 1) * page_size, MADV_PAGEOUT)) {
 		perror("madvise(MADV_PAGEOUT)");
@@ -674,12 +720,12 @@ static void collapse_max_ptes_swap(struct collapse_context *c)
 		goto out;
 	}
 
-	c->collapse("Maybe collapse with max_ptes_swap exceeded", p, 1,
+	c->collapse("Maybe collapse with max_ptes_swap exceeded", p, 1, ops,
 		    !c->enforce_pte_scan_limits);
 	validate_memory(p, 0, hpage_pmd_size);
 
 	if (c->enforce_pte_scan_limits) {
-		fill_memory(p, 0, hpage_pmd_size);
+		ops->fault(p, 0, hpage_pmd_size);
 		printf("Swapout %d of %d pages...", max_ptes_swap,
 		       hpage_pmd_nr);
 		if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
@@ -694,63 +740,65 @@ static void collapse_max_ptes_swap(struct collapse_context *c)
 		}
 
 		c->collapse("Collapse with max_ptes_swap pages swapped out", p,
-			    1, true);
+			    1, ops, true);
 		validate_memory(p, 0, hpage_pmd_size);
 	}
 out:
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_single_pte_entry_compound(struct collapse_context *c)
+static void collapse_single_pte_entry_compound(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 
-	p = alloc_hpage();
+	p = alloc_hpage(ops);
+
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
 	printf("Split huge page leaving single PTE mapping compound page...");
 	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
-	if (check_huge(p, 0))
+	if (ops->check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 
 	c->collapse("Collapse PTE table with single PTE mapping compound page",
-		    p, 1, true);
+		    p, 1, ops, true);
 	validate_memory(p, 0, page_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_full_of_compound(struct collapse_context *c)
+static void collapse_full_of_compound(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 
-	p = alloc_hpage();
+	p = alloc_hpage(ops);
 	printf("Split huge page leaving single PTE page table full of compound pages...");
 	madvise(p, page_size, MADV_NOHUGEPAGE);
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-	if (check_huge(p, 0))
+	if (ops->check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 
-	c->collapse("Collapse PTE table full of compound pages", p, 1, true);
+	c->collapse("Collapse PTE table full of compound pages", p, 1, ops,
+		    true);
 	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_compound_extreme(struct collapse_context *c)
+static void collapse_compound_extreme(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;
 	int i;
 
-	p = alloc_mapping(1);
+	p = ops->setup_area(1);
 	for (i = 0; i < hpage_pmd_nr; i++) {
 		printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...",
 				i + 1, hpage_pmd_nr);
 
 		madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE);
-		fill_memory(BASE_ADDR, 0, hpage_pmd_size);
-		if (!check_huge(BASE_ADDR, 1)) {
+		ops->fault(BASE_ADDR, 0, hpage_pmd_size);
+		if (!ops->check_huge(BASE_ADDR, 1)) {
 			printf("Failed to allocate huge page\n");
 			exit(EXIT_FAILURE);
 		}
@@ -777,30 +825,30 @@ static void collapse_compound_extreme(struct collapse_context *c)
 		}
 	}
 
-	munmap(BASE_ADDR, hpage_pmd_size);
-	fill_memory(p, 0, hpage_pmd_size);
-	if (check_huge(p, 0))
+	ops->cleanup_area(BASE_ADDR, hpage_pmd_size);
+	ops->fault(p, 0, hpage_pmd_size);
+	if (!ops->check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
 
 	c->collapse("Collapse PTE table full of different compound pages", p, 1,
-		    true);
+		    ops, true);
 
 	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_fork(struct collapse_context *c)
+static void collapse_fork(struct collapse_context *c, struct mem_ops *ops)
 {
 	int wstatus;
 	void *p;
 
-	p = alloc_mapping(1);
+	p = ops->setup_area(1);
 
 	printf("Allocate small page...");
-	fill_memory(p, 0, page_size);
-	if (check_huge(p, 0))
+	ops->fault(p, 0, page_size);
+	if (ops->check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
@@ -811,17 +859,17 @@ static void collapse_fork(struct collapse_context *c)
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (check_huge(p, 0))
+		if (ops->check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
 
-		fill_memory(p, page_size, 2 * page_size);
+		ops->fault(p, page_size, 2 * page_size);
 		c->collapse("Collapse PTE table with single page shared with parent process",
-			    p, 1, true);
+			    p, 1, ops, true);
 
 		validate_memory(p, 0, page_size);
-		munmap(p, hpage_pmd_size);
+		ops->cleanup_area(p, hpage_pmd_size);
 		exit(exit_status);
 	}
 
@@ -829,27 +877,27 @@ static void collapse_fork(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has small page...");
-	if (check_huge(p, 0))
+	if (ops->check_huge(p, 0))
 		success("OK");
 	else
 		fail("Fail");
 	validate_memory(p, 0, page_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_fork_compound(struct collapse_context *c)
+static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *ops)
 {
 	int wstatus;
 	void *p;
 
-	p = alloc_hpage();
+	p = alloc_hpage(ops);
 	printf("Share huge page over fork()...");
 	if (!fork()) {
 		/* Do not touch settings on child exit */
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (check_huge(p, 1))
+		if (ops->check_huge(p, 1))
 			success("OK");
 		else
 			fail("Fail");
@@ -857,20 +905,20 @@ static void collapse_fork_compound(struct collapse_context *c)
 		printf("Split huge page PMD in child process...");
 		madvise(p, page_size, MADV_NOHUGEPAGE);
 		madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
-		if (check_huge(p, 0))
+		if (ops->check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
-		fill_memory(p, 0, page_size);
+		ops->fault(p, 0, page_size);
 
 		write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
 		c->collapse("Collapse PTE table full of compound pages in child",
-			    p, 1, true);
+			    p, 1, ops, true);
 		write_num("khugepaged/max_ptes_shared",
 			  current_settings()->khugepaged.max_ptes_shared);
 
 		validate_memory(p, 0, hpage_pmd_size);
-		munmap(p, hpage_pmd_size);
+		ops->cleanup_area(p, hpage_pmd_size);
 		exit(exit_status);
 	}
 
@@ -878,59 +926,59 @@ static void collapse_fork_compound(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has huge page...");
-	if (check_huge(p, 1))
+	if (ops->check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
 	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void collapse_max_ptes_shared(struct collapse_context *c)
+static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
 {
 	int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
 	int wstatus;
 	void *p;
 
-	p = alloc_hpage();
+	p = alloc_hpage(ops);
 	printf("Share huge page over fork()...");
 	if (!fork()) {
 		/* Do not touch settings on child exit */
 		skip_settings_restore = true;
 		exit_status = 0;
 
-		if (check_huge(p, 1))
+		if (ops->check_huge(p, 1))
 			success("OK");
 		else
 			fail("Fail");
 
 		printf("Trigger CoW on page %d of %d...",
 				hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
-		fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
-		if (check_huge(p, 0))
+		ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+		if (ops->check_huge(p, 0))
 			success("OK");
 		else
 			fail("Fail");
 
 		c->collapse("Maybe collapse with max_ptes_shared exceeded", p,
-			    1, !c->enforce_pte_scan_limits);
+			    1, ops, !c->enforce_pte_scan_limits);
 
 		if (c->enforce_pte_scan_limits) {
 			printf("Trigger CoW on page %d of %d...",
 			       hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
-			fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) *
+			ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared) *
 				    page_size);
-			if (check_huge(p, 0))
+			if (ops->check_huge(p, 0))
 				success("OK");
 			else
 				fail("Fail");
 
 			c->collapse("Collapse with max_ptes_shared PTEs shared",
-				    p, 1,  true);
+				    p, 1, ops, true);
 		}
 
 		validate_memory(p, 0, hpage_pmd_size);
-		munmap(p, hpage_pmd_size);
+		ops->cleanup_area(p, hpage_pmd_size);
 		exit(exit_status);
 	}
 
@@ -938,42 +986,28 @@ static void collapse_max_ptes_shared(struct collapse_context *c)
 	exit_status += WEXITSTATUS(wstatus);
 
 	printf("Check if parent still has huge page...");
-	if (check_huge(p, 1))
+	if (ops->check_huge(p, 1))
 		success("OK");
 	else
 		fail("Fail");
 	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
-static void madvise_collapse_existing_thps(void)
+static void madvise_collapse_existing_thps(struct collapse_context *c,
+					   struct mem_ops *ops)
 {
 	void *p;
-	int err;
 
-	p = alloc_mapping(1);
-	fill_memory(p, 0, hpage_pmd_size);
+	p = ops->setup_area(1);
+	ops->fault(p, 0, hpage_pmd_size);
+	c->collapse("Collapse fully populated PTE table...", p, 1, ops, true);
+	validate_memory(p, 0, hpage_pmd_size);
 
-	printf("Collapse fully populated PTE table...");
-	/*
-	 * Note that we don't set MADV_HUGEPAGE here, which
-	 * also tests that VM_HUGEPAGE isn't required for
-	 * MADV_COLLAPSE in "madvise" mode.
-	 */
-	err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
-	if (err == 0 && check_huge(p, 1)) {
-		success("OK");
-		printf("Re-collapse PMD-mapped hugepage");
-		err = madvise(p, hpage_pmd_size, MADV_COLLAPSE);
-		if (err == 0 && check_huge(p, 1))
-			success("OK");
-		else
-			fail("Fail");
-	} else {
-		fail("Fail");
-	}
+	/* c->collapse() will find a hugepage and complain - call directly. */
+	__madvise_collapse("Re-collapse PMD-mapped hugepage", p, 1, ops, true);
 	validate_memory(p, 0, hpage_pmd_size);
-	munmap(p, hpage_pmd_size);
+	ops->cleanup_area(p, hpage_pmd_size);
 }
 
 int main(int argc, const char **argv)
@@ -1013,37 +1047,37 @@ int main(int argc, const char **argv)
 		c.collapse = &khugepaged_collapse;
 		c.enforce_pte_scan_limits = true;
 
-		collapse_full(&c);
-		collapse_empty(&c);
-		collapse_single_pte_entry(&c);
-		collapse_max_ptes_none(&c);
-		collapse_swapin_single_pte(&c);
-		collapse_max_ptes_swap(&c);
-		collapse_single_pte_entry_compound(&c);
-		collapse_full_of_compound(&c);
-		collapse_compound_extreme(&c);
-		collapse_fork(&c);
-		collapse_fork_compound(&c);
-		collapse_max_ptes_shared(&c);
+		collapse_full(&c, &anon_ops);
+		collapse_empty(&c, &anon_ops);
+		collapse_single_pte_entry(&c, &anon_ops);
+		collapse_max_ptes_none(&c, &anon_ops);
+		collapse_swapin_single_pte(&c, &anon_ops);
+		collapse_max_ptes_swap(&c, &anon_ops);
+		collapse_single_pte_entry_compound(&c, &anon_ops);
+		collapse_full_of_compound(&c, &anon_ops);
+		collapse_compound_extreme(&c, &anon_ops);
+		collapse_fork(&c, &anon_ops);
+		collapse_fork_compound(&c, &anon_ops);
+		collapse_max_ptes_shared(&c, &anon_ops);
 	}
 	if (!strcmp(tests, "madvise") || !strcmp(tests, "all")) {
 		printf("\n*** Testing context: madvise ***\n");
 		c.collapse = &madvise_collapse;
 		c.enforce_pte_scan_limits = false;
 
-		collapse_full(&c);
-		collapse_empty(&c);
-		collapse_single_pte_entry(&c);
-		collapse_max_ptes_none(&c);
-		collapse_swapin_single_pte(&c);
-		collapse_max_ptes_swap(&c);
-		collapse_single_pte_entry_compound(&c);
-		collapse_full_of_compound(&c);
-		collapse_compound_extreme(&c);
-		collapse_fork(&c);
-		collapse_fork_compound(&c);
-		collapse_max_ptes_shared(&c);
-		madvise_collapse_existing_thps();
+		collapse_full(&c, &anon_ops);
+		collapse_empty(&c, &anon_ops);
+		collapse_single_pte_entry(&c, &anon_ops);
+		collapse_max_ptes_none(&c, &anon_ops);
+		collapse_swapin_single_pte(&c, &anon_ops);
+		collapse_max_ptes_swap(&c, &anon_ops);
+		collapse_single_pte_entry_compound(&c, &anon_ops);
+		collapse_full_of_compound(&c, &anon_ops);
+		collapse_compound_extreme(&c, &anon_ops);
+		collapse_fork(&c, &anon_ops);
+		collapse_fork_compound(&c, &anon_ops);
+		collapse_max_ptes_shared(&c, &anon_ops);
+		madvise_collapse_existing_thps(&c, &anon_ops);
 	}
 
 	restore_settings(0);
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH mm-unstable v3 07/10] selftests/vm: add thp collapse file and tmpfs testing
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (5 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 06/10] selftests/vm: modularize thp collapse memory operations Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 08/10] selftests/vm: add thp collapse shmem testing Zach O'Keefe
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add memory operations for file-backed and tmpfs memory.  Call existing
tests with these new memory operations to test collapse functionality
of khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory.  Not
all tests are reusable; collapse_swapin_single_pte(), for example,
checks swap usage.

Refactor test arguments.  Usage is now:

Usage: ./khugepaged <test type> [dir]

        <test type>     : <context>:<mem_type>
        <context>       : [all|khugepaged|madvise]
        <mem_type>      : [all|anon|file]

        "file,all" mem_type requires [dir] argument

        "file,all" mem_type requires kernel built with
        CONFIG_READ_ONLY_THP_FOR_FS=y

        if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be
        mounted with huge=madvise option for khugepaged tests to work
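
For example (directory path purely illustrative):

	# Run every context against every memory type; [dir] is required
	# because the file tests are included.
	./khugepaged all:all /mnt/thp_test

	# Run only the khugepaged tests against anonymous memory.
	./khugepaged khugepaged:anon

Running with no arguments keeps the previous behaviour and tests
anonymous memory only.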

Refactor how tests are called to make it clear which collapse contexts /
memory operations they support, but only invoke the tests requested by
the user.  Also log which test is being run, and with what context /
memory, to make test logs more human readable.

A new test file is created and deleted for every test to ensure no pages
remain in the page cache between tests (tests may also attempt to
collapse different amounts of memory).

For file-backed memory where the file is stored on a block device,
disable /sys/block/<device>/queue/read_ahead_kb so that pages don't find
their way into the page cache without the tests faulting them in.
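
The manual equivalent, for a hypothetical device sda backing [dir],
would be:

	echo 0 > /sys/block/sda/queue/read_ahead_kb

The test resolves the owning device itself and restores the previous
value on exit.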

Add file and shmem wrappers to vm_util to check for file and shmem
hugepages in smaps.
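
For reference, these wrappers scan the VMA's smaps entry for fields of
the following form (the sizes shown are just an example for a single
PMD-sized hugepage):

	FilePmdMapped:      2048 kB
	ShmemPmdMapped:     2048 kB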

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 473 +++++++++++++++++++++---
 tools/testing/selftests/vm/vm_util.c    |  10 +
 tools/testing/selftests/vm/vm_util.h    |   2 +
 3 files changed, 429 insertions(+), 56 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index b4b1709507a5..59f56a329f43 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -1,6 +1,8 @@
 #define _GNU_SOURCE
+#include <ctype.h>
 #include <fcntl.h>
 #include <limits.h>
+#include <dirent.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -10,12 +12,21 @@
 
 #include <sys/mman.h>
 #include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+#include <sys/vfs.h>
+
+#include "linux/magic.h"
 
 #include "vm_util.h"
 
 #ifndef MADV_PAGEOUT
 #define MADV_PAGEOUT 21
 #endif
+#ifndef MADV_POPULATE_READ
+#define MADV_POPULATE_READ 22
+#endif
 #ifndef MADV_COLLAPSE
 #define MADV_COLLAPSE 25
 #endif
@@ -27,20 +38,47 @@ static int hpage_pmd_nr;
 
 #define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
 #define PID_SMAPS "/proc/self/smaps"
+#define TEST_FILE "collapse_test_file"
+
+#define MAX_LINE_LENGTH 500
+
+enum vma_type {
+	VMA_ANON,
+	VMA_FILE,
+	VMA_SHMEM,
+};
 
 struct mem_ops {
 	void *(*setup_area)(int nr_hpages);
 	void (*cleanup_area)(void *p, unsigned long size);
 	void (*fault)(void *p, unsigned long start, unsigned long end);
 	bool (*check_huge)(void *addr, int nr_hpages);
+	const char *name;
 };
 
+static struct mem_ops *file_ops;
+static struct mem_ops *anon_ops;
+
 struct collapse_context {
 	void (*collapse)(const char *msg, char *p, int nr_hpages,
 			 struct mem_ops *ops, bool expect);
 	bool enforce_pte_scan_limits;
+	const char *name;
+};
+
+static struct collapse_context *khugepaged_context;
+static struct collapse_context *madvise_context;
+
+struct file_info {
+	const char *dir;
+	char path[PATH_MAX];
+	enum vma_type type;
+	int fd;
+	char dev_queue_read_ahead_path[PATH_MAX];
 };
 
+static struct file_info finfo;
+
 enum thp_enabled {
 	THP_ALWAYS,
 	THP_MADVISE,
@@ -106,6 +144,7 @@ struct settings {
 	enum shmem_enabled shmem_enabled;
 	bool use_zero_page;
 	struct khugepaged_settings khugepaged;
+	unsigned long read_ahead_kb;
 };
 
 static struct settings saved_settings;
@@ -124,6 +163,11 @@ static void fail(const char *msg)
 	exit_status++;
 }
 
+static void skip(const char *msg)
+{
+	printf(" \e[33m%s\e[0m\n", msg);
+}
+
 static int read_file(const char *path, char *buf, size_t buflen)
 {
 	int fd;
@@ -151,13 +195,19 @@ static int write_file(const char *path, const char *buf, size_t buflen)
 	ssize_t numwritten;
 
 	fd = open(path, O_WRONLY);
-	if (fd == -1)
+	if (fd == -1) {
+		printf("open(%s)\n", path);
+		exit(EXIT_FAILURE);
 		return 0;
+	}
 
 	numwritten = write(fd, buf, buflen - 1);
 	close(fd);
-	if (numwritten < 1)
+	if (numwritten < 1) {
+		printf("write(%s)\n", buf);
+		exit(EXIT_FAILURE);
 		return 0;
+	}
 
 	return (unsigned int) numwritten;
 }
@@ -224,20 +274,11 @@ static void write_string(const char *name, const char *val)
 	}
 }
 
-static const unsigned long read_num(const char *name)
+static const unsigned long _read_num(const char *path)
 {
-	char path[PATH_MAX];
 	char buf[21];
-	int ret;
 
-	ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
-	if (ret >= PATH_MAX) {
-		printf("%s: Pathname is too long\n", __func__);
-		exit(EXIT_FAILURE);
-	}
-
-	ret = read_file(path, buf, sizeof(buf));
-	if (ret < 0) {
+	if (read_file(path, buf, sizeof(buf)) < 0) {
 		perror("read_file(read_num)");
 		exit(EXIT_FAILURE);
 	}
@@ -245,10 +286,9 @@ static const unsigned long read_num(const char *name)
 	return strtoul(buf, NULL, 10);
 }
 
-static void write_num(const char *name, unsigned long num)
+static const unsigned long read_num(const char *name)
 {
 	char path[PATH_MAX];
-	char buf[21];
 	int ret;
 
 	ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
@@ -256,6 +296,12 @@ static void write_num(const char *name, unsigned long num)
 		printf("%s: Pathname is too long\n", __func__);
 		exit(EXIT_FAILURE);
 	}
+	return _read_num(path);
+}
+
+static void _write_num(const char *path, unsigned long num)
+{
+	char buf[21];
 
 	sprintf(buf, "%ld", num);
 	if (!write_file(path, buf, strlen(buf) + 1)) {
@@ -264,6 +310,19 @@ static void write_num(const char *name, unsigned long num)
 	}
 }
 
+static void write_num(const char *name, unsigned long num)
+{
+	char path[PATH_MAX];
+	int ret;
+
+	ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+	if (ret >= PATH_MAX) {
+		printf("%s: Pathname is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+	_write_num(path, num);
+}
+
 static void write_settings(struct settings *settings)
 {
 	struct khugepaged_settings *khugepaged = &settings->khugepaged;
@@ -283,6 +342,10 @@ static void write_settings(struct settings *settings)
 	write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
 	write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
 	write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
+
+	if (file_ops && finfo.type == VMA_FILE)
+		_write_num(finfo.dev_queue_read_ahead_path,
+			   settings->read_ahead_kb);
 }
 
 #define MAX_SETTINGS_DEPTH 4
@@ -353,6 +416,10 @@ static void save_settings(void)
 		.max_ptes_shared = read_num("khugepaged/max_ptes_shared"),
 		.pages_to_scan = read_num("khugepaged/pages_to_scan"),
 	};
+	if (file_ops && finfo.type == VMA_FILE)
+		saved_settings.read_ahead_kb =
+				_read_num(finfo.dev_queue_read_ahead_path);
+
 	success("OK");
 
 	signal(SIGTERM, restore_settings);
@@ -361,7 +428,88 @@ static void save_settings(void)
 	signal(SIGQUIT, restore_settings);
 }
 
-#define MAX_LINE_LENGTH 500
+static void get_finfo(const char *dir)
+{
+	struct stat path_stat;
+	struct statfs fs;
+	char buf[1 << 10];
+	char path[PATH_MAX];
+	char *str, *end;
+
+	finfo.dir = dir;
+	stat(finfo.dir, &path_stat);
+	if (!S_ISDIR(path_stat.st_mode)) {
+		printf("%s: Not a directory (%s)\n", __func__, finfo.dir);
+		exit(EXIT_FAILURE);
+	}
+	if (snprintf(finfo.path, sizeof(finfo.path), "%s/" TEST_FILE,
+		     finfo.dir) >= sizeof(finfo.path)) {
+		printf("%s: Pathname is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+	if (statfs(finfo.dir, &fs)) {
+		perror("statfs()");
+		exit(EXIT_FAILURE);
+	}
+	finfo.type = fs.f_type == TMPFS_MAGIC ? VMA_SHMEM : VMA_FILE;
+
+	/* Find owning device's queue/read_ahead_kb control */
+	if (snprintf(path, sizeof(path), "/sys/dev/block/%d:%d/uevent",
+		     major(path_stat.st_dev), minor(path_stat.st_dev))
+	    >= sizeof(path)) {
+		printf("%s: Pathname is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+	if (read_file(path, buf, sizeof(buf)) < 0) {
+		perror("read_file(read_num)");
+		exit(EXIT_FAILURE);
+	}
+	if (strstr(buf, "DEVTYPE=disk")) {
+		/* Found it */
+		if (snprintf(finfo.dev_queue_read_ahead_path,
+			     sizeof(finfo.dev_queue_read_ahead_path),
+			     "/sys/dev/block/%d:%d/queue/read_ahead_kb",
+			     major(path_stat.st_dev), minor(path_stat.st_dev))
+		    >= sizeof(finfo.dev_queue_read_ahead_path)) {
+			printf("%s: Pathname is too long\n", __func__);
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+	if (!strstr(buf, "DEVTYPE=partition")) {
+		printf("%s: Unknown device type: %s\n", __func__, path);
+		exit(EXIT_FAILURE);
+	}
+	/*
+	 * Partition of block device - need to find actual device.
+	 * Using naming convention that devnameN is partition of
+	 * device devname.
+	 */
+	str = strstr(buf, "DEVNAME=");
+	if (!str) {
+		printf("%s: Could not read: %s", __func__, path);
+		exit(EXIT_FAILURE);
+	}
+	str += 8;
+	end = str;
+	while (*end) {
+		if (isdigit(*end)) {
+			*end = '\0';
+			if (snprintf(finfo.dev_queue_read_ahead_path,
+				     sizeof(finfo.dev_queue_read_ahead_path),
+				     "/sys/block/%s/queue/read_ahead_kb",
+				     str) >= sizeof(finfo.dev_queue_read_ahead_path)) {
+				printf("%s: Pathname is too long\n", __func__);
+				exit(EXIT_FAILURE);
+			}
+			return;
+		}
+		++end;
+	}
+	printf("%s: Could not read: %s\n", __func__, path);
+	exit(EXIT_FAILURE);
+}
+
 static bool check_swap(void *addr, unsigned long size)
 {
 	bool swap = false;
@@ -489,11 +637,91 @@ static bool anon_check_huge(void *addr, int nr_hpages)
 	return check_huge_anon(addr, nr_hpages, hpage_pmd_size);
 }
 
-static struct mem_ops anon_ops = {
+static void *file_setup_area(int nr_hpages)
+{
+	int fd;
+	void *p;
+	unsigned long size;
+
+	unlink(finfo.path);  /* Cleanup from previous failed tests */
+	printf("Creating %s for collapse%s...", finfo.path,
+	       finfo.type == VMA_SHMEM ? " (tmpfs)" : "");
+	fd = open(finfo.path, O_DSYNC | O_CREAT | O_RDWR | O_TRUNC | O_EXCL,
+		  777);
+	if (fd < 0) {
+		perror("open()");
+		exit(EXIT_FAILURE);
+	}
+
+	size = nr_hpages * hpage_pmd_size;
+	p = alloc_mapping(nr_hpages);
+	fill_memory(p, 0, size);
+	write(fd, p, size);
+	close(fd);
+	munmap(p, size);
+	success("OK");
+
+	printf("Opening %s read only for collapse...", finfo.path);
+	finfo.fd = open(finfo.path, O_RDONLY, 777);
+	if (finfo.fd < 0) {
+		perror("open()");
+		exit(EXIT_FAILURE);
+	}
+	p = mmap(BASE_ADDR, size, PROT_READ | PROT_EXEC,
+		 MAP_PRIVATE, finfo.fd, 0);
+	if (p == MAP_FAILED || p != BASE_ADDR) {
+		perror("mmap()");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Drop page cache */
+	write_file("/proc/sys/vm/drop_caches", "3", 2);
+	success("OK");
+	return p;
+}
+
+static void file_cleanup_area(void *p, unsigned long size)
+{
+	munmap(p, size);
+	close(finfo.fd);
+	unlink(finfo.path);
+}
+
+static void file_fault(void *p, unsigned long start, unsigned long end)
+{
+	if (madvise(((char *)p) + start, end - start, MADV_POPULATE_READ)) {
+		perror("madvise(MADV_POPULATE_READ");
+		exit(EXIT_FAILURE);
+	}
+}
+
+static bool file_check_huge(void *addr, int nr_hpages)
+{
+	switch (finfo.type) {
+	case VMA_FILE:
+		return check_huge_file(addr, nr_hpages, hpage_pmd_size);
+	case VMA_SHMEM:
+		return check_huge_shmem(addr, nr_hpages, hpage_pmd_size);
+	default:
+		exit(EXIT_FAILURE);
+		return false;
+	}
+}
+
+static struct mem_ops __anon_ops = {
 	.setup_area = &anon_setup_area,
 	.cleanup_area = &anon_cleanup_area,
 	.fault = &anon_fault,
 	.check_huge = &anon_check_huge,
+	.name = "anon",
+};
+
+static struct mem_ops __file_ops = {
+	.setup_area = &file_setup_area,
+	.cleanup_area = &file_cleanup_area,
+	.fault = &file_fault,
+	.check_huge = &file_check_huge,
+	.name = "file",
 };
 
 static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
@@ -509,6 +737,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
 	 */
 	settings.thp_enabled = THP_NEVER;
+	settings.shmem_enabled = SHMEM_NEVER;
 	push_settings(&settings);
 
 	/* Clear VM_NOHUGEPAGE */
@@ -580,12 +809,37 @@ static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
 		return;
 	}
 
+	/*
+	 * For file and shmem memory, khugepaged only retracts pte entries after
+	 * putting the new hugepage in the page cache. The hugepage must be
+	 * subsequently refaulted to install the pmd mapping for the mm.
+	 */
+	if (ops != &__anon_ops)
+		ops->fault(p, 0, nr_hpages * hpage_pmd_size);
+
 	if (ops->check_huge(p, expect ? nr_hpages : 0))
 		success("OK");
 	else
 		fail("Fail");
 }
 
+static struct collapse_context __khugepaged_context = {
+	.collapse = &khugepaged_collapse,
+	.enforce_pte_scan_limits = true,
+	.name = "khugepaged",
+};
+
+static struct collapse_context __madvise_context = {
+	.collapse = &madvise_collapse,
+	.enforce_pte_scan_limits = false,
+	.name = "madvise",
+};
+
+static bool is_tmpfs(struct mem_ops *ops)
+{
+	return ops == &__file_ops && finfo.type == VMA_SHMEM;
+}
+
 static void alloc_at_fault(void)
 {
 	struct settings settings = *current_settings();
@@ -658,6 +912,13 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 
 	p = ops->setup_area(1);
 
+	if (is_tmpfs(ops)) {
+		/* shmem pages always in the page cache */
+		printf("tmpfs...");
+		skip("Skip");
+		goto skip;
+	}
+
 	ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
 	c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1,
 		    ops, !c->enforce_pte_scan_limits);
@@ -670,6 +931,7 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 		validate_memory(p, 0,
 				(hpage_pmd_nr - max_ptes_none) * page_size);
 	}
+skip:
 	ops->cleanup_area(p, hpage_pmd_size);
 	pop_settings();
 }
@@ -753,6 +1015,13 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc
 
 	p = alloc_hpage(ops);
 
+	if (is_tmpfs(ops)) {
+		/* MADV_DONTNEED won't evict tmpfs pages */
+		printf("tmpfs...");
+		skip("Skip");
+		goto skip;
+	}
+
 	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
 	printf("Split huge page leaving single PTE mapping compound page...");
 	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
@@ -764,6 +1033,7 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc
 	c->collapse("Collapse PTE table with single PTE mapping compound page",
 		    p, 1, ops, true);
 	validate_memory(p, 0, page_size);
+skip:
 	ops->cleanup_area(p, hpage_pmd_size);
 }
 
@@ -1010,9 +1280,70 @@ static void madvise_collapse_existing_thps(struct collapse_context *c,
 	ops->cleanup_area(p, hpage_pmd_size);
 }
 
+static void usage(void)
+{
+	fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n");
+	fprintf(stderr, "\t<test type>\t: <context>:<mem_type>\n");
+	fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
+	fprintf(stderr, "\t<mem_type>\t: [all|anon|file]\n");
+	fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
+	fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
+	fprintf(stderr,	"\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
+	fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
+	fprintf(stderr,	"\tmounted with huge=madvise option for khugepaged tests to work\n");
+	exit(1);
+}
+
+static void parse_test_type(int argc, const char **argv)
+{
+	char *buf;
+	const char *token;
+
+	if (argc == 1) {
+		/* Backwards compatibility */
+		khugepaged_context =  &__khugepaged_context;
+		madvise_context =  &__madvise_context;
+		anon_ops = &__anon_ops;
+		return;
+	}
+
+	buf = strdup(argv[1]);
+	token = strsep(&buf, ":");
+
+	if (!strcmp(token, "all")) {
+		khugepaged_context =  &__khugepaged_context;
+		madvise_context =  &__madvise_context;
+	} else if (!strcmp(token, "khugepaged")) {
+		khugepaged_context =  &__khugepaged_context;
+	} else if (!strcmp(token, "madvise")) {
+		madvise_context =  &__madvise_context;
+	} else {
+		usage();
+	}
+
+	if (!buf)
+		usage();
+
+	if (!strcmp(buf, "all")) {
+		file_ops =  &__file_ops;
+		anon_ops = &__anon_ops;
+	} else if (!strcmp(buf, "anon")) {
+		anon_ops = &__anon_ops;
+	} else if (!strcmp(buf, "file")) {
+		file_ops =  &__file_ops;
+	} else {
+		usage();
+	}
+
+	if (!file_ops)
+		return;
+
+	if (argc != 3)
+		usage();
+}
+
 int main(int argc, const char **argv)
 {
-	struct collapse_context c;
 	struct settings default_settings = {
 		.thp_enabled = THP_MADVISE,
 		.thp_defrag = THP_DEFRAG_ALWAYS,
@@ -1023,8 +1354,20 @@ int main(int argc, const char **argv)
 			.alloc_sleep_millisecs = 10,
 			.scan_sleep_millisecs = 10,
 		},
+		/*
+		 * When testing file-backed memory, the collapse path
+		 * looks at how many pages are found in the page cache, not
+		 * what pages are mapped. Disable read ahead optimization so
+		 * pages don't find their way into the page cache unless
+		 * we mem_ops->fault() them in.
+		 */
+		.read_ahead_kb = 0,
 	};
-	const char *tests = argc == 1 ? "all" : argv[1];
+
+	parse_test_type(argc, argv);
+
+	if (file_ops)
+		get_finfo(argv[2]);
 
 	setbuf(stdout, NULL);
 
@@ -1042,43 +1385,61 @@ int main(int argc, const char **argv)
 
 	alloc_at_fault();
 
-	if (!strcmp(tests, "khugepaged") || !strcmp(tests, "all")) {
-		printf("\n*** Testing context: khugepaged ***\n");
-		c.collapse = &khugepaged_collapse;
-		c.enforce_pte_scan_limits = true;
-
-		collapse_full(&c, &anon_ops);
-		collapse_empty(&c, &anon_ops);
-		collapse_single_pte_entry(&c, &anon_ops);
-		collapse_max_ptes_none(&c, &anon_ops);
-		collapse_swapin_single_pte(&c, &anon_ops);
-		collapse_max_ptes_swap(&c, &anon_ops);
-		collapse_single_pte_entry_compound(&c, &anon_ops);
-		collapse_full_of_compound(&c, &anon_ops);
-		collapse_compound_extreme(&c, &anon_ops);
-		collapse_fork(&c, &anon_ops);
-		collapse_fork_compound(&c, &anon_ops);
-		collapse_max_ptes_shared(&c, &anon_ops);
-	}
-	if (!strcmp(tests, "madvise") || !strcmp(tests, "all")) {
-		printf("\n*** Testing context: madvise ***\n");
-		c.collapse = &madvise_collapse;
-		c.enforce_pte_scan_limits = false;
-
-		collapse_full(&c, &anon_ops);
-		collapse_empty(&c, &anon_ops);
-		collapse_single_pte_entry(&c, &anon_ops);
-		collapse_max_ptes_none(&c, &anon_ops);
-		collapse_swapin_single_pte(&c, &anon_ops);
-		collapse_max_ptes_swap(&c, &anon_ops);
-		collapse_single_pte_entry_compound(&c, &anon_ops);
-		collapse_full_of_compound(&c, &anon_ops);
-		collapse_compound_extreme(&c, &anon_ops);
-		collapse_fork(&c, &anon_ops);
-		collapse_fork_compound(&c, &anon_ops);
-		collapse_max_ptes_shared(&c, &anon_ops);
-		madvise_collapse_existing_thps(&c, &anon_ops);
-	}
+#define TEST(t, c, o) do { \
+	if (c && o) { \
+		printf("\nRun test: " #t " (%s:%s)\n", c->name, o->name); \
+		t(c, o); \
+	} \
+	} while (0)
+
+	TEST(collapse_full, khugepaged_context, anon_ops);
+	TEST(collapse_full, khugepaged_context, file_ops);
+	TEST(collapse_full, madvise_context, anon_ops);
+	TEST(collapse_full, madvise_context, file_ops);
+
+	TEST(collapse_empty, khugepaged_context, anon_ops);
+	TEST(collapse_empty, madvise_context, anon_ops);
+
+	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, file_ops);
+	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
+	TEST(collapse_single_pte_entry, madvise_context, file_ops);
+
+	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
+	TEST(collapse_max_ptes_none, khugepaged_context, file_ops);
+	TEST(collapse_max_ptes_none, madvise_context, anon_ops);
+	TEST(collapse_max_ptes_none, madvise_context, file_ops);
+
+	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
+	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
+	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
+	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
+
+	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
+	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
+	TEST(collapse_full_of_compound, madvise_context, anon_ops);
+	TEST(collapse_full_of_compound, madvise_context, file_ops);
+
+	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
+	TEST(collapse_compound_extreme, madvise_context, anon_ops);
+
+	TEST(collapse_swapin_single_pte, khugepaged_context, anon_ops);
+	TEST(collapse_swapin_single_pte, madvise_context, anon_ops);
+
+	TEST(collapse_max_ptes_swap, khugepaged_context, anon_ops);
+	TEST(collapse_max_ptes_swap, madvise_context, anon_ops);
+
+	TEST(collapse_fork, khugepaged_context, anon_ops);
+	TEST(collapse_fork, madvise_context, anon_ops);
+
+	TEST(collapse_fork_compound, khugepaged_context, anon_ops);
+	TEST(collapse_fork_compound, madvise_context, anon_ops);
+
+	TEST(collapse_max_ptes_shared, khugepaged_context, anon_ops);
+	TEST(collapse_max_ptes_shared, madvise_context, anon_ops);
+
+	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
+	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
 
 	restore_settings(0);
 }
diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c
index 9dae51b8219f..f11f8adda521 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -114,3 +114,13 @@ bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size)
 {
 	return __check_huge(addr, "AnonHugePages: ", nr_hpages, hpage_size);
 }
+
+bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size)
+{
+	return __check_huge(addr, "FilePmdMapped:", nr_hpages, hpage_size);
+}
+
+bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size)
+{
+	return __check_huge(addr, "ShmemPmdMapped:", nr_hpages, hpage_size);
+}
diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h
index 8434ea0c95cd..5c35de454e08 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -8,3 +8,5 @@ void clear_softdirty(void);
 bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
 uint64_t read_pmd_pagesize(void);
 bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
+bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
+bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH mm-unstable v3 08/10] selftests/vm: add thp collapse shmem testing
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (6 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 07/10] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 09/10] selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 10/10] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add memory operations for shmem (memfd) memory, and reuse
existing tests with the new memory operations.

Shmem tests can be selected with the "shmem" mem_type, and shmem tests
are also run with the "all" mem_type.

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 57 ++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 59f56a329f43..05d9945daa48 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -58,6 +58,7 @@ struct mem_ops {
 
 static struct mem_ops *file_ops;
 static struct mem_ops *anon_ops;
+static struct mem_ops *shmem_ops;
 
 struct collapse_context {
 	void (*collapse)(const char *msg, char *p, int nr_hpages,
@@ -708,6 +709,40 @@ static bool file_check_huge(void *addr, int nr_hpages)
 	}
 }
 
+static void *shmem_setup_area(int nr_hpages)
+{
+	void *p;
+	unsigned long size = nr_hpages * hpage_pmd_size;
+
+	finfo.fd = memfd_create("khugepaged-selftest-collapse-shmem", 0);
+	if (finfo.fd < 0)  {
+		perror("memfd_create()");
+		exit(EXIT_FAILURE);
+	}
+	if (ftruncate(finfo.fd, size)) {
+		perror("ftruncate()");
+		exit(EXIT_FAILURE);
+	}
+	p = mmap(BASE_ADDR, size, PROT_READ | PROT_WRITE, MAP_SHARED, finfo.fd,
+		 0);
+	if (p != BASE_ADDR) {
+		perror("mmap()");
+		exit(EXIT_FAILURE);
+	}
+	return p;
+}
+
+static void shmem_cleanup_area(void *p, unsigned long size)
+{
+	munmap(p, size);
+	close(finfo.fd);
+}
+
+static bool shmem_check_huge(void *addr, int nr_hpages)
+{
+	return check_huge_shmem(addr, nr_hpages, hpage_pmd_size);
+}
+
 static struct mem_ops __anon_ops = {
 	.setup_area = &anon_setup_area,
 	.cleanup_area = &anon_cleanup_area,
@@ -724,6 +759,14 @@ static struct mem_ops __file_ops = {
 	.name = "file",
 };
 
+static struct mem_ops __shmem_ops = {
+	.setup_area = &shmem_setup_area,
+	.cleanup_area = &shmem_cleanup_area,
+	.fault = &anon_fault,
+	.check_huge = &shmem_check_huge,
+	.name = "shmem",
+};
+
 static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 			       struct mem_ops *ops, bool expect)
 {
@@ -1285,7 +1328,7 @@ static void usage(void)
 	fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n");
 	fprintf(stderr, "\t<test type>\t: <context>:<mem_type>\n");
 	fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
-	fprintf(stderr, "\t<mem_type>\t: [all|anon|file]\n");
+	fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
 	fprintf(stderr, "\n\t\"file,all\" mem_type requires [dir] argument\n");
 	fprintf(stderr, "\n\t\"file,all\" mem_type requires kernel built with\n");
 	fprintf(stderr,	"\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
@@ -1327,10 +1370,13 @@ static void parse_test_type(int argc, const char **argv)
 	if (!strcmp(buf, "all")) {
 		file_ops =  &__file_ops;
 		anon_ops = &__anon_ops;
+		shmem_ops = &__shmem_ops;
 	} else if (!strcmp(buf, "anon")) {
 		anon_ops = &__anon_ops;
 	} else if (!strcmp(buf, "file")) {
 		file_ops =  &__file_ops;
+	} else if (!strcmp(buf, "shmem")) {
+		shmem_ops = &__shmem_ops;
 	} else {
 		usage();
 	}
@@ -1347,7 +1393,7 @@ int main(int argc, const char **argv)
 	struct settings default_settings = {
 		.thp_enabled = THP_MADVISE,
 		.thp_defrag = THP_DEFRAG_ALWAYS,
-		.shmem_enabled = SHMEM_NEVER,
+		.shmem_enabled = SHMEM_ADVISE,
 		.use_zero_page = 0,
 		.khugepaged = {
 			.defrag = 1,
@@ -1394,16 +1440,20 @@ int main(int argc, const char **argv)
 
 	TEST(collapse_full, khugepaged_context, anon_ops);
 	TEST(collapse_full, khugepaged_context, file_ops);
+	TEST(collapse_full, khugepaged_context, shmem_ops);
 	TEST(collapse_full, madvise_context, anon_ops);
 	TEST(collapse_full, madvise_context, file_ops);
+	TEST(collapse_full, madvise_context, shmem_ops);
 
 	TEST(collapse_empty, khugepaged_context, anon_ops);
 	TEST(collapse_empty, madvise_context, anon_ops);
 
 	TEST(collapse_single_pte_entry, khugepaged_context, anon_ops);
 	TEST(collapse_single_pte_entry, khugepaged_context, file_ops);
+	TEST(collapse_single_pte_entry, khugepaged_context, shmem_ops);
 	TEST(collapse_single_pte_entry, madvise_context, anon_ops);
 	TEST(collapse_single_pte_entry, madvise_context, file_ops);
+	TEST(collapse_single_pte_entry, madvise_context, shmem_ops);
 
 	TEST(collapse_max_ptes_none, khugepaged_context, anon_ops);
 	TEST(collapse_max_ptes_none, khugepaged_context, file_ops);
@@ -1417,8 +1467,10 @@ int main(int argc, const char **argv)
 
 	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
 	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
+	TEST(collapse_full_of_compound, khugepaged_context, shmem_ops);
 	TEST(collapse_full_of_compound, madvise_context, anon_ops);
 	TEST(collapse_full_of_compound, madvise_context, file_ops);
+	TEST(collapse_full_of_compound, madvise_context, shmem_ops);
 
 	TEST(collapse_compound_extreme, khugepaged_context, anon_ops);
 	TEST(collapse_compound_extreme, madvise_context, anon_ops);
@@ -1440,6 +1492,7 @@ int main(int argc, const char **argv)
 
 	TEST(madvise_collapse_existing_thps, madvise_context, anon_ops);
 	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
+	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
 
 	restore_settings(0);
 }
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH mm-unstable v3 09/10] selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (7 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 08/10] selftests/vm: add thp collapse shmem testing Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  2022-09-07 14:45 ` [PATCH mm-unstable v3 10/10] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

This test exercises MADV_COLLAPSE acting on file/shmem memory for which
(1) the file extent mapped by the memory is already a huge page in the
page cache, and (2) the pmd mapping this memory in the target process
is none.

In practice, (1)+(2) is the state left over after khugepaged has
successfully collapsed file/shmem memory for a target VMA, but the
memory has not yet been refaulted. So, in effect, this test exercises
MADV_COLLAPSE racing with khugepaged to collapse the memory first.
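
The new test runs in the madvise context against both file and shmem
memory; as a usage sketch (this invocation is just one example of how
to run it, per the selftest's usage text):

	# run only the madvise-context tests against shmem memory
	./khugepaged madvise:shmem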

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/khugepaged.c | 30 +++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 05d9945daa48..730507e38c58 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -1323,6 +1323,33 @@ static void madvise_collapse_existing_thps(struct collapse_context *c,
 	ops->cleanup_area(p, hpage_pmd_size);
 }
 
+/*
+ * Test race with khugepaged where page tables have been retracted and
+ * pmd cleared.
+ */
+static void madvise_retracted_page_tables(struct collapse_context *c,
+					  struct mem_ops *ops)
+{
+	void *p;
+	int nr_hpages = 1;
+	unsigned long size = nr_hpages * hpage_pmd_size;
+
+	p = ops->setup_area(nr_hpages);
+	ops->fault(p, 0, size);
+
+	/* Let khugepaged collapse and leave pmd cleared */
+	if (wait_for_scan("Collapse and leave PMD cleared", p, nr_hpages,
+			  ops)) {
+		fail("Timeout");
+		return;
+	}
+	success("OK");
+	c->collapse("Install huge PMD from page cache", p, nr_hpages, ops,
+		    true);
+	validate_memory(p, 0, size);
+	ops->cleanup_area(p, size);
+}
+
 static void usage(void)
 {
 	fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n");
@@ -1494,5 +1521,8 @@ int main(int argc, const char **argv)
 	TEST(madvise_collapse_existing_thps, madvise_context, file_ops);
 	TEST(madvise_collapse_existing_thps, madvise_context, shmem_ops);
 
+	TEST(madvise_retracted_page_tables, madvise_context, file_ops);
+	TEST(madvise_retracted_page_tables, madvise_context, shmem_ops);
+
 	restore_settings(0);
 }
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH mm-unstable v3 10/10] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
                   ` (8 preceding siblings ...)
  2022-09-07 14:45 ` [PATCH mm-unstable v3 09/10] selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd Zach O'Keefe
@ 2022-09-07 14:45 ` Zach O'Keefe
  9 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-07 14:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-api, Axel Rasmussen, James Houghton,
	Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia,
	Zach O'Keefe

Add :collapse mod to userfaultfd selftest.  Currently this mod is
only valid for "shmem" test type, but could be used for other test
types.

When provided, memory allocated by ->allocate_area() will be
hugepage-aligned and enforced to be hugepage-sized.
userfaultfd_minor_test, after the UFFD-registered mapping has been
populated by the UFFD minor fault handler, attempts to MADV_COLLAPSE
the UFFD-registered mapping to collapse the memory into a pmd-mapped
THP.

This test is meant to be a functional test of what occurs during
UFFD-driven live migration of VMs backed by huge tmpfs where, after
a hugepage-sized region has been successfully migrated (in native
page-sized chunks, to avoid the latency of fetching a hugepage over the
network), we want to recover previous VM performance by remapping it
at the PMD level.
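
As a usage sketch: when :collapse is set, the region size must be a
multiple of the huge page size (the numbers below are only an example):

	# shmem test with the new :collapse mod, 512MiB region, 32 bounces
	./userfaultfd shmem:collapse 512 32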

Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 tools/testing/selftests/vm/Makefile      |   1 +
 tools/testing/selftests/vm/userfaultfd.c | 171 ++++++++++++++++++-----
 2 files changed, 134 insertions(+), 38 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index c9c0996c122b..c687533374e6 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -99,6 +99,7 @@ $(OUTPUT)/khugepaged: vm_util.c
 $(OUTPUT)/madv_populate: vm_util.c
 $(OUTPUT)/soft-dirty: vm_util.c
 $(OUTPUT)/split_huge_page_test: vm_util.c
+$(OUTPUT)/userfaultfd: vm_util.c
 
 ifeq ($(MACHINE),x86_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7be709d9eed0..74babdbc02e5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -61,10 +61,11 @@
 #include <sys/random.h>
 
 #include "../kselftest.h"
+#include "vm_util.h"
 
 #ifdef __NR_userfaultfd
 
-static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
 
 #define BOUNCE_RANDOM		(1<<0)
 #define BOUNCE_RACINGFAULTS	(1<<1)
@@ -79,6 +80,8 @@ static int test_type;
 
 #define UFFD_FLAGS	(O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
 
+#define BASE_PMD_ADDR ((void *)(1UL << 30))
+
 /* test using /dev/userfaultfd, instead of userfaultfd(2) */
 static bool test_dev_userfaultfd;
 
@@ -97,9 +100,10 @@ static int huge_fd;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
-static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
+static char *area_src, *area_src_alias, *area_dst, *area_dst_alias, *area_remap;
 static char *zeropage;
 pthread_attr_t attr;
+static bool test_collapse;
 
 /* Userfaultfd test statistics */
 struct uffd_stats {
@@ -127,6 +131,8 @@ struct uffd_stats {
 #define swap(a, b) \
 	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)
 
+#define factor_of_2(x) ((x) ^ ((x) & ((x) - 1)))
+
 const char *examples =
     "# Run anonymous memory test on 100MiB region with 99999 bounces:\n"
     "./userfaultfd anon 100 99999\n\n"
@@ -152,6 +158,8 @@ static void usage(void)
 		"Supported mods:\n");
 	fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n");
 	fprintf(stderr, "\tdev - Use /dev/userfaultfd instead of userfaultfd(2)\n");
+	fprintf(stderr, "\tcollapse - Test MADV_COLLAPSE of UFFDIO_REGISTER_MODE_MINOR\n"
+		"memory\n");
 	fprintf(stderr, "\nExample test mod usage:\n");
 	fprintf(stderr, "# Run anonymous memory test with /dev/userfaultfd:\n");
 	fprintf(stderr, "./userfaultfd anon:dev 100 99999\n\n");
@@ -229,12 +237,10 @@ static void anon_release_pages(char *rel_area)
 		err("madvise(MADV_DONTNEED) failed");
 }
 
-static void anon_allocate_area(void **alloc_area)
+static void anon_allocate_area(void **alloc_area, bool is_src)
 {
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
 			   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
-	if (*alloc_area == MAP_FAILED)
-		err("mmap of anonymous memory failed");
 }
 
 static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
@@ -252,7 +258,7 @@ static void hugetlb_release_pages(char *rel_area)
 	}
 }
 
-static void hugetlb_allocate_area(void **alloc_area)
+static void hugetlb_allocate_area(void **alloc_area, bool is_src)
 {
 	void *area_alias = NULL;
 	char **alloc_area_alias;
@@ -262,7 +268,7 @@ static void hugetlb_allocate_area(void **alloc_area)
 			nr_pages * page_size,
 			PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
-				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+				(is_src ? 0 : MAP_NORESERVE),
 			-1,
 			0);
 	else
@@ -270,9 +276,9 @@ static void hugetlb_allocate_area(void **alloc_area)
 			nr_pages * page_size,
 			PROT_READ | PROT_WRITE,
 			MAP_SHARED |
-				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+				(is_src ? 0 : MAP_NORESERVE),
 			huge_fd,
-			*alloc_area == area_src ? 0 : nr_pages * page_size);
+			is_src ? 0 : nr_pages * page_size);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
@@ -282,12 +288,12 @@ static void hugetlb_allocate_area(void **alloc_area)
 			PROT_READ | PROT_WRITE,
 			MAP_SHARED,
 			huge_fd,
-			*alloc_area == area_src ? 0 : nr_pages * page_size);
+			is_src ? 0 : nr_pages * page_size);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
 
-	if (*alloc_area == area_src) {
+	if (is_src) {
 		alloc_area_alias = &area_src_alias;
 	} else {
 		alloc_area_alias = &area_dst_alias;
@@ -310,21 +316,36 @@ static void shmem_release_pages(char *rel_area)
 		err("madvise(MADV_REMOVE) failed");
 }
 
-static void shmem_allocate_area(void **alloc_area)
+static void shmem_allocate_area(void **alloc_area, bool is_src)
 {
 	void *area_alias = NULL;
-	bool is_src = alloc_area == (void **)&area_src;
-	unsigned long offset = is_src ? 0 : nr_pages * page_size;
+	size_t bytes = nr_pages * page_size;
+	unsigned long offset = is_src ? 0 : bytes;
+	char *p = NULL, *p_alias = NULL;
+
+	if (test_collapse) {
+		p = BASE_PMD_ADDR;
+		if (!is_src)
+			/* src map + alias + interleaved hpages */
+			p += 2 * (bytes + hpage_size);
+		p_alias = p;
+		p_alias += bytes;
+		p_alias += hpage_size;  /* Prevent src/dst VMA merge */
+	}
 
-	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   MAP_SHARED, shm_fd, offset);
+	*alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+			   shm_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
+	if (test_collapse && *alloc_area != p)
+		err("mmap of memfd failed at %p", p);
 
-	area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			  MAP_SHARED, shm_fd, offset);
+	area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
+			  shm_fd, offset);
 	if (area_alias == MAP_FAILED)
 		err("mmap of memfd alias failed");
+	if (test_collapse && area_alias != p_alias)
+		err("mmap of anonymous memory failed at %p", p_alias);
 
 	if (is_src)
 		area_src_alias = area_alias;
@@ -337,28 +358,39 @@ static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 	*start = (unsigned long)area_dst_alias + offset;
 }
 
+static void shmem_check_pmd_mapping(void *p, int expect_nr_hpages)
+{
+	if (!check_huge_shmem(area_dst_alias, expect_nr_hpages, hpage_size))
+		err("Did not find expected %d number of hugepages",
+		    expect_nr_hpages);
+}
+
 struct uffd_test_ops {
-	void (*allocate_area)(void **alloc_area);
+	void (*allocate_area)(void **alloc_area, bool is_src);
 	void (*release_pages)(char *rel_area);
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
+	void (*check_pmd_mapping)(void *p, int expect_nr_hpages);
 };
 
 static struct uffd_test_ops anon_uffd_test_ops = {
 	.allocate_area	= anon_allocate_area,
 	.release_pages	= anon_release_pages,
 	.alias_mapping = noop_alias_mapping,
+	.check_pmd_mapping = NULL,
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
 	.alias_mapping = shmem_alias_mapping,
+	.check_pmd_mapping = shmem_check_pmd_mapping,
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
 	.allocate_area	= hugetlb_allocate_area,
 	.release_pages	= hugetlb_release_pages,
 	.alias_mapping = hugetlb_alias_mapping,
+	.check_pmd_mapping = NULL,
 };
 
 static struct uffd_test_ops *uffd_test_ops;
@@ -478,6 +510,7 @@ static void uffd_test_ctx_clear(void)
 	munmap_area((void **)&area_src_alias);
 	munmap_area((void **)&area_dst);
 	munmap_area((void **)&area_dst_alias);
+	munmap_area((void **)&area_remap);
 }
 
 static void uffd_test_ctx_init(uint64_t features)
@@ -486,8 +519,8 @@ static void uffd_test_ctx_init(uint64_t features)
 
 	uffd_test_ctx_clear();
 
-	uffd_test_ops->allocate_area((void **)&area_src);
-	uffd_test_ops->allocate_area((void **)&area_dst);
+	uffd_test_ops->allocate_area((void **)&area_src, true);
+	uffd_test_ops->allocate_area((void **)&area_dst, false);
 
 	userfaultfd_open(&features);
 
@@ -804,6 +837,7 @@ static void *uffd_poll_thread(void *arg)
 				err("remove failure");
 			break;
 		case UFFD_EVENT_REMAP:
+			area_remap = area_dst;  /* save for later unmap */
 			area_dst = (char *)(unsigned long)msg.arg.remap.to;
 			break;
 		}
@@ -1256,13 +1290,30 @@ static int userfaultfd_sig_test(void)
 	return userfaults != 0;
 }
 
+void check_memory_contents(char *p)
+{
+	unsigned long i;
+	uint8_t expected_byte;
+	void *expected_page;
+
+	if (posix_memalign(&expected_page, page_size, page_size))
+		err("out of memory");
+
+	for (i = 0; i < nr_pages; ++i) {
+		expected_byte = ~((uint8_t)(i % ((uint8_t)-1)));
+		memset(expected_page, expected_byte, page_size);
+		if (my_bcmp(expected_page, p + (i * page_size), page_size))
+			err("unexpected page contents after minor fault");
+	}
+
+	free(expected_page);
+}
+
 static int userfaultfd_minor_test(void)
 {
-	struct uffdio_register uffdio_register;
 	unsigned long p;
+	struct uffdio_register uffdio_register;
 	pthread_t uffd_mon;
-	uint8_t expected_byte;
-	void *expected_page;
 	char c;
 	struct uffd_stats stats = { 0 };
 
@@ -1301,17 +1352,7 @@ static int userfaultfd_minor_test(void)
 	 * fault. uffd_poll_thread will resolve the fault by bit-flipping the
 	 * page's contents, and then issuing a CONTINUE ioctl.
 	 */
-
-	if (posix_memalign(&expected_page, page_size, page_size))
-		err("out of memory");
-
-	for (p = 0; p < nr_pages; ++p) {
-		expected_byte = ~((uint8_t)(p % ((uint8_t)-1)));
-		memset(expected_page, expected_byte, page_size);
-		if (my_bcmp(expected_page, area_dst_alias + (p * page_size),
-			    page_size))
-			err("unexpected page contents after minor fault");
-	}
+	check_memory_contents(area_dst_alias);
 
 	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
 		err("pipe write");
@@ -1320,6 +1361,23 @@ static int userfaultfd_minor_test(void)
 
 	uffd_stats_report(&stats, 1);
 
+	if (test_collapse) {
+		printf("testing collapse of uffd memory into PMD-mapped THPs:");
+		if (madvise(area_dst_alias, nr_pages * page_size,
+			    MADV_COLLAPSE))
+			err("madvise(MADV_COLLAPSE)");
+
+		uffd_test_ops->check_pmd_mapping(area_dst,
+						 nr_pages * page_size /
+						 hpage_size);
+		/*
+		 * This won't cause uffd-fault - it purely just makes sure there
+		 * was no corruption.
+		 */
+		check_memory_contents(area_dst_alias);
+		printf(" done.\n");
+	}
+
 	return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
 }
 
@@ -1656,6 +1714,8 @@ static void parse_test_type_arg(const char *raw_type)
 			test_dev_userfaultfd = true;
 		else if (!strcmp(token, "syscall"))
 			test_dev_userfaultfd = false;
+		else if (!strcmp(token, "collapse"))
+			test_collapse = true;
 		else
 			err("unrecognized test mod '%s'", token);
 	}
@@ -1663,8 +1723,11 @@ static void parse_test_type_arg(const char *raw_type)
 	if (!test_type)
 		err("failed to parse test type argument: '%s'", raw_type);
 
+	if (test_collapse && test_type != TEST_SHMEM)
+		err("Unsupported test: %s", raw_type);
+
 	if (test_type == TEST_HUGETLB)
-		page_size = default_huge_page_size();
+		page_size = hpage_size;
 	else
 		page_size = sysconf(_SC_PAGE_SIZE);
 
@@ -1702,6 +1765,8 @@ static void sigalrm(int sig)
 
 int main(int argc, char **argv)
 {
+	size_t bytes;
+
 	if (argc < 4)
 		usage();
 
@@ -1709,11 +1774,41 @@ int main(int argc, char **argv)
 		err("failed to arm SIGALRM");
 	alarm(ALARM_INTERVAL_SECS);
 
+	hpage_size = default_huge_page_size();
 	parse_test_type_arg(argv[1]);
+	bytes = atol(argv[2]) * 1024 * 1024;
+
+	if (test_collapse && bytes & (hpage_size - 1))
+		err("MiB must be multiple of %lu if :collapse mod set",
+		    hpage_size >> 20);
 
 	nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
-	nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
-		nr_cpus;
+
+	if (test_collapse) {
+		/* nr_cpus must divide (bytes / page_size), otherwise,
+		 * area allocations of (nr_pages * paze_size) won't be a
+		 * multiple of hpage_size, even if bytes is a multiple of
+		 * hpage_size.
+		 *
+		 * This means that nr_cpus must divide (N * (2 << (H-P))
+		 * where:
+		 *	bytes = hpage_size * N
+		 *	hpage_size = 2 << H
+		 *	page_size = 2 << P
+		 *
+		 * And we want to chose nr_cpus to be the largest value
+		 * satisfying this constraint, not larger than the number
+		 * of online CPUs. Unfortunately, prime factorization of
+		 * N and nr_cpus may be arbitrary, so have to search for it.
+		 * Instead, just use the highest power of 2 dividing both
+		 * nr_cpus and (bytes / page_size).
+		 */
+		int x = factor_of_2(nr_cpus);
+		int y = factor_of_2(bytes / page_size);
+
+		nr_cpus = x < y ? x : y;
+	}
+	nr_pages_per_cpu = bytes / page_size / nr_cpus;
 	if (!nr_pages_per_cpu) {
 		_err("invalid MiB");
 		usage();
-- 
2.37.2.789.g6183377224-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
  2022-09-07 14:45 ` [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
@ 2022-09-16 17:46   ` Yang Shi
  2022-09-16 22:22     ` Zach O'Keefe
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2022-09-16 17:46 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Extend 'mm/thp: add flag to enforce sysfs THP in
> hugepage_vma_check()' to shmem, allowing callers to ignore
> /sys/kernel/transparent_hugepage/shmem_enabled and tmpfs huge= mount.
>
> This is intended to be used by MADV_COLLAPSE, and the rationale is
> analogous to the anon/file case: MADV_COLLAPSE is not coupled to
> directives that advise the kernel's decisions on when THPs should be
> considered eligible. shmem/tmpfs always claims large folio support,
> regardless of sysfs or mount options.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

A nit below...

> ---
>  include/linux/shmem_fs.h | 10 ++++++----
>  mm/huge_memory.c         |  2 +-
>  mm/shmem.c               | 18 +++++++++---------
>  3 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index f24071e3c826..d500ea967dc7 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -92,11 +92,13 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
>  int shmem_unuse(unsigned int type);
>
> -extern bool shmem_is_huge(struct vm_area_struct *vma,
> -                         struct inode *inode, pgoff_t index);
> -static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
> +extern bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> +                         pgoff_t index, bool shmem_huge_force);
> +static inline bool shmem_huge_enabled(struct vm_area_struct *vma,
> +                                     bool shmem_huge_force)
>  {
> -       return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
> +       return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff,
> +                            shmem_huge_force);
>  }
>  extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
>  extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7fa74b9749a6..53d170dac332 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -119,7 +119,7 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
>          * own flags.
>          */
>         if (!in_pf && shmem_file(vma->vm_file))
> -               return shmem_huge_enabled(vma);
> +               return shmem_huge_enabled(vma, !enforce_sysfs);
>
>         /* Enforce sysfs THP requirements as necessary */
>         if (enforce_sysfs &&
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 99b7341bd0bf..47c42c566fd1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -461,20 +461,20 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>
>  static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> -bool shmem_is_huge(struct vm_area_struct *vma,
> -                  struct inode *inode, pgoff_t index)
> +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> +                  pgoff_t index, bool shmem_huge_force)
>  {
>         loff_t i_size;
>
>         if (!S_ISREG(inode->i_mode))
>                 return false;
> -       if (shmem_huge == SHMEM_HUGE_DENY)
> -               return false;
>         if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
>                 return false;
> -       if (shmem_huge == SHMEM_HUGE_FORCE)
> +       if (shmem_huge == SHMEM_HUGE_FORCE || shmem_huge_force)

shmem_huge_force means ignoring all sysfs and mount options, so it seems
better to test it explicitly IMHO, like:

if (shmem_huge_force)
    return true;

if (shmem_huge == SHMEM_HUGE_FORCE)
    return true;


>                 return true;
> +       if (shmem_huge == SHMEM_HUGE_DENY)
> +               return false;
>
>         switch (SHMEM_SB(inode->i_sb)->huge) {
>         case SHMEM_HUGE_ALWAYS:
> @@ -669,8 +669,8 @@ static long shmem_unused_huge_count(struct super_block *sb,
>
>  #define shmem_huge SHMEM_HUGE_DENY
>
> -bool shmem_is_huge(struct vm_area_struct *vma,
> -                  struct inode *inode, pgoff_t index)
> +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> +                  pgoff_t index, bool shmem_huge_force)
>  {
>         return false;
>  }
> @@ -1056,7 +1056,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
>                         STATX_ATTR_NODUMP);
>         generic_fillattr(&init_user_ns, inode, stat);
>
> -       if (shmem_is_huge(NULL, inode, 0))
> +       if (shmem_is_huge(NULL, inode, 0, false))
>                 stat->blksize = HPAGE_PMD_SIZE;
>
>         if (request_mask & STATX_BTIME) {
> @@ -1888,7 +1888,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>                 return 0;
>         }
>
> -       if (!shmem_is_huge(vma, inode, index))
> +       if (!shmem_is_huge(vma, inode, index, false))
>                 goto alloc_nohuge;
>
>         huge_gfp = vma_thp_gfp_mask(vma);
> --
> 2.37.2.789.g6183377224-goog
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
  2022-09-07 14:45 ` [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe
@ 2022-09-16 18:26   ` Yang Shi
  2022-09-19 15:36     ` Zach O'Keefe
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2022-09-16 18:26 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> The main benefit of THPs are that they can be mapped at the pmd level,
> increasing the likelihood of TLB hit and spending less cycles in page
> table walks.  pte-mapped hugepages - that is - hugepage-aligned compound
> pages of order HPAGE_PMD_ORDER mapped by ptes - although being
> contiguous in physical memory, don't have this advantage.  In fact, one
> could argue they are detrimental to system performance overall since
> they occupy a precious hugepage-aligned/sized region of physical memory
> that could otherwise be used more effectively.  Additionally, pte-mapped
> hugepages can be the cheapest memory to collapse for khugepaged since no
> new hugepage allocation or copying of memory contents is necessary - we
> only need to update the mapping page tables.
>
> In the anonymous collapse path, we are able to collapse pte-mapped
> hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
> effort when compound pages (of any order) are encountered.
>
> Identify pte-mapped hugepages in the file/shmem collapse path.  The
> final step of which makes a racy check of the value of the pmd to ensure
> it maps a pte table.  This should be fine, since races that result in
> false-positive (i.e. attempt collapse even though we sholdn't) will fail

s/sholdn't/shouldn't

> later in collapse_pte_mapped_thp() once we actually lock mmap_lock and
> reinspect the pmd value.  Races that result in false-negatives (i.e.
> where we decide to not attempt collapse, but should have) shouldn't be
> an issue, since in the worst case, we do nothing - which is what we've
> done up to this point.  We make a similar check in retract_page_tables().
> If we do think we've found a pte-mapped hugepgae in khugepaged context,
> attempt to update page tables mapping this hugepage.
>
> Note that these collapses still count towards the
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and
> if the pte-mapped hugepage was also mapped into multiple process' address
> spaces, could be incremented for each page table update.  Since we
> increment the counter when a pte-mapped hugepage is successfully added to
> the list of to-collapse pte-mapped THPs, it's possible that we never
> actually update the page table either.  This is different from how
> file/shmem pages_collapsed accounting works today where only a successful
> page cache update is counted (it's also possible here that no page tables
> are actually changed).  Though it incurs some slop, this is preferred to
> either not accounting for the event at all, or plumbing through data in
> struct mm_slot on whether to account for the collapse or not.

I don't have a strong preference on this. Typically it is used to tell
the users khugepaged is making progress. We have thp_collapse_alloc
from /proc/vmstat to account for how many huge pages are really
allocated by khugepaged/MADV_COLLAPSE.

But it may be better to add a note in the documentation
(Documentation/admin-guide/mm/transhuge.rst) to make it more explicit.

>
> Also note that work still needs to be done to support arbitrary compound
> pages, and that this should all be converted to using folios.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Other than the above comments and two nits below, the patch looks good
to me. Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
>  2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 55392bf30a03..fbbb25494d60 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -17,6 +17,7 @@
>         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
>         EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")              \
>         EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")                  \
> +       EM( SCAN_PTE_MAPPED_HUGEPAGE,   "pte_mapped_hugepage")          \
>         EM( SCAN_PAGE_RO,               "no_writable_page")             \
>         EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")         \
>         EM( SCAN_PAGE_NULL,             "page_null")                    \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 55c8625ed950..31ccf49cf279 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -35,6 +35,7 @@ enum scan_result {
>         SCAN_EXCEED_SHARED_PTE,
>         SCAN_PTE_NON_PRESENT,
>         SCAN_PTE_UFFD_WP,
> +       SCAN_PTE_MAPPED_HUGEPAGE,
>         SCAN_PAGE_RO,
>         SCAN_LACK_REFERENCED_PAGE,
>         SCAN_PAGE_NULL,
> @@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>   * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
>   * khugepaged should try to collapse the page table.
>   */
> -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>                                           unsigned long addr)
>  {
>         struct khugepaged_mm_slot *mm_slot;
>         struct mm_slot *slot;
> +       bool ret = false;
>
>         VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
>
>         spin_lock(&khugepaged_mm_lock);
>         slot = mm_slot_lookup(mm_slots_hash, mm);
>         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> -       if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
> +       if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
>                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> +               ret = true;
> +       }
>         spin_unlock(&khugepaged_mm_lock);
> +       return ret;
>  }
>
>  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>         pte_t *start_pte, *pte;
>         pmd_t *pmd;
>         spinlock_t *ptl;
> -       int count = 0;
> +       int count = 0, result = SCAN_FAIL;
>         int i;
>
> +       mmap_assert_write_locked(mm);
> +
> +       /* Fast check before locking page if already PMD-mapped  */

It also backs off if the page is not mapped at all, so it would be
better to reflect this in the comment too.
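
Maybe something like (just a wording sketch):

	/* Fast check before locking page if already PMD-mapped, or not mapped at all */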

> +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> +       if (result != SCAN_SUCCEED)
> +               return;
> +
>         if (!vma || !vma->vm_file ||
>             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>                 return;
> @@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
>                 /*
>                  * If file was truncated then extended, or hole-punched, before
>                  * we locked the first page, then a THP might be there already.
> +                * This will be discovered on the first iteration.
>                  */
>                 if (PageTransCompound(page)) {
> -                       result = SCAN_PAGE_COMPOUND;
> +                       struct page *head = compound_head(page);
> +
> +                       result = compound_order(head) == HPAGE_PMD_ORDER &&
> +                                       head->index == start
> +                                       /* Maybe PMD-mapped */
> +                                       ? SCAN_PTE_MAPPED_HUGEPAGE
> +                                       : SCAN_PAGE_COMPOUND;
>                         goto out_unlock;
>                 }
>
> @@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                  * into a PMD sized page
>                  */

The comment starts with "XXX:"; it would be better to rephrase it to
"TODO:", which seems more understandable.

>                 if (PageTransCompound(page)) {
> -                       result = SCAN_PAGE_COMPOUND;
> +                       struct page *head = compound_head(page);
> +
> +                       result = compound_order(head) == HPAGE_PMD_ORDER &&
> +                                       head->index == start
> +                                       /* Maybe PMD-mapped */
> +                                       ? SCAN_PTE_MAPPED_HUGEPAGE
> +                                       : SCAN_PAGE_COMPOUND;
> +                       /*
> +                        * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> +                        * by the caller won't touch the page cache, and so
> +                        * it's safe to skip LRU and refcount checks before
> +                        * returning.
> +                        */
>                         break;
>                 }
>
> @@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>  static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
>  {
>  }
> +
> +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> +                                         unsigned long addr)
> +{
> +       return false;
> +}
>  #endif
>
>  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> @@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                                                                   &mmap_locked,
>                                                                   cc);
>                         }
> -                       if (*result == SCAN_SUCCEED)
> +                       switch (*result) {
> +                       case SCAN_PTE_MAPPED_HUGEPAGE: {
> +                               pmd_t *pmd;
> +
> +                               *result = find_pmd_or_thp_or_none(mm,
> +                                                                 khugepaged_scan.address,
> +                                                                 &pmd);
> +                               if (*result != SCAN_SUCCEED)
> +                                       break;
> +                               if (!khugepaged_add_pte_mapped_thp(mm,
> +                                                                  khugepaged_scan.address))
> +                                       break;
> +                       } fallthrough;
> +                       case SCAN_SUCCEED:
>                                 ++khugepaged_pages_collapsed;
> +                               break;
> +                       default:
> +                               break;
> +                       }
> +
>                         /* move to next address */
>                         khugepaged_scan.address += HPAGE_PMD_SIZE;
>                         progress += HPAGE_PMD_NR;
> --
> 2.37.2.789.g6183377224-goog
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-07 14:45 ` [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe
@ 2022-09-16 20:38   ` Yang Shi
  2022-09-19 15:29     ` Zach O'Keefe
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2022-09-16 20:38 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
> memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
>
> On success, the backing memory will be a hugepage.  For the memory range
> and process provided, the page tables will synchronously have a huge pmd
> installed, mapping the THP.  Other mappings of the file extent mapped by
> the memory range may be added to a set of entries that khugepaged will
> later process and attempt update their page tables to map the THP by a pmd.
>
> This functionality unlocks two important uses:
>
> (1)     Immediately back executable text by THPs.  Current support provided
>         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
>         system which might impair services from serving at their full rated
>         load after (re)starting.  Tricks like mremap(2)'ing text onto
>         anonymous memory to immediately realize iTLB performance prevents
>         page sharing and demand paging, both of which increase steady state
>         memory footprint.  Now, we can have the best of both worlds: Peak
>         upfront performance and lower RAM footprints.
>
> (2)     userfaultfd-based live migration of virtual machines satisfy UFFD
>         faults by fetching native-sized pages over the network (to avoid
>         latency of transferring an entire hugepage).  However, after guest
>         memory has been fully copied to the new host, MADV_COLLAPSE can
>         be used to immediately increase guest performance.
>
> Since khugepaged is single threaded, this change now introduces
> possibility of collapse contexts racing in file collapse path.  There a
> important few places to consider:
>
> (1)     hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
>         We could have the memory collapsed out from under us, but
>         the next xas_for_each() iteration will correctly pick up the
>         hugepage.  The hugepage might not be up to date (insofar as
>         copying of small page contents might not have completed - the
>         page still may be locked), but regardless what small page index
>         we were iterating over, we'll find the hugepage and identify it
>         as a suitably aligned compound page of order HPAGE_PMD_ORDER.
>
>         In khugepaged path, we locklessly check the value of the pmd,
>         and only add it to deferred collapse array if we find pmd
>         mapping pte table. This is fine, since other values that could
>         have raced in right afterwards denote failure, or that the
>         memory was successfully collapsed, so we don't need further
>         processing.
>
>         In madvise path, we'll take mmap_lock() in write to serialize
>         against page table updates and will know what to do based on the
>         true value of the pmd: recheck all ptes if we point to a pte table,
>         directly install the pmd, if the pmd has been cleared, but
>         memory not yet faulted, or nothing at all if we find a huge pmd.
>
>         It's worth putting emphasis here on how we treat the none pmd
>         here.  If khugepaged has processed this mm's page tables
>         already, it will have left the pmd cleared (ready for refault by
>         the process).  Depending on the VMA flags and sysfs settings,
>         amount of RAM on the machine, and the current load, could be a
>         relatively common occurrence - and as such is one we'd like to
>         handle successfully in MADV_COLLAPSE.  When we see the none pmd
>         in collapse_pte_mapped_thp(), we've locked mmap_lock in write
>         and checked (a) huepaged_vma_check() to see if the backing
>         memory is appropriate still, along with VMA sizing and
>         appropriate hugepage alignment within the file, and (b) we've
>         found a hugepage head of order HPAGE_PMD_ORDER at the offset
>         in the file mapped by our hugepage-aligned virtual address.
>         Even though the common-case is likely race with khugepaged,
>         given these checks (regardless how we got here - we could be
>         operating on a completely different file than originally checked
>         in hpage_collapse_scan_file() for all we know) it should be safe
>         to directly make the pmd a huge pmd pointing to this hugepage.
>
> (2)     collapse_file() is mostly serialized on the same file extent by
>         lock sequence:
>
>                 |       lock hupepage
>                 |               lock mapping->i_pages
>                 |                       lock 1st page
>                 |               unlock mapping->i_pages
>                 |                               <page checks>
>                 |               lock mapping->i_pages
>                 |                               page_ref_freeze(3)
>                 |                               xas_store(hugepage)
>                 |               unlock mapping->i_pages
>                 |                               page_ref_unfreeze(1)
>                 |                       unlock 1st page
>                 V       unlock hugepage
>
>         Once a context (who already has their fresh hugepage locked)
>         locks mapping->i_pages exclusively, it will hold said lock
>         until it locks the first page, and it will hold that lock until
>         the after the hugepage has been added to the page cache (and
>         will unlock the hugepage after page table update, though that
>         isn't important here).
>
>         A racing context that loses the race for mapping->i_pages will
>         then lose the race to locking the first page.  Here - depending
>         on how far the other racing context has gotten - we might find
>         the new hugepage (in which case we'll exit cleanly when we
>         check PageTransCompound()), or we'll find the "old" 1st small
>         page (in which we'll exit cleanly when we discover unexpected
>         refcount of 2 after isolate_lru_page()).  This is assuming we
>         are able to successfully lock the page we find - in shmem path,
>         we could just fail the trylock and exit cleanly anyways.
>
>         Failure path in collapse_file() is similar: once we hold lock
>         on 1st small page, we are serialized against other collapse
>         contexts.  Before the 1st small page is unlocked, we add it
>         back to the pagecache and unfreeze the refcount appropriately.
>         Contexts who lost the race to the 1st small page will then find
>         the same 1st small page with the correct refcount and will be
>         able to proceed.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> ---
>  include/linux/khugepaged.h         |  13 +-
>  include/trace/events/huge_memory.h |   1 +
>  kernel/events/uprobes.c            |   2 +-
>  mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
>  4 files changed, 194 insertions(+), 60 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 384f034ae947..70162d707caf 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
>                                  unsigned long vm_flags);
>  extern void khugepaged_min_free_kbytes_update(void);
>  #ifdef CONFIG_SHMEM
> -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> +                                  bool install_pmd);
>  #else
> -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> -                                          unsigned long addr)
> +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> +                                         unsigned long addr, bool install_pmd)
>  {
> +       return 0;
>  }
>  #endif
>
> @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
>                                         unsigned long vm_flags)
>  {
>  }
> -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> -                                          unsigned long addr)
> +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> +                                         unsigned long addr, bool install_pmd)
>  {
> +       return 0;
>  }
>
>  static inline void khugepaged_min_free_kbytes_update(void)
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index fbbb25494d60..df33453b70fc 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -11,6 +11,7 @@
>         EM( SCAN_FAIL,                  "failed")                       \
>         EM( SCAN_SUCCEED,               "succeeded")                    \
>         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> +       EM( SCAN_PMD_NONE,              "pmd_none")                     \
>         EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
>         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
>         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index e0a9b945e7bc..d9e357b7e17c 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>
>         /* try collapse pmd for compound page */
>         if (!ret && orig_page_huge)
> -               collapse_pte_mapped_thp(mm, vaddr);
> +               collapse_pte_mapped_thp(mm, vaddr, false);
>
>         return ret;
>  }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 31ccf49cf279..66457a06b4e7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -29,6 +29,7 @@ enum scan_result {
>         SCAN_FAIL,
>         SCAN_SUCCEED,
>         SCAN_PMD_NULL,
> +       SCAN_PMD_NONE,
>         SCAN_PMD_MAPPED,
>         SCAN_EXCEED_NONE_PTE,
>         SCAN_EXCEED_SWAP_PTE,
> @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>         if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
>                                 cc->is_khugepaged))
>                 return SCAN_VMA_CHECK;
> +       return SCAN_SUCCEED;
> +}
> +
> +static int hugepage_vma_revalidate_anon(struct mm_struct *mm,

Do we really need a dedicated new function for anon VMAs? Can't we
add a parameter to hugepage_vma_revalidate()?
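
For example (just a sketch of the idea; the parameter name is arbitrary):

static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
                                   bool expect_anon,
                                   struct vm_area_struct **vmap,
                                   struct collapse_control *cc)

and only do the anon_vma/vma_is_anonymous() check when expect_anon is
true.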

> +                                       unsigned long address,
> +                                       struct vm_area_struct **vmap,
> +                                       struct collapse_control *cc)
> +{
> +       int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
> +
> +       if (ret != SCAN_SUCCEED)
> +               return ret;
>         /*
>          * Anon VMA expected, the address may be unmapped then
>          * remapped to file after khugepaged reaquired the mmap_lock.
> @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>          * hugepage_vma_check may return true for qualified file
>          * vmas.
>          */
> -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> -               return SCAN_VMA_CHECK;
> +       if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
> +               return SCAN_PAGE_ANON;
>         return SCAN_SUCCEED;
>  }
>
> @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
>         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
>         barrier();
>  #endif
> -       if (!pmd_present(pmde))
> -               return SCAN_PMD_NULL;
> +       if (pmd_none(pmde))
> +               return SCAN_PMD_NONE;
>         if (pmd_trans_huge(pmde))
>                 return SCAN_PMD_MAPPED;
>         if (pmd_bad(pmde))
> @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                 goto out_nolock;
>
>         mmap_read_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
>         if (result != SCAN_SUCCEED) {
>                 mmap_read_unlock(mm);
>                 goto out_nolock;
> @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          * handled by the anon_vma lock + PG_lock.
>          */
>         mmap_write_lock(mm);
> -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
>         if (result != SCAN_SUCCEED)
>                 goto out_up_write;
>         /* check if the pmd is still valid */
> @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>         slot = mm_slot_lookup(mm_slots_hash, mm);
>         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
>         if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> +               int i;
> +               /*
> +                * Multiple callers may be adding entries here.  Do a quick
> +                * check to see the entry hasn't already been added by someone
> +                * else.
> +                */
> +               for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
> +                       if (mm_slot->pte_mapped_thp[i] == addr)
> +                               goto out;

I don't quite get why we need this. I thought just khugepaged could
add the addr to the array, and MADV_COLLAPSE just handles the pte-mapped
hugepage immediately IIRC, right? If so, there is actually no change on
the khugepaged side.

>                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
>                 ret = true;
>         }
> +out:
>         spin_unlock(&khugepaged_mm_lock);
>         return ret;
>  }
>
> +/* hpage must be locked, and mmap_lock must be held in write */
> +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> +                       pmd_t *pmdp, struct page *hpage)
> +{
> +       struct vm_fault vmf = {
> +               .vma = vma,
> +               .address = addr,
> +               .flags = 0,
> +               .pmd = pmdp,
> +       };
> +
> +       VM_BUG_ON(!PageTransHuge(hpage));
> +       mmap_assert_write_locked(vma->vm_mm);
> +
> +       if (do_set_pmd(&vmf, hpage))
> +               return SCAN_FAIL;
> +
> +       get_page(hpage);
> +       return SCAN_SUCCEED;
> +}
> +
>  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>                                   unsigned long addr, pmd_t *pmdp)
>  {
> @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
>   *
>   * @mm: process address space where collapse happens
>   * @addr: THP collapse address
> + * @install_pmd: If a huge PMD should be installed
>   *
>   * This function checks whether all the PTEs in the PMD are pointing to the
>   * right THP. If so, retract the page table so the THP can refault in with
> - * as pmd-mapped.
> + * as pmd-mapped. Possibly install a huge PMD mapping the THP.
>   */
> -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> +                           bool install_pmd)
>  {
>         unsigned long haddr = addr & HPAGE_PMD_MASK;
>         struct vm_area_struct *vma = vma_lookup(mm, haddr);
> @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>
>         /* Fast check before locking page if already PMD-mapped  */
>         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> -       if (result != SCAN_SUCCEED)
> -               return;
> +       if (result == SCAN_PMD_MAPPED)
> +               return result;
>
>         if (!vma || !vma->vm_file ||
>             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -               return;
> +               return SCAN_VMA_CHECK;
>
>         /*
>          * If we are here, we've succeeded in replacing all the native pages
> @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>          * analogously elide sysfs THP settings here.
>          */
>         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> -               return;
> +               return SCAN_VMA_CHECK;
>
>         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
>         if (userfaultfd_wp(vma))
> -               return;
> +               return SCAN_PTE_UFFD_WP;
>
>         hpage = find_lock_page(vma->vm_file->f_mapping,
>                                linear_page_index(vma, haddr));
>         if (!hpage)
> -               return;
> +               return SCAN_PAGE_NULL;
>
> -       if (!PageHead(hpage))
> +       if (!PageHead(hpage)) {
> +               result = SCAN_FAIL;

I don't think you can trust that this must be a HPAGE_PMD_ORDER
hugepage anymore, since the vma might point to a different file, so a
different page cache. And the current kernel does support arbitrary
orders of large folios for the page cache. The pte traversal below may
remove rmap for the wrong page IIUC. Khugepaged should experience the
same problem as well.
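
Perhaps an explicit order check would help here, e.g. something like
(just a sketch; the exact result code is arbitrary):

if (compound_order(hpage) != HPAGE_PMD_ORDER) {
        result = SCAN_PAGE_COMPOUND;
        goto drop_hpage;
}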

>                 goto drop_hpage;
> +       }
>
> -       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> +       switch (result) {
> +       case SCAN_SUCCEED:
> +               break;
> +       case SCAN_PMD_NONE:
> +               /*
> +                * In MADV_COLLAPSE path, possible race with khugepaged where
> +                * all pte entries have been removed and pmd cleared.  If so,
> +                * skip all the pte checks and just update the pmd mapping.
> +                */
> +               goto maybe_install_pmd;
> +       default:
>                 goto drop_hpage;
> +       }
>
>         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> +       result = SCAN_FAIL;
>
>         /* step 1: check all mapped PTEs are to the right huge page */
>         for (i = 0, addr = haddr, pte = start_pte;
> @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>                         continue;
>
>                 /* page swapped out, abort */
> -               if (!pte_present(*pte))
> +               if (!pte_present(*pte)) {
> +                       result = SCAN_PTE_NON_PRESENT;
>                         goto abort;
> +               }
>
>                 page = vm_normal_page(vma, addr, *pte);
>                 if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>                 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
>         }
>
> -       /* step 4: collapse pmd */
> +       /* step 4: remove pte entries */

It also collapses and flushes the pmd.
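
Maybe something like (just a wording sketch):

	/* step 4: remove page table entries, then collapse and flush the pmd */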

>         collapse_and_free_pmd(mm, vma, haddr, pmd);
> +
> +maybe_install_pmd:
> +       /* step 5: install pmd entry */
> +       result = install_pmd
> +                       ? set_huge_pmd(vma, haddr, pmd, hpage)
> +                       : SCAN_SUCCEED;
> +
>  drop_hpage:
>         unlock_page(hpage);
>         put_page(hpage);
> -       return;
> +       return result;
>
>  abort:
>         pte_unmap_unlock(start_pte, ptl);
> @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
>                 goto out;
>
>         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> -               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
> +               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
>
>  out:
>         mm_slot->nr_pte_mapped_thp = 0;
>         mmap_write_unlock(mm);
>  }
>
> -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> +                              struct mm_struct *target_mm,
> +                              unsigned long target_addr, struct page *hpage,
> +                              struct collapse_control *cc)
>  {
>         struct vm_area_struct *vma;
> -       struct mm_struct *mm;
> -       unsigned long addr;
> -       pmd_t *pmd;
> +       int target_result = SCAN_FAIL;
>
>         i_mmap_lock_write(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +               int result = SCAN_FAIL;
> +               struct mm_struct *mm = NULL;
> +               unsigned long addr = 0;
> +               pmd_t *pmd;
> +               bool is_target = false;
> +
>                 /*
>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>                  * got written to. These VMAs are likely not worth investing
> @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                  * ptl. It has higher chance to recover THP for the VMA, but
>                  * has higher cost too.
>                  */
> -               if (vma->anon_vma)
> -                       continue;
> +               if (vma->anon_vma) {
> +                       result = SCAN_PAGE_ANON;
> +                       goto next;
> +               }
>                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> -               if (addr & ~HPAGE_PMD_MASK)
> -                       continue;
> -               if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> -                       continue;
> +               if (addr & ~HPAGE_PMD_MASK ||
> +                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> +                       result = SCAN_VMA_CHECK;
> +                       goto next;
> +               }
>                 mm = vma->vm_mm;
> -               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> -                       continue;
> +               is_target = mm == target_mm && addr == target_addr;
> +               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> +               if (result != SCAN_SUCCEED)
> +                       goto next;
>                 /*
>                  * We need exclusive mmap_lock to retract page table.
>                  *
>                  * We use trylock due to lock inversion: we need to acquire
>                  * mmap_lock while holding page lock. Fault path does it in
>                  * reverse order. Trylock is a way to avoid deadlock.
> +                *
> +                * Also, it's not MADV_COLLAPSE's job to collapse other
> +                * mappings - let khugepaged take care of them later.
>                  */
> -               if (mmap_write_trylock(mm)) {
> +               result = SCAN_PTE_MAPPED_HUGEPAGE;
> +               if ((cc->is_khugepaged || is_target) &&
> +                   mmap_write_trylock(mm)) {
>                         /*
>                          * When a vma is registered with uffd-wp, we can't
>                          * recycle the pmd pgtable because there can be pte
> @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                          * it'll always mapped in small page size for uffd-wp
>                          * registered ranges.
>                          */
> -                       if (!hpage_collapse_test_exit(mm) &&
> -                           !userfaultfd_wp(vma))
> -                               collapse_and_free_pmd(mm, vma, addr, pmd);
> +                       if (hpage_collapse_test_exit(mm)) {
> +                               result = SCAN_ANY_PROCESS;
> +                               goto unlock_next;
> +                       }
> +                       if (userfaultfd_wp(vma)) {
> +                               result = SCAN_PTE_UFFD_WP;
> +                               goto unlock_next;
> +                       }
> +                       collapse_and_free_pmd(mm, vma, addr, pmd);
> +                       if (!cc->is_khugepaged && is_target)
> +                               result = set_huge_pmd(vma, addr, pmd, hpage);
> +                       else
> +                               result = SCAN_SUCCEED;
> +
> +unlock_next:
>                         mmap_write_unlock(mm);
> -               } else {
> -                       /* Try again later */
> +                       goto next;
> +               }
> +               /*
> +                * Calling context will handle target mm/addr. Otherwise, let
> +                * khugepaged try again later.
> +                */
> +               if (!is_target) {
>                         khugepaged_add_pte_mapped_thp(mm, addr);
> +                       continue;
>                 }
> +next:
> +               if (is_target)
> +                       target_result = result;
>         }
>         i_mmap_unlock_write(mapping);
> +       return target_result;
>  }
>
>  /**
>   * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
>   *
>   * @mm: process address space where collapse happens
> + * @addr: virtual collapse start address
>   * @file: file that collapse on
>   * @start: collapse start address
>   * @cc: collapse context and scratchpad
> @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>   *    + restore gaps in the page cache;
>   *    + unlock and free huge page;
>   */
> -static int collapse_file(struct mm_struct *mm, struct file *file,
> -                        pgoff_t start, struct collapse_control *cc)
> +static int collapse_file(struct mm_struct *mm, unsigned long addr,
> +                        struct file *file, pgoff_t start,
> +                        struct collapse_control *cc)
>  {
>         struct address_space *mapping = file->f_mapping;
>         struct page *hpage;
> @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
>                 /*
>                  * Remove pte page tables, so we can re-fault the page as huge.
>                  */
> -               retract_page_tables(mapping, start);
> +               result = retract_page_tables(mapping, start, mm, addr, hpage,
> +                                            cc);
>                 unlock_page(hpage);
>                 hpage = NULL;
>         } else {
> @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
>         return result;
>  }
>
> -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> -                               pgoff_t start, struct collapse_control *cc)
> +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +                                   struct file *file, pgoff_t start,
> +                                   struct collapse_control *cc)
>  {
>         struct page *page = NULL;
>         struct address_space *mapping = file->f_mapping;
> @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> -                       result = collapse_file(mm, file, start, cc);
> +                       result = collapse_file(mm, addr, file, start, cc);
>                 }
>         }
>
> @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
>         return result;
>  }
>  #else
> -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> -                               pgoff_t start, struct collapse_control *cc)
> +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> +                                   struct file *file, pgoff_t start,
> +                                   struct collapse_control *cc)
>  {
>         BUILD_BUG();
>  }
> @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                                                 khugepaged_scan.address);
>
>                                 mmap_read_unlock(mm);
> -                               *result = khugepaged_scan_file(mm, file, pgoff,
> -                                                              cc);
> +                               *result = hpage_collapse_scan_file(mm,
> +                                                                  khugepaged_scan.address,
> +                                                                  file, pgoff, cc);
>                                 mmap_locked = false;
>                                 fput(file);
>                         } else {
> @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>
>         *prev = vma;
>
> -       /* TODO: Support file/shmem */
> -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> -               return -EINVAL;
> -
>         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
>                 return -EINVAL;
>
> @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                 }
>                 mmap_assert_locked(mm);
>                 memset(cc->node_load, 0, sizeof(cc->node_load));
> -               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> -                                                cc);
> +               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> +                       struct file *file = get_file(vma->vm_file);
> +                       pgoff_t pgoff = linear_page_index(vma, addr);
> +
> +                       mmap_read_unlock(mm);
> +                       mmap_locked = false;
> +                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> +                                                         cc);
> +                       fput(file);
> +               } else {
> +                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> +                                                        &mmap_locked, cc);
> +               }
>                 if (!mmap_locked)
>                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
>
> +handle_result:
>                 switch (result) {
>                 case SCAN_SUCCEED:
>                 case SCAN_PMD_MAPPED:
>                         ++thps;
>                         break;
> +               case SCAN_PTE_MAPPED_HUGEPAGE:
> +                       BUG_ON(mmap_locked);
> +                       BUG_ON(*prev);
> +                       mmap_write_lock(mm);
> +                       result = collapse_pte_mapped_thp(mm, addr, true);
> +                       mmap_write_unlock(mm);
> +                       goto handle_result;
>                 /* Whitelisted set of results where continuing OK */
>                 case SCAN_PMD_NULL:
>                 case SCAN_PTE_NON_PRESENT:
> --
> 2.37.2.789.g6183377224-goog
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  2022-09-07 14:45 ` [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe
@ 2022-09-16 20:41   ` Yang Shi
  2022-09-16 23:05     ` Zach O'Keefe
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2022-09-16 20:41 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
> hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
> While this change is targeted at debugging the MADV_COLLAPSE pathway, the
> "mm_khugepaged" prefix is retained for symmetry with
> huge_memory:trace_mm_khugepaged_scan_pmd, which keeps its legacy name
> to avoid changing the kernel ABI as much as possible.
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  include/trace/events/huge_memory.h | 34 ++++++++++++++++++++++++++++++
>  mm/khugepaged.c                    |  3 ++-
>  2 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index df33453b70fc..935af4947917 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -169,5 +169,39 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
>                 __entry->ret)
>  );
>
> +TRACE_EVENT(mm_khugepaged_scan_file,
> +
> +       TP_PROTO(struct mm_struct *mm, struct page *page, const char *filename,
> +                int present, int swap, int result),
> +
> +       TP_ARGS(mm, page, filename, present, swap, result),
> +
> +       TP_STRUCT__entry(
> +               __field(struct mm_struct *, mm)
> +               __field(unsigned long, pfn)
> +               __string(filename, filename)
> +               __field(int, present)
> +               __field(int, swap)
> +               __field(int, result)
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->mm = mm;
> +               __entry->pfn = page ? page_to_pfn(page) : -1;
> +               __assign_str(filename, filename);
> +               __entry->present = present;
> +               __entry->swap = swap;
> +               __entry->result = result;
> +       ),
> +
> +       TP_printk("mm=%p, scan_pfn=0x%lx, filename=%s, present=%d, swap=%d, result=%s",
> +               __entry->mm,
> +               __entry->pfn,
> +               __get_str(filename),
> +               __entry->present,
> +               __entry->swap,
> +               __print_symbolic(__entry->result, SCAN_STATUS))
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 66457a06b4e7..9325aec25abc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2152,7 +2152,8 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>                 }
>         }
>
> -       /* TODO: tracepoints */
> +       trace_mm_khugepaged_scan_file(mm, page, file->f_path.dentry->d_iname,
> +                                     present, swap, result);
>         return result;
>  }
>  #else
> --
> 2.37.2.789.g6183377224-goog
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
  2022-09-16 17:46   ` Yang Shi
@ 2022-09-16 22:22     ` Zach O'Keefe
  0 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-16 22:22 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Sep 16 10:46, Yang Shi wrote:
> On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Extend 'mm/thp: add flag to enforce sysfs THP in
> > hugepage_vma_check()' to shmem, allowing callers to ignore
> > /sys/kernel/transparent_hugepage/shmem_enabled and tmpfs huge= mount.
> >
> > This is intended to be used by MADV_COLLAPSE, and the rationale is
> > analogous to the anon/file case: MADV_COLLAPSE is not coupled to
> > directives that advise the kernel's decisions on when THPs should be
> > considered eligible. shmem/tmpfs always claims large folio support,
> > regardless of sysfs or mount options.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> 
> A nit below...
> 

Hey Yang,

Thanks for taking the time as always :)

> > ---
> >  include/linux/shmem_fs.h | 10 ++++++----
> >  mm/huge_memory.c         |  2 +-
> >  mm/shmem.c               | 18 +++++++++---------
> >  3 files changed, 16 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index f24071e3c826..d500ea967dc7 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -92,11 +92,13 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> >  extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
> >  int shmem_unuse(unsigned int type);
> >
> > -extern bool shmem_is_huge(struct vm_area_struct *vma,
> > -                         struct inode *inode, pgoff_t index);
> > -static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
> > +extern bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> > +                         pgoff_t index, bool shmem_huge_force);
> > +static inline bool shmem_huge_enabled(struct vm_area_struct *vma,
> > +                                     bool shmem_huge_force)
> >  {
> > -       return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
> > +       return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff,
> > +                            shmem_huge_force);
> >  }
> >  extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
> >  extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 7fa74b9749a6..53d170dac332 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -119,7 +119,7 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> >          * own flags.
> >          */
> >         if (!in_pf && shmem_file(vma->vm_file))
> > -               return shmem_huge_enabled(vma);
> > +               return shmem_huge_enabled(vma, !enforce_sysfs);
> >
> >         /* Enforce sysfs THP requirements as necessary */
> >         if (enforce_sysfs &&
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 99b7341bd0bf..47c42c566fd1 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -461,20 +461,20 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >
> >  static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >
> > -bool shmem_is_huge(struct vm_area_struct *vma,
> > -                  struct inode *inode, pgoff_t index)
> > +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> > +                  pgoff_t index, bool shmem_huge_force)
> >  {
> >         loff_t i_size;
> >
> >         if (!S_ISREG(inode->i_mode))
> >                 return false;
> > -       if (shmem_huge == SHMEM_HUGE_DENY)
> > -               return false;
> >         if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
> >             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
> >                 return false;
> > -       if (shmem_huge == SHMEM_HUGE_FORCE)
> > +       if (shmem_huge == SHMEM_HUGE_FORCE || shmem_huge_force)
> 
> shmem_huge_force means ignore all sysfs and mount options, so it seems
> better to test it explicitly IMHO, like:
> 
> if (shmem_huge_force)
>     return true;
> 
> if (shmem_huge == SHMEM_HUGE_FORCE)
>     return true;
> 
> 

This makes sense to me - a little bit cleaner / more direct.  Thanks for the
suggestion.
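i.e., the top of shmem_is_huge() would then read roughly like this (just a
sketch, untested):

	bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
			   pgoff_t index, bool shmem_huge_force)
	{
		if (!S_ISREG(inode->i_mode))
			return false;
		if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
		    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
			return false;
		if (shmem_huge_force)
			return true;
		if (shmem_huge == SHMEM_HUGE_FORCE)
			return true;
		if (shmem_huge == SHMEM_HUGE_DENY)
			return false;
		/* ... SHMEM_SB(inode->i_sb)->huge handling unchanged ... */
	}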

Thank you again,
Zach

> >                 return true;
> > +       if (shmem_huge == SHMEM_HUGE_DENY)
> > +               return false;
> >
> >         switch (SHMEM_SB(inode->i_sb)->huge) {
> >         case SHMEM_HUGE_ALWAYS:
> > @@ -669,8 +669,8 @@ static long shmem_unused_huge_count(struct super_block *sb,
> >
> >  #define shmem_huge SHMEM_HUGE_DENY
> >
> > -bool shmem_is_huge(struct vm_area_struct *vma,
> > -                  struct inode *inode, pgoff_t index)
> > +bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
> > +                  pgoff_t index, bool shmem_huge_force)
> >  {
> >         return false;
> >  }
> > @@ -1056,7 +1056,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
> >                         STATX_ATTR_NODUMP);
> >         generic_fillattr(&init_user_ns, inode, stat);
> >
> > -       if (shmem_is_huge(NULL, inode, 0))
> > +       if (shmem_is_huge(NULL, inode, 0, false))
> >                 stat->blksize = HPAGE_PMD_SIZE;
> >
> >         if (request_mask & STATX_BTIME) {
> > @@ -1888,7 +1888,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> >                 return 0;
> >         }
> >
> > -       if (!shmem_is_huge(vma, inode, index))
> > +       if (!shmem_is_huge(vma, inode, index, false))
> >                 goto alloc_nohuge;
> >
> >         huge_gfp = vma_thp_gfp_mask(vma);
> > --
> > 2.37.2.789.g6183377224-goog
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  2022-09-16 20:41   ` Yang Shi
@ 2022-09-16 23:05     ` Zach O'Keefe
  0 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-16 23:05 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Sep 16 13:41, Yang Shi wrote:
> On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
> > hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
> > While this change is targeted at debugging the MADV_COLLAPSE pathway, the
> > "mm_khugepaged" prefix is retained for symmetry with
> > huge_memory:trace_mm_khugepaged_scan_pmd, which keeps its legacy name
> > to avoid changing the kernel ABI as much as possible.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> 

Thanks, as always!

> > ---
> >  include/trace/events/huge_memory.h | 34 ++++++++++++++++++++++++++++++
> >  mm/khugepaged.c                    |  3 ++-
> >  2 files changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index df33453b70fc..935af4947917 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -169,5 +169,39 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> >                 __entry->ret)
> >  );
> >
> > +TRACE_EVENT(mm_khugepaged_scan_file,
> > +
> > +       TP_PROTO(struct mm_struct *mm, struct page *page, const char *filename,
> > +                int present, int swap, int result),
> > +
> > +       TP_ARGS(mm, page, filename, present, swap, result),
> > +
> > +       TP_STRUCT__entry(
> > +               __field(struct mm_struct *, mm)
> > +               __field(unsigned long, pfn)
> > +               __string(filename, filename)
> > +               __field(int, present)
> > +               __field(int, swap)
> > +               __field(int, result)
> > +       ),
> > +
> > +       TP_fast_assign(
> > +               __entry->mm = mm;
> > +               __entry->pfn = page ? page_to_pfn(page) : -1;
> > +               __assign_str(filename, filename);
> > +               __entry->present = present;
> > +               __entry->swap = swap;
> > +               __entry->result = result;
> > +       ),
> > +
> > +       TP_printk("mm=%p, scan_pfn=0x%lx, filename=%s, present=%d, swap=%d, result=%s",
> > +               __entry->mm,
> > +               __entry->pfn,
> > +               __get_str(filename),
> > +               __entry->present,
> > +               __entry->swap,
> > +               __print_symbolic(__entry->result, SCAN_STATUS))
> > +);
> > +
> >  #endif /* __HUGE_MEMORY_H */
> >  #include <trace/define_trace.h>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 66457a06b4e7..9325aec25abc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2152,7 +2152,8 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >                 }
> >         }
> >
> > -       /* TODO: tracepoints */
> > +       trace_mm_khugepaged_scan_file(mm, page, file->f_path.dentry->d_iname,
> > +                                     present, swap, result);
> >         return result;
> >  }
> >  #else
> > --
> > 2.37.2.789.g6183377224-goog
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-16 20:38   ` Yang Shi
@ 2022-09-19 15:29     ` Zach O'Keefe
  2022-09-19 17:54       ` Yang Shi
  2022-09-19 18:12       ` Yang Shi
  0 siblings, 2 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-19 15:29 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Sep 16 13:38, Yang Shi wrote:
> On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
> > memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
> >
> > On success, the backing memory will be a hugepage.  For the memory range
> > and process provided, the page tables will synchronously have a huge pmd
> > installed, mapping the THP.  Other mappings of the file extent mapped by
> > the memory range may be added to a set of entries that khugepaged will
> > later process and attempt to update their page tables to map the THP by a pmd.
> >
> > This functionality unlocks two important uses:
> >
> > (1)     Immediately back executable text by THPs.  Current support provided
> >         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
> >         system, which might keep services from serving at their full rated
> >         load after (re)starting.  Tricks like mremap(2)'ing text onto
> >         anonymous memory to immediately realize iTLB performance prevent
> >         page sharing and demand paging, both of which increase steady state
> >         memory footprint.  Now, we can have the best of both worlds: Peak
> >         upfront performance and lower RAM footprints.
> >
> > (2)     userfaultfd-based live migration of virtual machines satisfies UFFD
> >         faults by fetching native-sized pages over the network (to avoid
> >         latency of transferring an entire hugepage).  However, after guest
> >         memory has been fully copied to the new host, MADV_COLLAPSE can
> >         be used to immediately increase guest performance.
> >
> > Since khugepaged is single threaded, this change now introduces the
> > possibility of collapse contexts racing in the file collapse path.  There
> > are a few important places to consider:
> >
> > (1)     hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
> >         We could have the memory collapsed out from under us, but
> >         the next xas_for_each() iteration will correctly pick up the
> >         hugepage.  The hugepage might not be up to date (insofar as
> >         page may still be locked), but regardless of what small page index
> >         page still may be locked), but regardless what small page index
> >         we were iterating over, we'll find the hugepage and identify it
> >         as a suitably aligned compound page of order HPAGE_PMD_ORDER.
> >
> >         In the khugepaged path, we locklessly check the value of the pmd,
> >         and only add it to the deferred collapse array if we find a pmd
> >         mapping a pte table. This is fine, since other values that could
> >         have raced in right afterwards denote failure, or that the
> >         memory was successfully collapsed, so we don't need further
> >         processing.
> >
> >         In the madvise path, we'll take mmap_lock in write to serialize
> >         against page table updates and will know what to do based on the
> >         true value of the pmd: recheck all ptes if it points to a pte table;
> >         directly install the pmd if the pmd has been cleared but the
> >         memory not yet refaulted; or do nothing at all if we find a huge pmd.
> >
> >         It's worth putting emphasis on how we treat the none pmd
> >         here.  If khugepaged has processed this mm's page tables
> >         already, it will have left the pmd cleared (ready for refault by
> >         Depending on the VMA flags and sysfs settings, the amount of RAM on
> >         the machine, and the current load, this could be a relatively common
> >         occurrence - and as such it is one we'd like to handle successfully
> >         handle successfully in MADV_COLLAPSE.  When we see the none pmd
> >         in collapse_pte_mapped_thp(), we've locked mmap_lock in write
> >         and checked (a) hugepage_vma_check() to see if the backing
> >         memory is still appropriate, along with VMA sizing and
> >         appropriate hugepage alignment within the file, and (b) we've
> >         found a hugepage head of order HPAGE_PMD_ORDER at the offset
> >         in the file mapped by our hugepage-aligned virtual address.
> >         Even though the common case is likely a race with khugepaged,
> >         given these checks (regardless of how we got here - we could be
> >         operating on a completely different file than originally checked
> >         in hpage_collapse_scan_file() for all we know) it should be safe
> >         to directly make the pmd a huge pmd pointing to this hugepage.
> >
> > (2)     collapse_file() is mostly serialized on the same file extent by
> >         lock sequence:
> >
> >                 |       lock hugepage
> >                 |               lock mapping->i_pages
> >                 |                       lock 1st page
> >                 |               unlock mapping->i_pages
> >                 |                               <page checks>
> >                 |               lock mapping->i_pages
> >                 |                               page_ref_freeze(3)
> >                 |                               xas_store(hugepage)
> >                 |               unlock mapping->i_pages
> >                 |                               page_ref_unfreeze(1)
> >                 |                       unlock 1st page
> >                 V       unlock hugepage
> >
> >         Once a context (which already has its fresh hugepage locked)
> >         locks mapping->i_pages exclusively, it will hold said lock
> >         until it locks the first page, and it will hold that lock until
> >         after the hugepage has been added to the page cache (and
> >         will unlock the hugepage after page table update, though that
> >         isn't important here).
> >
> >         A racing context that loses the race for mapping->i_pages will
> >         then lose the race to locking the first page.  Here - depending
> >         on how far the other racing context has gotten - we might find
> >         the new hugepage (in which case we'll exit cleanly when we
> >         check PageTransCompound()), or we'll find the "old" 1st small
> >         page (in which case we'll exit cleanly when we discover an
> >         unexpected refcount of 2 after isolate_lru_page()).  This is
> >         assuming we are able to successfully lock the page we find - in
> >         the shmem path, we could just fail the trylock and exit cleanly
> >         anyway.
> >
> >         The failure path in collapse_file() is similar: once we hold the
> >         lock on the 1st small page, we are serialized against other collapse
> >         contexts.  Before the 1st small page is unlocked, we add it
> >         back to the pagecache and unfreeze the refcount appropriately.
> >         Contexts that lost the race to the 1st small page will then find
> >         the same 1st small page with the correct refcount and will be
> >         able to proceed.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > ---
> >  include/linux/khugepaged.h         |  13 +-
> >  include/trace/events/huge_memory.h |   1 +
> >  kernel/events/uprobes.c            |   2 +-
> >  mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
> >  4 files changed, 194 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 384f034ae947..70162d707caf 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
> >                                  unsigned long vm_flags);
> >  extern void khugepaged_min_free_kbytes_update(void);
> >  #ifdef CONFIG_SHMEM
> > -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> > +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > +                                  bool install_pmd);
> >  #else
> > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > -                                          unsigned long addr)
> > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > +                                         unsigned long addr, bool install_pmd)
> >  {
> > +       return 0;
> >  }
> >  #endif
> >
> > @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
> >                                         unsigned long vm_flags)
> >  {
> >  }
> > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > -                                          unsigned long addr)
> > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > +                                         unsigned long addr, bool install_pmd)
> >  {
> > +       return 0;
> >  }
> >
> >  static inline void khugepaged_min_free_kbytes_update(void)
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index fbbb25494d60..df33453b70fc 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -11,6 +11,7 @@
> >         EM( SCAN_FAIL,                  "failed")                       \
> >         EM( SCAN_SUCCEED,               "succeeded")                    \
> >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > +       EM( SCAN_PMD_NONE,              "pmd_none")                     \
> >         EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index e0a9b945e7bc..d9e357b7e17c 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >
> >         /* try collapse pmd for compound page */
> >         if (!ret && orig_page_huge)
> > -               collapse_pte_mapped_thp(mm, vaddr);
> > +               collapse_pte_mapped_thp(mm, vaddr, false);
> >
> >         return ret;
> >  }
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 31ccf49cf279..66457a06b4e7 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -29,6 +29,7 @@ enum scan_result {
> >         SCAN_FAIL,
> >         SCAN_SUCCEED,
> >         SCAN_PMD_NULL,
> > +       SCAN_PMD_NONE,
> >         SCAN_PMD_MAPPED,
> >         SCAN_EXCEED_NONE_PTE,
> >         SCAN_EXCEED_SWAP_PTE,
> > @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> >                                 cc->is_khugepaged))
> >                 return SCAN_VMA_CHECK;
> > +       return SCAN_SUCCEED;
> > +}
> > +
> > +static int hugepage_vma_revalidate_anon(struct mm_struct *mm,

Hey Yang,

Thanks for taking the time to review this series - particularly this patch,
which I found tricky.

> 
> Do we really need a dedicated new function just for anon vmas? Can't we
> add a parameter to hugepage_vma_revalidate()?
> 

Good point - at some point I think I utilized it more, but you're right that
it's overkill now.  I've added an "expect_anon" argument to
hugepage_vma_revalidate().  Thanks for the suggestion.
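i.e., roughly (sketch of the reworked signature, untested; the anon check
just folds in under the new flag):

	static int hugepage_vma_revalidate(struct mm_struct *mm,
					   unsigned long address,
					   bool expect_anon,
					   struct vm_area_struct **vmap,
					   struct collapse_control *cc)
	{
		/* ... existing VMA lookup + hugepage_vma_check() as before ... */
		if (expect_anon &&
		    (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
			return SCAN_PAGE_ANON;
		return SCAN_SUCCEED;
	}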

> > +                                       unsigned long address,
> > +                                       struct vm_area_struct **vmap,
> > +                                       struct collapse_control *cc)
> > +{
> > +       int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
> > +
> > +       if (ret != SCAN_SUCCEED)
> > +               return ret;
> >         /*
> >          * Anon VMA expected, the address may be unmapped then
> >          * remapped to file after khugepaged reaquired the mmap_lock.
> > @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >          * hugepage_vma_check may return true for qualified file
> >          * vmas.
> >          */
> > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > -               return SCAN_VMA_CHECK;
> > +       if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
> > +               return SCAN_PAGE_ANON;
> >         return SCAN_SUCCEED;
> >  }
> >
> > @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> >         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> >         barrier();
> >  #endif
> > -       if (!pmd_present(pmde))
> > -               return SCAN_PMD_NULL;
> > +       if (pmd_none(pmde))
> > +               return SCAN_PMD_NONE;
> >         if (pmd_trans_huge(pmde))
> >                 return SCAN_PMD_MAPPED;
> >         if (pmd_bad(pmde))
> > @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                 goto out_nolock;
> >
> >         mmap_read_lock(mm);
> > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> >         if (result != SCAN_SUCCEED) {
> >                 mmap_read_unlock(mm);
> >                 goto out_nolock;
> > @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          * handled by the anon_vma lock + PG_lock.
> >          */
> >         mmap_write_lock(mm);
> > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> >         if (result != SCAN_SUCCEED)
> >                 goto out_up_write;
> >         /* check if the pmd is still valid */
> > @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> >         slot = mm_slot_lookup(mm_slots_hash, mm);
> >         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> >         if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> > +               int i;
> > +               /*
> > +                * Multiple callers may be adding entries here.  Do a quick
> > +                * check to see the entry hasn't already been added by someone
> > +                * else.
> > +                */
> > +               for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
> > +                       if (mm_slot->pte_mapped_thp[i] == addr)
> > +                               goto out;
> 
> I don't quite get why we need this. I assumed just khugepaged could
> add the addr to the array, and MADV_COLLAPSE just handles the pte-mapped
> hugepage immediately IIRC, right? If so there is actually no change on
> the khugepaged side.
>

So you're right to say that this change isn't needed.  The "multi-add"
sequence is:

(1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
    emptying the A's ->pte_mapped_thp[] array.
(2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
    retract_page_tables() finds a VMA in mm_struct A mapping the same extent
    (at virtual address X) and adds an entry (for X) into mm_struct A's
    ->pte-mapped_thp[] array.
(3) khugepaged calls hpage_collapse_scan_file() for mm_struct A at X,
    sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry (for X)
    into mm_struct A's ->pte-mapped_thp[] array.

Which is somewhat contrived/rare - but it can occur.  If we don't have this,
the second time we call collapse_pte_mapped_thp() for the same
mm_struct/address, we should take the "if (result == SCAN_PMD_MAPPED) {...}"
branch early and return before grabbing any other locks (we already have
exclusive mmap_lock).  So, perhaps we can drop this check?

> >                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> >                 ret = true;
> >         }
> > +out:
> >         spin_unlock(&khugepaged_mm_lock);
> >         return ret;
> >  }
> >
> > +/* hpage must be locked, and mmap_lock must be held in write */
> > +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > +                       pmd_t *pmdp, struct page *hpage)
> > +{
> > +       struct vm_fault vmf = {
> > +               .vma = vma,
> > +               .address = addr,
> > +               .flags = 0,
> > +               .pmd = pmdp,
> > +       };
> > +
> > +       VM_BUG_ON(!PageTransHuge(hpage));
> > +       mmap_assert_write_locked(vma->vm_mm);
> > +
> > +       if (do_set_pmd(&vmf, hpage))
> > +               return SCAN_FAIL;
> > +
> > +       get_page(hpage);
> > +       return SCAN_SUCCEED;
> > +}
> > +
> >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> >                                   unsigned long addr, pmd_t *pmdp)
> >  {
> > @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> >   *
> >   * @mm: process address space where collapse happens
> >   * @addr: THP collapse address
> > + * @install_pmd: If a huge PMD should be installed
> >   *
> >   * This function checks whether all the PTEs in the PMD are pointing to the
> >   * right THP. If so, retract the page table so the THP can refault in with
> > - * as pmd-mapped.
> > + * as pmd-mapped. Possibly install a huge PMD mapping the THP.
> >   */
> > -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > +                           bool install_pmd)
> >  {
> >         unsigned long haddr = addr & HPAGE_PMD_MASK;
> >         struct vm_area_struct *vma = vma_lookup(mm, haddr);
> > @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >
> >         /* Fast check before locking page if already PMD-mapped  */
> >         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > -       if (result != SCAN_SUCCEED)
> > -               return;
> > +       if (result == SCAN_PMD_MAPPED)
> > +               return result;
> >
> >         if (!vma || !vma->vm_file ||
> >             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> > -               return;
> > +               return SCAN_VMA_CHECK;
> >
> >         /*
> >          * If we are here, we've succeeded in replacing all the native pages
> > @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >          * analogously elide sysfs THP settings here.
> >          */
> >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > -               return;
> > +               return SCAN_VMA_CHECK;
> >
> >         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> >         if (userfaultfd_wp(vma))
> > -               return;
> > +               return SCAN_PTE_UFFD_WP;
> >
> >         hpage = find_lock_page(vma->vm_file->f_mapping,
> >                                linear_page_index(vma, haddr));
> >         if (!hpage)
> > -               return;
> > +               return SCAN_PAGE_NULL;
> >
> > -       if (!PageHead(hpage))
> > +       if (!PageHead(hpage)) {
> > +               result = SCAN_FAIL;
> 
> I don't think you can trust that this is a HPAGE_PMD_ORDER hugepage
> anymore, since the vma might point to a different file, and so a different
> page cache. And the current kernel does support arbitrary orders of
> large folios for the page cache. [...]

Good catch! Yes, I think we need to double check HPAGE_PMD_ORDER here,
and that applies equally to khugepaged as well.

> [...] The pte traversal below may remove rmap for
> the wrong page IIUC. Khugepaged should experience the same problem as
> well.
>

Just to confirm, you mean this is only a danger if we don't check the compound
order, correct? I.e. if compound_order < HPAGE_PMD_ORDER  we'll iterate over
ptes that map something other than our compound page and erroneously adjust rmap
for the wrong pages.  So, adding a check for compound_order == HPAGE_PMD_ORDER above
alleviates this possibility.
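E.g. something along these lines on top of this patch (just a sketch,
untested; the exact scan result code is arbitrary):

	if (!PageHead(hpage) || compound_order(hpage) != HPAGE_PMD_ORDER) {
		result = SCAN_FAIL;
		goto drop_hpage;
	}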

> >                 goto drop_hpage;
> > +       }
> >
> > -       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> > +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > +       switch (result) {
> > +       case SCAN_SUCCEED:
> > +               break;
> > +       case SCAN_PMD_NONE:
> > +               /*
> > +                * In MADV_COLLAPSE path, possible race with khugepaged where
> > +                * all pte entries have been removed and pmd cleared.  If so,
> > +                * skip all the pte checks and just update the pmd mapping.
> > +                */
> > +               goto maybe_install_pmd;
> > +       default:
> >                 goto drop_hpage;
> > +       }
> >
> >         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > +       result = SCAN_FAIL;
> >
> >         /* step 1: check all mapped PTEs are to the right huge page */
> >         for (i = 0, addr = haddr, pte = start_pte;
> > @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >                         continue;
> >
> >                 /* page swapped out, abort */
> > -               if (!pte_present(*pte))
> > +               if (!pte_present(*pte)) {
> > +                       result = SCAN_PTE_NON_PRESENT;
> >                         goto abort;
> > +               }
> >
> >                 page = vm_normal_page(vma, addr, *pte);
> >                 if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >                 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> >         }
> >
> > -       /* step 4: collapse pmd */
> > +       /* step 4: remove pte entries */
> 
> It also collapses and flushes the pmd.
>

True, will update the comment.

Thanks again for your time,
Zach

> >         collapse_and_free_pmd(mm, vma, haddr, pmd);
> > +
> > +maybe_install_pmd:
> > +       /* step 5: install pmd entry */
> > +       result = install_pmd
> > +                       ? set_huge_pmd(vma, haddr, pmd, hpage)
> > +                       : SCAN_SUCCEED;
> > +
> >  drop_hpage:
> >         unlock_page(hpage);
> >         put_page(hpage);
> > -       return;
> > +       return result;
> >
> >  abort:
> >         pte_unmap_unlock(start_pte, ptl);
> > @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> >                 goto out;
> >
> >         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> > -               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
> > +               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
> >
> >  out:
> >         mm_slot->nr_pte_mapped_thp = 0;
> >         mmap_write_unlock(mm);
> >  }
> >
> > -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > +                              struct mm_struct *target_mm,
> > +                              unsigned long target_addr, struct page *hpage,
> > +                              struct collapse_control *cc)
> >  {
> >         struct vm_area_struct *vma;
> > -       struct mm_struct *mm;
> > -       unsigned long addr;
> > -       pmd_t *pmd;
> > +       int target_result = SCAN_FAIL;
> >
> >         i_mmap_lock_write(mapping);
> >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > +               int result = SCAN_FAIL;
> > +               struct mm_struct *mm = NULL;
> > +               unsigned long addr = 0;
> > +               pmd_t *pmd;
> > +               bool is_target = false;
> > +
> >                 /*
> >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> >                  * got written to. These VMAs are likely not worth investing
> > @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >                  * ptl. It has higher chance to recover THP for the VMA, but
> >                  * has higher cost too.
> >                  */
> > -               if (vma->anon_vma)
> > -                       continue;
> > +               if (vma->anon_vma) {
> > +                       result = SCAN_PAGE_ANON;
> > +                       goto next;
> > +               }
> >                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > -               if (addr & ~HPAGE_PMD_MASK)
> > -                       continue;
> > -               if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> > -                       continue;
> > +               if (addr & ~HPAGE_PMD_MASK ||
> > +                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> > +                       result = SCAN_VMA_CHECK;
> > +                       goto next;
> > +               }
> >                 mm = vma->vm_mm;
> > -               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > -                       continue;
> > +               is_target = mm == target_mm && addr == target_addr;
> > +               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> > +               if (result != SCAN_SUCCEED)
> > +                       goto next;
> >                 /*
> >                  * We need exclusive mmap_lock to retract page table.
> >                  *
> >                  * We use trylock due to lock inversion: we need to acquire
> >                  * mmap_lock while holding page lock. Fault path does it in
> >                  * reverse order. Trylock is a way to avoid deadlock.
> > +                *
> > +                * Also, it's not MADV_COLLAPSE's job to collapse other
> > +                * mappings - let khugepaged take care of them later.
> >                  */
> > -               if (mmap_write_trylock(mm)) {
> > +               result = SCAN_PTE_MAPPED_HUGEPAGE;
> > +               if ((cc->is_khugepaged || is_target) &&
> > +                   mmap_write_trylock(mm)) {
> >                         /*
> >                          * When a vma is registered with uffd-wp, we can't
> >                          * recycle the pmd pgtable because there can be pte
> > @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >                          * it'll always mapped in small page size for uffd-wp
> >                          * registered ranges.
> >                          */
> > -                       if (!hpage_collapse_test_exit(mm) &&
> > -                           !userfaultfd_wp(vma))
> > -                               collapse_and_free_pmd(mm, vma, addr, pmd);
> > +                       if (hpage_collapse_test_exit(mm)) {
> > +                               result = SCAN_ANY_PROCESS;
> > +                               goto unlock_next;
> > +                       }
> > +                       if (userfaultfd_wp(vma)) {
> > +                               result = SCAN_PTE_UFFD_WP;
> > +                               goto unlock_next;
> > +                       }
> > +                       collapse_and_free_pmd(mm, vma, addr, pmd);
> > +                       if (!cc->is_khugepaged && is_target)
> > +                               result = set_huge_pmd(vma, addr, pmd, hpage);
> > +                       else
> > +                               result = SCAN_SUCCEED;
> > +
> > +unlock_next:
> >                         mmap_write_unlock(mm);
> > -               } else {
> > -                       /* Try again later */
> > +                       goto next;
> > +               }
> > +               /*
> > +                * Calling context will handle target mm/addr. Otherwise, let
> > +                * khugepaged try again later.
> > +                */
> > +               if (!is_target) {
> >                         khugepaged_add_pte_mapped_thp(mm, addr);
> > +                       continue;
> >                 }
> > +next:
> > +               if (is_target)
> > +                       target_result = result;
> >         }
> >         i_mmap_unlock_write(mapping);
> > +       return target_result;
> >  }
> >
> >  /**
> >   * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
> >   *
> >   * @mm: process address space where collapse happens
> > + * @addr: virtual collapse start address
> >   * @file: file that collapse on
> >   * @start: collapse start address
> >   * @cc: collapse context and scratchpad
> > @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >   *    + restore gaps in the page cache;
> >   *    + unlock and free huge page;
> >   */
> > -static int collapse_file(struct mm_struct *mm, struct file *file,
> > -                        pgoff_t start, struct collapse_control *cc)
> > +static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > +                        struct file *file, pgoff_t start,
> > +                        struct collapse_control *cc)
> >  {
> >         struct address_space *mapping = file->f_mapping;
> >         struct page *hpage;
> > @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> >                 /*
> >                  * Remove pte page tables, so we can re-fault the page as huge.
> >                  */
> > -               retract_page_tables(mapping, start);
> > +               result = retract_page_tables(mapping, start, mm, addr, hpage,
> > +                                            cc);
> >                 unlock_page(hpage);
> >                 hpage = NULL;
> >         } else {
> > @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> >         return result;
> >  }
> >
> > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > -                               pgoff_t start, struct collapse_control *cc)
> > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > +                                   struct file *file, pgoff_t start,
> > +                                   struct collapse_control *cc)
> >  {
> >         struct page *page = NULL;
> >         struct address_space *mapping = file->f_mapping;
> > @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                         result = SCAN_EXCEED_NONE_PTE;
> >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >                 } else {
> > -                       result = collapse_file(mm, file, start, cc);
> > +                       result = collapse_file(mm, addr, file, start, cc);
> >                 }
> >         }
> >
> > @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >         return result;
> >  }
> >  #else
> > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > -                               pgoff_t start, struct collapse_control *cc)
> > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > +                                   struct file *file, pgoff_t start,
> > +                                   struct collapse_control *cc)
> >  {
> >         BUILD_BUG();
> >  }
> > @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                                 khugepaged_scan.address);
> >
> >                                 mmap_read_unlock(mm);
> > -                               *result = khugepaged_scan_file(mm, file, pgoff,
> > -                                                              cc);
> > +                               *result = hpage_collapse_scan_file(mm,
> > +                                                                  khugepaged_scan.address,
> > +                                                                  file, pgoff, cc);
> >                                 mmap_locked = false;
> >                                 fput(file);
> >                         } else {
> > @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >
> >         *prev = vma;
> >
> > -       /* TODO: Support file/shmem */
> > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > -               return -EINVAL;
> > -
> >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> >                 return -EINVAL;
> >
> > @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                 }
> >                 mmap_assert_locked(mm);
> >                 memset(cc->node_load, 0, sizeof(cc->node_load));
> > -               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > -                                                cc);
> > +               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> > +                       struct file *file = get_file(vma->vm_file);
> > +                       pgoff_t pgoff = linear_page_index(vma, addr);
> > +
> > +                       mmap_read_unlock(mm);
> > +                       mmap_locked = false;
> > +                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > +                                                         cc);
> > +                       fput(file);
> > +               } else {
> > +                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> > +                                                        &mmap_locked, cc);
> > +               }
> >                 if (!mmap_locked)
> >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> >
> > +handle_result:
> >                 switch (result) {
> >                 case SCAN_SUCCEED:
> >                 case SCAN_PMD_MAPPED:
> >                         ++thps;
> >                         break;
> > +               case SCAN_PTE_MAPPED_HUGEPAGE:
> > +                       BUG_ON(mmap_locked);
> > +                       BUG_ON(*prev);
> > +                       mmap_write_lock(mm);
> > +                       result = collapse_pte_mapped_thp(mm, addr, true);
> > +                       mmap_write_unlock(mm);
> > +                       goto handle_result;
> >                 /* Whitelisted set of results where continuing OK */
> >                 case SCAN_PMD_NULL:
> >                 case SCAN_PTE_NON_PRESENT:
> > --
> > 2.37.2.789.g6183377224-goog
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
  2022-09-16 18:26   ` Yang Shi
@ 2022-09-19 15:36     ` Zach O'Keefe
  0 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-19 15:36 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Sep 16 11:26, Yang Shi wrote:
> On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > The main benefit of THPs is that they can be mapped at the pmd level,
> > increasing the likelihood of a TLB hit and spending fewer cycles in page
> > table walks.  pte-mapped hugepages - that is, hugepage-aligned compound
> > pages of order HPAGE_PMD_ORDER mapped by ptes - although being
> > contiguous in physical memory, don't have this advantage.  In fact, one
> > could argue they are detrimental to system performance overall since
> > they occupy a precious hugepage-aligned/sized region of physical memory
> > that could otherwise be used more effectively.  Additionally, pte-mapped
> > hugepages can be the cheapest memory to collapse for khugepaged since no
> > new hugepage allocation or copying of memory contents is necessary - we
> > only need to update the mapping page tables.
> >
> > In the anonymous collapse path, we are able to collapse pte-mapped
> > hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
> > effort when compound pages (of any order) are encountered.
> >
> > Identify pte-mapped hugepages in the file/shmem collapse path, the
> > final step of which makes a racy check of the value of the pmd to ensure
> > it maps a pte table.  This should be fine, since races that result in
> > false-positive (i.e. attempt collapse even though we sholdn't) will fail
> 
> s/sholdn't/shouldn't
> 

Oops - good catch, thank you.

> > later in collapse_pte_mapped_thp() once we actually lock mmap_lock and
> > reinspect the pmd value.  Races that result in false-negatives (i.e.
> > where we decide to not attempt collapse, but should have) shouldn't be
> > an issue, since in the worst case, we do nothing - which is what we've
> > done up to this point.  We make a similar check in retract_page_tables().
> > If we do think we've found a pte-mapped hugepage in khugepaged context,
> > attempt to update page tables mapping this hugepage.
> >
> > Note that these collapses still count towards the
> > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter, and
> > if the pte-mapped hugepage was also mapped into multiple processes' address
> > spaces, could be incremented for each page table update.  Since we
> > increment the counter when a pte-mapped hugepage is successfully added to
> > the list of to-collapse pte-mapped THPs, it's possible that we never
> > actually update the page table either.  This is different from how
> > file/shmem pages_collapsed accounting works today where only a successful
> > page cache update is counted (it's also possible here that no page tables
> > are actually changed).  Though it incurs some slop, this is preferred to
> > either not accounting for the event at all, or plumbing through data in
> > struct mm_slot on whether to account for the collapse or not.
> 
> I don't have a strong preference on this. Typically it is used to tell
> the users khugepaged is making progress. We have thp_collapse_alloc
> from /proc/vmstat to account for how many huge pages are really allocated
> by khugepaged/MADV_COLLAPSE.
> 
> But it may be better to add a note in the document
> (Documentation/admin-guide/mm/transhuge.rst) to make it more explicit.
> 

Good point. Have gone ahead and done exactly that - thanks for the suggestion.
> >
> > Also note that work still needs to be done to support arbitrary compound
> > pages, and that this should all be converted to using folios.
> >
> > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> 
> Other than the above comments and two nits below, the patch looks good
> to me. Reviewed-by: Yang Shi <shy828301@gmail.com>
>

Thank you, and thanks for taking the time to review this!

> > ---
> >  include/trace/events/huge_memory.h |  1 +
> >  mm/khugepaged.c                    | 67 +++++++++++++++++++++++++++---
> >  2 files changed, 62 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 55392bf30a03..fbbb25494d60 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -17,6 +17,7 @@
> >         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
> >         EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")              \
> >         EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")                  \
> > +       EM( SCAN_PTE_MAPPED_HUGEPAGE,   "pte_mapped_hugepage")          \
> >         EM( SCAN_PAGE_RO,               "no_writable_page")             \
> >         EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")         \
> >         EM( SCAN_PAGE_NULL,             "page_null")                    \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 55c8625ed950..31ccf49cf279 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -35,6 +35,7 @@ enum scan_result {
> >         SCAN_EXCEED_SHARED_PTE,
> >         SCAN_PTE_NON_PRESENT,
> >         SCAN_PTE_UFFD_WP,
> > +       SCAN_PTE_MAPPED_HUGEPAGE,
> >         SCAN_PAGE_RO,
> >         SCAN_LACK_REFERENCED_PAGE,
> >         SCAN_PAGE_NULL,
> > @@ -1318,20 +1319,24 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >   * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
> >   * khugepaged should try to collapse the page table.
> >   */
> > -static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> >                                           unsigned long addr)
> >  {
> >         struct khugepaged_mm_slot *mm_slot;
> >         struct mm_slot *slot;
> > +       bool ret = false;
> >
> >         VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
> >
> >         spin_lock(&khugepaged_mm_lock);
> >         slot = mm_slot_lookup(mm_slots_hash, mm);
> >         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> > -       if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
> > +       if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> >                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> > +               ret = true;
> > +       }
> >         spin_unlock(&khugepaged_mm_lock);
> > +       return ret;
> >  }
> >
> >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > @@ -1368,9 +1373,16 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> >         pte_t *start_pte, *pte;
> >         pmd_t *pmd;
> >         spinlock_t *ptl;
> > -       int count = 0;
> > +       int count = 0, result = SCAN_FAIL;
> >         int i;
> >
> > +       mmap_assert_write_locked(mm);
> > +
> > +       /* Fast check before locking page if already PMD-mapped  */
> 
> It also backs off if the page is not mapped at all. So better to
> reflect this in the comment too.
>

This is a little awkward, since the next patch makes this check:

	if (result == SCAN_PTE_MAPPED_HUGEPAGE)
		return;

Which does what the comment says - but for the sake of someone looking at
just this patch in the future, I'll update the comment for this patch and
change it in the next one.  Thanks for the suggestion.
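For this patch, roughly something like the following - the wording is only a
sketch, not the final comment text:

	/*
	 * Fast check before locking the page: back off if the memory is
	 * already PMD-mapped, or if it is not mapped at all.
	 */
	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
	if (result != SCAN_SUCCEED)
		return;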

> > +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > +       if (result != SCAN_SUCCEED)
> > +               return;
> > +
> >         if (!vma || !vma->vm_file ||
> >             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> >                 return;
> > @@ -1721,9 +1733,16 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> >                 /*
> >                  * If file was truncated then extended, or hole-punched, before
> >                  * we locked the first page, then a THP might be there already.
> > +                * This will be discovered on the first iteration.
> >                  */
> >                 if (PageTransCompound(page)) {
> > -                       result = SCAN_PAGE_COMPOUND;
> > +                       struct page *head = compound_head(page);
> > +
> > +                       result = compound_order(head) == HPAGE_PMD_ORDER &&
> > +                                       head->index == start
> > +                                       /* Maybe PMD-mapped */
> > +                                       ? SCAN_PTE_MAPPED_HUGEPAGE
> > +                                       : SCAN_PAGE_COMPOUND;
> >                         goto out_unlock;
> >                 }
> >
> > @@ -1961,7 +1980,19 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >                  * into a PMD sized page
> >                  */
> 
> The comment starts with "XXX:"; better to rephrase it to "TODO:", which
> seems more understandable.
>

Agreed. Done.

> >                 if (PageTransCompound(page)) {
> > -                       result = SCAN_PAGE_COMPOUND;
> > +                       struct page *head = compound_head(page);
> > +
> > +                       result = compound_order(head) == HPAGE_PMD_ORDER &&
> > +                                       head->index == start
> > +                                       /* Maybe PMD-mapped */
> > +                                       ? SCAN_PTE_MAPPED_HUGEPAGE
> > +                                       : SCAN_PAGE_COMPOUND;
> > +                       /*
> > +                        * For SCAN_PTE_MAPPED_HUGEPAGE, further processing
> > +                        * by the caller won't touch the page cache, and so
> > +                        * it's safe to skip LRU and refcount checks before
> > +                        * returning.
> > +                        */
> >                         break;
> >                 }
> >
> > @@ -2021,6 +2052,12 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> >  static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> >  {
> >  }
> > +
> > +static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > +                                         unsigned long addr)
> > +{
> > +       return false;
> > +}
> >  #endif
> >
> >  static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > @@ -2115,8 +2152,26 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                                                                   &mmap_locked,
> >                                                                   cc);
> >                         }
> > -                       if (*result == SCAN_SUCCEED)
> > +                       switch (*result) {
> > +                       case SCAN_PTE_MAPPED_HUGEPAGE: {
> > +                               pmd_t *pmd;
> > +
> > +                               *result = find_pmd_or_thp_or_none(mm,
> > +                                                                 khugepaged_scan.address,
> > +                                                                 &pmd);
> > +                               if (*result != SCAN_SUCCEED)
> > +                                       break;
> > +                               if (!khugepaged_add_pte_mapped_thp(mm,
> > +                                                                  khugepaged_scan.address))
> > +                                       break;
> > +                       } fallthrough;
> > +                       case SCAN_SUCCEED:
> >                                 ++khugepaged_pages_collapsed;
> > +                               break;
> > +                       default:
> > +                               break;
> > +                       }
> > +
> >                         /* move to next address */
> >                         khugepaged_scan.address += HPAGE_PMD_SIZE;
> >                         progress += HPAGE_PMD_NR;
> > --
> > 2.37.2.789.g6183377224-goog
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-19 15:29     ` Zach O'Keefe
@ 2022-09-19 17:54       ` Yang Shi
  2022-09-19 18:12       ` Yang Shi
  1 sibling, 0 replies; 22+ messages in thread
From: Yang Shi @ 2022-09-19 17:54 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Mon, Sep 19, 2022 at 8:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Sep 16 13:38, Yang Shi wrote:
> > On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
> > > memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
> > >
> > > On success, the backing memory will be a hugepage.  For the memory range
> > > and process provided, the page tables will synchronously have a huge pmd
> > > installed, mapping the THP.  Other mappings of the file extent mapped by
> > > the memory range may be added to a set of entries that khugepaged will
> > > later process and attempt to update their page tables to map the THP by a pmd.
> > >
> > > This functionality unlocks two important uses:
> > >
> > > (1)     Immediately back executable text by THPs.  Current support provided
> > >         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
> > >         system, which might keep services from serving at their full rated
> > >         load after (re)starting.  Tricks like mremap(2)'ing text onto
> > >         anonymous memory to immediately realize iTLB performance prevent
> > >         page sharing and demand paging, both of which increase steady state
> > >         memory footprint.  Now, we can have the best of both worlds: Peak
> > >         upfront performance and lower RAM footprints.
> > >
> > > (2)     userfaultfd-based live migration of virtual machines satisfies UFFD
> > >         faults by fetching native-sized pages over the network (to avoid
> > >         latency of transferring an entire hugepage).  However, after guest
> > >         memory has been fully copied to the new host, MADV_COLLAPSE can
> > >         be used to immediately increase guest performance.
> > >
> > > Since khugepaged is single threaded, this change now introduces
> > > the possibility of collapse contexts racing in the file collapse path.
> > > There are a few important places to consider:
> > >
> > > (1)     hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
> > >         We could have the memory collapsed out from under us, but
> > >         the next xas_for_each() iteration will correctly pick up the
> > >         hugepage.  The hugepage might not be up to date (insofar as
> > >         copying of small page contents might not have completed - the
> > >         page still may be locked), but regardless of what small page index
> > >         we were iterating over, we'll find the hugepage and identify it
> > >         as a suitably aligned compound page of order HPAGE_PMD_ORDER.
> > >
> > >         In khugepaged path, we locklessly check the value of the pmd,
> > >         and only add it to deferred collapse array if we find pmd
> > >         mapping pte table. This is fine, since other values that could
> > >         have raced in right afterwards denote failure, or that the
> > >         memory was successfully collapsed, so we don't need further
> > >         processing.
> > >
> > >         In madvise path, we'll take mmap_lock() in write to serialize
> > >         against page table updates and will know what to do based on the
> > >         true value of the pmd: recheck all ptes if we point to a pte table,
> > >         directly install the pmd if the pmd has been cleared but the memory
> > >         has not yet been faulted, or do nothing at all if we find a huge pmd.
> > >
> > >         It's worth putting emphasis on how we treat the none pmd
> > >         here.  If khugepaged has processed this mm's page tables
> > >         already, it will have left the pmd cleared (ready for refault by
> > >         the process).  Depending on the VMA flags and sysfs settings,
> > >         amount of RAM on the machine, and the current load, this could be a
> > >         relatively common occurrence - and as such is one we'd like to
> > >         handle successfully in MADV_COLLAPSE.  When we see the none pmd
> > >         in collapse_pte_mapped_thp(), we've locked mmap_lock in write
> > >         and checked (a) hugepage_vma_check() to see if the backing
> > >         memory is appropriate still, along with VMA sizing and
> > >         appropriate hugepage alignment within the file, and (b) we've
> > >         found a hugepage head of order HPAGE_PMD_ORDER at the offset
> > >         in the file mapped by our hugepage-aligned virtual address.
> > >         Even though the common case is likely a race with khugepaged,
> > >         given these checks (regardless of how we got here - we could be
> > >         operating on a completely different file than originally checked
> > >         in hpage_collapse_scan_file() for all we know) it should be safe
> > >         to directly make the pmd a huge pmd pointing to this hugepage.
> > >
> > > (2)     collapse_file() is mostly serialized on the same file extent by
> > >         lock sequence:
> > >
> > >                 |       lock hugepage
> > >                 |               lock mapping->i_pages
> > >                 |                       lock 1st page
> > >                 |               unlock mapping->i_pages
> > >                 |                               <page checks>
> > >                 |               lock mapping->i_pages
> > >                 |                               page_ref_freeze(3)
> > >                 |                               xas_store(hugepage)
> > >                 |               unlock mapping->i_pages
> > >                 |                               page_ref_unfreeze(1)
> > >                 |                       unlock 1st page
> > >                 V       unlock hugepage
> > >
> > >         Once a context (who already has their fresh hugepage locked)
> > >         locks mapping->i_pages exclusively, it will hold said lock
> > >         until it locks the first page, and it will hold that lock until
> > >         after the hugepage has been added to the page cache (and
> > >         will unlock the hugepage after page table update, though that
> > >         isn't important here).
> > >
> > >         A racing context that loses the race for mapping->i_pages will
> > >         then lose the race to locking the first page.  Here - depending
> > >         on how far the other racing context has gotten - we might find
> > >         the new hugepage (in which case we'll exit cleanly when we
> > >         check PageTransCompound()), or we'll find the "old" 1st small
> > >         page (in which case we'll exit cleanly when we discover an unexpected
> > >         refcount of 2 after isolate_lru_page()).  This is assuming we
> > >         are able to successfully lock the page we find - in shmem path,
> > >         we could just fail the trylock and exit cleanly anyways.
> > >
> > >         Failure path in collapse_file() is similar: once we hold lock
> > >         on 1st small page, we are serialized against other collapse
> > >         contexts.  Before the 1st small page is unlocked, we add it
> > >         back to the pagecache and unfreeze the refcount appropriately.
> > >         Contexts who lost the race to the 1st small page will then find
> > >         the same 1st small page with the correct refcount and will be
> > >         able to proceed.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  include/linux/khugepaged.h         |  13 +-
> > >  include/trace/events/huge_memory.h |   1 +
> > >  kernel/events/uprobes.c            |   2 +-
> > >  mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
> > >  4 files changed, 194 insertions(+), 60 deletions(-)
> > >
> > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > index 384f034ae947..70162d707caf 100644
> > > --- a/include/linux/khugepaged.h
> > > +++ b/include/linux/khugepaged.h
> > > @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                                  unsigned long vm_flags);
> > >  extern void khugepaged_min_free_kbytes_update(void);
> > >  #ifdef CONFIG_SHMEM
> > > -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> > > +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > +                                  bool install_pmd);
> > >  #else
> > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > -                                          unsigned long addr)
> > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > +                                         unsigned long addr, bool install_pmd)
> > >  {
> > > +       return 0;
> > >  }
> > >  #endif
> > >
> > > @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                                         unsigned long vm_flags)
> > >  {
> > >  }
> > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > -                                          unsigned long addr)
> > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > +                                         unsigned long addr, bool install_pmd)
> > >  {
> > > +       return 0;
> > >  }
> > >
> > >  static inline void khugepaged_min_free_kbytes_update(void)
> > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > index fbbb25494d60..df33453b70fc 100644
> > > --- a/include/trace/events/huge_memory.h
> > > +++ b/include/trace/events/huge_memory.h
> > > @@ -11,6 +11,7 @@
> > >         EM( SCAN_FAIL,                  "failed")                       \
> > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > +       EM( SCAN_PMD_NONE,              "pmd_none")                     \
> > >         EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index e0a9b945e7bc..d9e357b7e17c 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > >
> > >         /* try collapse pmd for compound page */
> > >         if (!ret && orig_page_huge)
> > > -               collapse_pte_mapped_thp(mm, vaddr);
> > > +               collapse_pte_mapped_thp(mm, vaddr, false);
> > >
> > >         return ret;
> > >  }
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 31ccf49cf279..66457a06b4e7 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -29,6 +29,7 @@ enum scan_result {
> > >         SCAN_FAIL,
> > >         SCAN_SUCCEED,
> > >         SCAN_PMD_NULL,
> > > +       SCAN_PMD_NONE,
> > >         SCAN_PMD_MAPPED,
> > >         SCAN_EXCEED_NONE_PTE,
> > >         SCAN_EXCEED_SWAP_PTE,
> > > @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> > >                                 cc->is_khugepaged))
> > >                 return SCAN_VMA_CHECK;
> > > +       return SCAN_SUCCEED;
> > > +}
> > > +
> > > +static int hugepage_vma_revalidate_anon(struct mm_struct *mm,
>
> Hey Yang,
>
> Thanks for taking the time to review this series - particularly this patch,
> which I found tricky.
>
> >
> > Do we really need a new function for anon vma dedicatedly? Can't we
> > add a parameter to hugepage_vma_revalidate()?
> >
>
> Good point - at some point I think I utilized it more, but you're right that
> it's overkill now.  Have added an "expect_anon" argument to
> hugepage_vma_revalidate().  Thanks for the suggestions.
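
For concreteness, that direction might look roughly like the sketch below -
the "expect_anon" name comes from the reply above, the leading revalidation
steps are filled in from the existing helper, and the exact final form may
differ:

static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
				   bool expect_anon,
				   struct vm_area_struct **vmap,
				   struct collapse_control *cc)
{
	struct vm_area_struct *vma;

	if (unlikely(hpage_collapse_test_exit(mm)))
		return SCAN_ANY_PROCESS;

	*vmap = vma = find_vma(mm, address);
	if (!vma)
		return SCAN_VMA_NULL;

	if (!transhuge_vma_suitable(vma, address))
		return SCAN_ADDRESS_RANGE;
	if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
				cc->is_khugepaged))
		return SCAN_VMA_CHECK;
	/*
	 * Anon VMA expected; the address may have been unmapped and then
	 * remapped to file after khugepaged reacquired the mmap_lock, and
	 * hugepage_vma_check() may return true for qualified file vmas.
	 */
	if (expect_anon && (!vma->anon_vma || !vma_is_anonymous(vma)))
		return SCAN_PAGE_ANON;
	return SCAN_SUCCEED;
}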
>
> > > +                                       unsigned long address,
> > > +                                       struct vm_area_struct **vmap,
> > > +                                       struct collapse_control *cc)
> > > +{
> > > +       int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
> > > +
> > > +       if (ret != SCAN_SUCCEED)
> > > +               return ret;
> > >         /*
> > >          * Anon VMA expected, the address may be unmapped then
> > >          * remapped to file after khugepaged reaquired the mmap_lock.
> > > @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >          * hugepage_vma_check may return true for qualified file
> > >          * vmas.
> > >          */
> > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > -               return SCAN_VMA_CHECK;
> > > +       if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
> > > +               return SCAN_PAGE_ANON;
> > >         return SCAN_SUCCEED;
> > >  }
> > >
> > > @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > >         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > >         barrier();
> > >  #endif
> > > -       if (!pmd_present(pmde))
> > > -               return SCAN_PMD_NULL;
> > > +       if (pmd_none(pmde))
> > > +               return SCAN_PMD_NONE;
> > >         if (pmd_trans_huge(pmde))
> > >                 return SCAN_PMD_MAPPED;
> > >         if (pmd_bad(pmde))
> > > @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >                 goto out_nolock;
> > >
> > >         mmap_read_lock(mm);
> > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > >         if (result != SCAN_SUCCEED) {
> > >                 mmap_read_unlock(mm);
> > >                 goto out_nolock;
> > > @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >          * handled by the anon_vma lock + PG_lock.
> > >          */
> > >         mmap_write_lock(mm);
> > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > >         if (result != SCAN_SUCCEED)
> > >                 goto out_up_write;
> > >         /* check if the pmd is still valid */
> > > @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > >         slot = mm_slot_lookup(mm_slots_hash, mm);
> > >         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> > >         if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> > > +               int i;
> > > +               /*
> > > +                * Multiple callers may be adding entries here.  Do a quick
> > > +                * check to see the entry hasn't already been added by someone
> > > +                * else.
> > > +                */
> > > +               for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
> > > +                       if (mm_slot->pte_mapped_thp[i] == addr)
> > > +                               goto out;
> >
> > I don't quite get why we need this. I'm supposed just khugepaged could
> > add the addr to the array and MADV_COLLAPSE just handles pte-mapped
> > hugepage immediately IIRC, right? If so there is actually no change on
> > khugepaged side.
> >
>
> So you're right to say that this change isn't needed.  The "multi-add"
> sequence is:
>
> (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
>     emptying the A's ->pte_mapped_thp[] array.
> (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
>     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
>     (at virtual address X) and adds an entry (for X) into mm_struct A's
>     ->pte_mapped_thp[] array.
> (3) khugepaged calls hpage_collapse_scan_file() for mm_struct A at X,
>     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry (for X)
>     into mm_struct A's ->pte_mapped_thp[] array.
>
> Which is somewhat contrived/rare - but it can occur.  If we don't have this,
> the second time we call collapse_pte_mapped_thp() for the same
> mm_struct/address, we should take the "if (result == SCAN_PMD_MAPPED) {...}"
> branch early and return before grabbing any other locks (we already have
> exclusive mmap_lock).  So, perhaps we can drop this check?

Thanks for elaborating the case. The follow-up question is how often
it happens and whether it is worth it or not?

>
> > >                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> > >                 ret = true;
> > >         }
> > > +out:
> > >         spin_unlock(&khugepaged_mm_lock);
> > >         return ret;
> > >  }
> > >
> > > +/* hpage must be locked, and mmap_lock must be held in write */
> > > +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > +                       pmd_t *pmdp, struct page *hpage)
> > > +{
> > > +       struct vm_fault vmf = {
> > > +               .vma = vma,
> > > +               .address = addr,
> > > +               .flags = 0,
> > > +               .pmd = pmdp,
> > > +       };
> > > +
> > > +       VM_BUG_ON(!PageTransHuge(hpage));
> > > +       mmap_assert_write_locked(vma->vm_mm);
> > > +
> > > +       if (do_set_pmd(&vmf, hpage))
> > > +               return SCAN_FAIL;
> > > +
> > > +       get_page(hpage);
> > > +       return SCAN_SUCCEED;
> > > +}
> > > +
> > >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > >                                   unsigned long addr, pmd_t *pmdp)
> > >  {
> > > @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> > >   *
> > >   * @mm: process address space where collapse happens
> > >   * @addr: THP collapse address
> > > + * @install_pmd: If a huge PMD should be installed
> > >   *
> > >   * This function checks whether all the PTEs in the PMD are pointing to the
> > >   * right THP. If so, retract the page table so the THP can refault in with
> > > - * as pmd-mapped.
> > > + * as pmd-mapped. Possibly install a huge PMD mapping the THP.
> > >   */
> > > -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > +                           bool install_pmd)
> > >  {
> > >         unsigned long haddr = addr & HPAGE_PMD_MASK;
> > >         struct vm_area_struct *vma = vma_lookup(mm, haddr);
> > > @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >
> > >         /* Fast check before locking page if already PMD-mapped  */
> > >         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > -       if (result != SCAN_SUCCEED)
> > > -               return;
> > > +       if (result == SCAN_PMD_MAPPED)
> > > +               return result;
> > >
> > >         if (!vma || !vma->vm_file ||
> > >             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> > > -               return;
> > > +               return SCAN_VMA_CHECK;
> > >
> > >         /*
> > >          * If we are here, we've succeeded in replacing all the native pages
> > > @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >          * analogously elide sysfs THP settings here.
> > >          */
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > > -               return;
> > > +               return SCAN_VMA_CHECK;
> > >
> > >         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > >         if (userfaultfd_wp(vma))
> > > -               return;
> > > +               return SCAN_PTE_UFFD_WP;
> > >
> > >         hpage = find_lock_page(vma->vm_file->f_mapping,
> > >                                linear_page_index(vma, haddr));
> > >         if (!hpage)
> > > -               return;
> > > +               return SCAN_PAGE_NULL;
> > >
> > > -       if (!PageHead(hpage))
> > > +       if (!PageHead(hpage)) {
> > > +               result = SCAN_FAIL;
> >
> > I don't think you could trust this must be a HPAGE_PMD_ORDER hugepage
> > anymore since the vma might point to a different file, so a different
> > page cache. And the current kernel does support arbitrary order of
> > large folios for page cache. [...]
>
> Good catch! Yes, I think we need to double check HPAGE_PMD_ORDER here,
> and that applies equally to khugepaged as well.
>
> > [...] The below pte traverse may remove rmap for
> > the wrong page IIUC. Khugepaged should experience the same problem as
> > well.
> >
>
> Just to confirm, you mean this is only a danger if we don't check the compound
> order, correct? I.e. if compound_order < HPAGE_PMD_ORDER  we'll iterate over
> ptes that map something other than our compound page and erroneously adjust rmap
> for wrong pages.  So, adding a check for compound_order == HPAGE_PMD_ORDER above
> alleviates this possibility.

Yes, exactly. And we can't install PMD for less than HPAGE_PMD_ORDER hugepage.
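
Concretely, something along these lines in collapse_pte_mapped_thp() - just a
sketch, the exact result code and placement are up to the patch:

	/*
	 * Only proceed when we found a PMD-order head page; the pte walk
	 * below must not touch rmap for, or install a pmd over, a smaller
	 * folio.
	 */
	if (!PageHead(hpage) || compound_order(hpage) != HPAGE_PMD_ORDER) {
		result = SCAN_FAIL;
		goto drop_hpage;
	}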

>
> > >                 goto drop_hpage;
> > > +       }
> > >
> > > -       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> > > +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > +       switch (result) {
> > > +       case SCAN_SUCCEED:
> > > +               break;
> > > +       case SCAN_PMD_NONE:
> > > +               /*
> > > +                * In MADV_COLLAPSE path, possible race with khugepaged where
> > > +                * all pte entries have been removed and pmd cleared.  If so,
> > > +                * skip all the pte checks and just update the pmd mapping.
> > > +                */
> > > +               goto maybe_install_pmd;
> > > +       default:
> > >                 goto drop_hpage;
> > > +       }
> > >
> > >         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > > +       result = SCAN_FAIL;
> > >
> > >         /* step 1: check all mapped PTEs are to the right huge page */
> > >         for (i = 0, addr = haddr, pte = start_pte;
> > > @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >                         continue;
> > >
> > >                 /* page swapped out, abort */
> > > -               if (!pte_present(*pte))
> > > +               if (!pte_present(*pte)) {
> > > +                       result = SCAN_PTE_NON_PRESENT;
> > >                         goto abort;
> > > +               }
> > >
> > >                 page = vm_normal_page(vma, addr, *pte);
> > >                 if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > > @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >                 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> > >         }
> > >
> > > -       /* step 4: collapse pmd */
> > > +       /* step 4: remove pte entries */
> >
> > It also collapses and flushes pmd.
> >
>
> True, will update the comment.
>
> Thanks again for your time,
> Zach
>
> > >         collapse_and_free_pmd(mm, vma, haddr, pmd);
> > > +
> > > +maybe_install_pmd:
> > > +       /* step 5: install pmd entry */
> > > +       result = install_pmd
> > > +                       ? set_huge_pmd(vma, haddr, pmd, hpage)
> > > +                       : SCAN_SUCCEED;
> > > +
> > >  drop_hpage:
> > >         unlock_page(hpage);
> > >         put_page(hpage);
> > > -       return;
> > > +       return result;
> > >
> > >  abort:
> > >         pte_unmap_unlock(start_pte, ptl);
> > > @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> > >                 goto out;
> > >
> > >         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> > > -               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
> > > +               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
> > >
> > >  out:
> > >         mm_slot->nr_pte_mapped_thp = 0;
> > >         mmap_write_unlock(mm);
> > >  }
> > >
> > > -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > +                              struct mm_struct *target_mm,
> > > +                              unsigned long target_addr, struct page *hpage,
> > > +                              struct collapse_control *cc)
> > >  {
> > >         struct vm_area_struct *vma;
> > > -       struct mm_struct *mm;
> > > -       unsigned long addr;
> > > -       pmd_t *pmd;
> > > +       int target_result = SCAN_FAIL;
> > >
> > >         i_mmap_lock_write(mapping);
> > >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > +               int result = SCAN_FAIL;
> > > +               struct mm_struct *mm = NULL;
> > > +               unsigned long addr = 0;
> > > +               pmd_t *pmd;
> > > +               bool is_target = false;
> > > +
> > >                 /*
> > >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > >                  * got written to. These VMAs are likely not worth investing
> > > @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >                  * ptl. It has higher chance to recover THP for the VMA, but
> > >                  * has higher cost too.
> > >                  */
> > > -               if (vma->anon_vma)
> > > -                       continue;
> > > +               if (vma->anon_vma) {
> > > +                       result = SCAN_PAGE_ANON;
> > > +                       goto next;
> > > +               }
> > >                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > -               if (addr & ~HPAGE_PMD_MASK)
> > > -                       continue;
> > > -               if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> > > -                       continue;
> > > +               if (addr & ~HPAGE_PMD_MASK ||
> > > +                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> > > +                       result = SCAN_VMA_CHECK;
> > > +                       goto next;
> > > +               }
> > >                 mm = vma->vm_mm;
> > > -               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > > -                       continue;
> > > +               is_target = mm == target_mm && addr == target_addr;
> > > +               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> > > +               if (result != SCAN_SUCCEED)
> > > +                       goto next;
> > >                 /*
> > >                  * We need exclusive mmap_lock to retract page table.
> > >                  *
> > >                  * We use trylock due to lock inversion: we need to acquire
> > >                  * mmap_lock while holding page lock. Fault path does it in
> > >                  * reverse order. Trylock is a way to avoid deadlock.
> > > +                *
> > > +                * Also, it's not MADV_COLLAPSE's job to collapse other
> > > +                * mappings - let khugepaged take care of them later.
> > >                  */
> > > -               if (mmap_write_trylock(mm)) {
> > > +               result = SCAN_PTE_MAPPED_HUGEPAGE;
> > > +               if ((cc->is_khugepaged || is_target) &&
> > > +                   mmap_write_trylock(mm)) {
> > >                         /*
> > >                          * When a vma is registered with uffd-wp, we can't
> > >                          * recycle the pmd pgtable because there can be pte
> > > @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >                          * it'll always mapped in small page size for uffd-wp
> > >                          * registered ranges.
> > >                          */
> > > -                       if (!hpage_collapse_test_exit(mm) &&
> > > -                           !userfaultfd_wp(vma))
> > > -                               collapse_and_free_pmd(mm, vma, addr, pmd);
> > > +                       if (hpage_collapse_test_exit(mm)) {
> > > +                               result = SCAN_ANY_PROCESS;
> > > +                               goto unlock_next;
> > > +                       }
> > > +                       if (userfaultfd_wp(vma)) {
> > > +                               result = SCAN_PTE_UFFD_WP;
> > > +                               goto unlock_next;
> > > +                       }
> > > +                       collapse_and_free_pmd(mm, vma, addr, pmd);
> > > +                       if (!cc->is_khugepaged && is_target)
> > > +                               result = set_huge_pmd(vma, addr, pmd, hpage);
> > > +                       else
> > > +                               result = SCAN_SUCCEED;
> > > +
> > > +unlock_next:
> > >                         mmap_write_unlock(mm);
> > > -               } else {
> > > -                       /* Try again later */
> > > +                       goto next;
> > > +               }
> > > +               /*
> > > +                * Calling context will handle target mm/addr. Otherwise, let
> > > +                * khugepaged try again later.
> > > +                */
> > > +               if (!is_target) {
> > >                         khugepaged_add_pte_mapped_thp(mm, addr);
> > > +                       continue;
> > >                 }
> > > +next:
> > > +               if (is_target)
> > > +                       target_result = result;
> > >         }
> > >         i_mmap_unlock_write(mapping);
> > > +       return target_result;
> > >  }
> > >
> > >  /**
> > >   * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
> > >   *
> > >   * @mm: process address space where collapse happens
> > > + * @addr: virtual collapse start address
> > >   * @file: file that collapse on
> > >   * @start: collapse start address
> > >   * @cc: collapse context and scratchpad
> > > @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >   *    + restore gaps in the page cache;
> > >   *    + unlock and free huge page;
> > >   */
> > > -static int collapse_file(struct mm_struct *mm, struct file *file,
> > > -                        pgoff_t start, struct collapse_control *cc)
> > > +static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > +                        struct file *file, pgoff_t start,
> > > +                        struct collapse_control *cc)
> > >  {
> > >         struct address_space *mapping = file->f_mapping;
> > >         struct page *hpage;
> > > @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > >                 /*
> > >                  * Remove pte page tables, so we can re-fault the page as huge.
> > >                  */
> > > -               retract_page_tables(mapping, start);
> > > +               result = retract_page_tables(mapping, start, mm, addr, hpage,
> > > +                                            cc);
> > >                 unlock_page(hpage);
> > >                 hpage = NULL;
> > >         } else {
> > > @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > >         return result;
> > >  }
> > >
> > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > -                               pgoff_t start, struct collapse_control *cc)
> > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > +                                   struct file *file, pgoff_t start,
> > > +                                   struct collapse_control *cc)
> > >  {
> > >         struct page *page = NULL;
> > >         struct address_space *mapping = file->f_mapping;
> > > @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > >                         result = SCAN_EXCEED_NONE_PTE;
> > >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > >                 } else {
> > > -                       result = collapse_file(mm, file, start, cc);
> > > +                       result = collapse_file(mm, addr, file, start, cc);
> > >                 }
> > >         }
> > >
> > > @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > >         return result;
> > >  }
> > >  #else
> > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > -                               pgoff_t start, struct collapse_control *cc)
> > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > +                                   struct file *file, pgoff_t start,
> > > +                                   struct collapse_control *cc)
> > >  {
> > >         BUILD_BUG();
> > >  }
> > > @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >                                                 khugepaged_scan.address);
> > >
> > >                                 mmap_read_unlock(mm);
> > > -                               *result = khugepaged_scan_file(mm, file, pgoff,
> > > -                                                              cc);
> > > +                               *result = hpage_collapse_scan_file(mm,
> > > +                                                                  khugepaged_scan.address,
> > > +                                                                  file, pgoff, cc);
> > >                                 mmap_locked = false;
> > >                                 fput(file);
> > >                         } else {
> > > @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >
> > >         *prev = vma;
> > >
> > > -       /* TODO: Support file/shmem */
> > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > -               return -EINVAL;
> > > -
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > >                 return -EINVAL;
> > >
> > > @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >                 }
> > >                 mmap_assert_locked(mm);
> > >                 memset(cc->node_load, 0, sizeof(cc->node_load));
> > > -               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > > -                                                cc);
> > > +               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> > > +                       struct file *file = get_file(vma->vm_file);
> > > +                       pgoff_t pgoff = linear_page_index(vma, addr);
> > > +
> > > +                       mmap_read_unlock(mm);
> > > +                       mmap_locked = false;
> > > +                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > > +                                                         cc);
> > > +                       fput(file);
> > > +               } else {
> > > +                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> > > +                                                        &mmap_locked, cc);
> > > +               }
> > >                 if (!mmap_locked)
> > >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > >
> > > +handle_result:
> > >                 switch (result) {
> > >                 case SCAN_SUCCEED:
> > >                 case SCAN_PMD_MAPPED:
> > >                         ++thps;
> > >                         break;
> > > +               case SCAN_PTE_MAPPED_HUGEPAGE:
> > > +                       BUG_ON(mmap_locked);
> > > +                       BUG_ON(*prev);
> > > +                       mmap_write_lock(mm);
> > > +                       result = collapse_pte_mapped_thp(mm, addr, true);
> > > +                       mmap_write_unlock(mm);
> > > +                       goto handle_result;
> > >                 /* Whitelisted set of results where continuing OK */
> > >                 case SCAN_PMD_NULL:
> > >                 case SCAN_PTE_NON_PRESENT:
> > > --
> > > 2.37.2.789.g6183377224-goog
> > >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-19 15:29     ` Zach O'Keefe
  2022-09-19 17:54       ` Yang Shi
@ 2022-09-19 18:12       ` Yang Shi
  2022-09-21 18:26         ` Zach O'Keefe
  1 sibling, 1 reply; 22+ messages in thread
From: Yang Shi @ 2022-09-19 18:12 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Mon, Sep 19, 2022 at 8:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Sep 16 13:38, Yang Shi wrote:
> > On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> > >
> > > Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
> > > memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
> > >
> > > On success, the backing memory will be a hugepage.  For the memory range
> > > and process provided, the page tables will synchronously have a huge pmd
> > > installed, mapping the THP.  Other mappings of the file extent mapped by
> > > the memory range may be added to a set of entries that khugepaged will
> > > later process and attempt to update their page tables to map the THP by a pmd.
> > >
> > > This functionality unlocks two important uses:
> > >
> > > (1)     Immediately back executable text by THPs.  Current support provided
> > >         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
> > >         system, which might keep services from serving at their full rated
> > >         load after (re)starting.  Tricks like mremap(2)'ing text onto
> > >         anonymous memory to immediately realize iTLB performance prevent
> > >         page sharing and demand paging, both of which increase steady state
> > >         memory footprint.  Now, we can have the best of both worlds: Peak
> > >         upfront performance and lower RAM footprints.
> > >
> > > (2)     userfaultfd-based live migration of virtual machines satisfies UFFD
> > >         faults by fetching native-sized pages over the network (to avoid
> > >         latency of transferring an entire hugepage).  However, after guest
> > >         memory has been fully copied to the new host, MADV_COLLAPSE can
> > >         be used to immediately increase guest performance.
> > >
> > > Since khugepaged is single threaded, this change now introduces
> > > the possibility of collapse contexts racing in the file collapse path.
> > > There are a few important places to consider:
> > >
> > > (1)     hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
> > >         We could have the memory collapsed out from under us, but
> > >         the next xas_for_each() iteration will correctly pick up the
> > >         hugepage.  The hugepage might not be up to date (insofar as
> > >         copying of small page contents might not have completed - the
> > >         page still may be locked), but regardless of what small page index
> > >         we were iterating over, we'll find the hugepage and identify it
> > >         as a suitably aligned compound page of order HPAGE_PMD_ORDER.
> > >
> > >         In khugepaged path, we locklessly check the value of the pmd,
> > >         and only add it to deferred collapse array if we find pmd
> > >         mapping pte table. This is fine, since other values that could
> > >         have raced in right afterwards denote failure, or that the
> > >         memory was successfully collapsed, so we don't need further
> > >         processing.
> > >
> > >         In madvise path, we'll take mmap_lock() in write to serialize
> > >         against page table updates and will know what to do based on the
> > >         true value of the pmd: recheck all ptes if we point to a pte table,
> > >         directly install the pmd if the pmd has been cleared but the memory
> > >         has not yet been faulted, or do nothing at all if we find a huge pmd.
> > >
> > >         It's worth putting emphasis on how we treat the none pmd
> > >         here.  If khugepaged has processed this mm's page tables
> > >         already, it will have left the pmd cleared (ready for refault by
> > >         the process).  Depending on the VMA flags and sysfs settings,
> > >         amount of RAM on the machine, and the current load, could be a
> > >         relatively common occurrence - and as such is one we'd like to
> > >         handle successfully in MADV_COLLAPSE.  When we see the none pmd
> > >         in collapse_pte_mapped_thp(), we've locked mmap_lock in write
> > >         and checked (a) hugepage_vma_check() to see if the backing
> > >         memory is appropriate still, along with VMA sizing and
> > >         appropriate hugepage alignment within the file, and (b) we've
> > >         found a hugepage head of order HPAGE_PMD_ORDER at the offset
> > >         in the file mapped by our hugepage-aligned virtual address.
> > >         Even though the common case is likely a race with khugepaged,
> > >         given these checks (regardless how we got here - we could be
> > >         operating on a completely different file than originally checked
> > >         in hpage_collapse_scan_file() for all we know) it should be safe
> > >         to directly make the pmd a huge pmd pointing to this hugepage.
> > >
> > > (2)     collapse_file() is mostly serialized on the same file extent by
> > >         lock sequence:
> > >
> > >                 |       lock hugepage
> > >                 |               lock mapping->i_pages
> > >                 |                       lock 1st page
> > >                 |               unlock mapping->i_pages
> > >                 |                               <page checks>
> > >                 |               lock mapping->i_pages
> > >                 |                               page_ref_freeze(3)
> > >                 |                               xas_store(hugepage)
> > >                 |               unlock mapping->i_pages
> > >                 |                               page_ref_unfreeze(1)
> > >                 |                       unlock 1st page
> > >                 V       unlock hugepage
> > >
> > >         Once a context (who already has their fresh hugepage locked)
> > >         locks mapping->i_pages exclusively, it will hold said lock
> > >         until it locks the first page, and it will hold that lock until
> > >         after the hugepage has been added to the page cache (and
> > >         will unlock the hugepage after page table update, though that
> > >         isn't important here).
> > >
> > >         A racing context that loses the race for mapping->i_pages will
> > >         then lose the race to locking the first page.  Here - depending
> > >         on how far the other racing context has gotten - we might find
> > >         the new hugepage (in which case we'll exit cleanly when we
> > >         check PageTransCompound()), or we'll find the "old" 1st small
> > >         page (in which case we'll exit cleanly when we discover an unexpected
> > >         refcount of 2 after isolate_lru_page()).  This is assuming we
> > >         are able to successfully lock the page we find - in shmem path,
> > >         we could just fail the trylock and exit cleanly anyways.
> > >
> > >         Failure path in collapse_file() is similar: once we hold lock
> > >         on 1st small page, we are serialized against other collapse
> > >         contexts.  Before the 1st small page is unlocked, we add it
> > >         back to the pagecache and unfreeze the refcount appropriately.
> > >         Contexts who lost the race to the 1st small page will then find
> > >         the same 1st small page with the correct refcount and will be
> > >         able to proceed.
> > >
> > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > ---
> > >  include/linux/khugepaged.h         |  13 +-
> > >  include/trace/events/huge_memory.h |   1 +
> > >  kernel/events/uprobes.c            |   2 +-
> > >  mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
> > >  4 files changed, 194 insertions(+), 60 deletions(-)
> > >
> > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > index 384f034ae947..70162d707caf 100644
> > > --- a/include/linux/khugepaged.h
> > > +++ b/include/linux/khugepaged.h
> > > @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                                  unsigned long vm_flags);
> > >  extern void khugepaged_min_free_kbytes_update(void);
> > >  #ifdef CONFIG_SHMEM
> > > -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> > > +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > +                                  bool install_pmd);
> > >  #else
> > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > -                                          unsigned long addr)
> > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > +                                         unsigned long addr, bool install_pmd)
> > >  {
> > > +       return 0;
> > >  }
> > >  #endif
> > >
> > > @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                                         unsigned long vm_flags)
> > >  {
> > >  }
> > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > -                                          unsigned long addr)
> > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > +                                         unsigned long addr, bool install_pmd)
> > >  {
> > > +       return 0;
> > >  }
> > >
> > >  static inline void khugepaged_min_free_kbytes_update(void)
> > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > index fbbb25494d60..df33453b70fc 100644
> > > --- a/include/trace/events/huge_memory.h
> > > +++ b/include/trace/events/huge_memory.h
> > > @@ -11,6 +11,7 @@
> > >         EM( SCAN_FAIL,                  "failed")                       \
> > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > +       EM( SCAN_PMD_NONE,              "pmd_none")                     \
> > >         EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index e0a9b945e7bc..d9e357b7e17c 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > >
> > >         /* try collapse pmd for compound page */
> > >         if (!ret && orig_page_huge)
> > > -               collapse_pte_mapped_thp(mm, vaddr);
> > > +               collapse_pte_mapped_thp(mm, vaddr, false);
> > >
> > >         return ret;
> > >  }
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 31ccf49cf279..66457a06b4e7 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -29,6 +29,7 @@ enum scan_result {
> > >         SCAN_FAIL,
> > >         SCAN_SUCCEED,
> > >         SCAN_PMD_NULL,
> > > +       SCAN_PMD_NONE,
> > >         SCAN_PMD_MAPPED,
> > >         SCAN_EXCEED_NONE_PTE,
> > >         SCAN_EXCEED_SWAP_PTE,
> > > @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> > >                                 cc->is_khugepaged))
> > >                 return SCAN_VMA_CHECK;
> > > +       return SCAN_SUCCEED;
> > > +}
> > > +
> > > +static int hugepage_vma_revalidate_anon(struct mm_struct *mm,
>
> Hey Yang,
>
> Thanks for taking the time to review this series - particularly this patch,
> which I found tricky.
>
> >
> > Do we really need a new function for anon vma dedicatedly? Can't we
> > add a parameter to hugepage_vma_revalidate()?
> >
>
> Good point - at some point I think I utilized it more, but you're right that
> it's overkill now.  Have added an "expect_anon" argument to
> hugepage_vma_revalidate().  Thanks for the suggestions.
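>
> To be concrete, roughly the shape I have in mind - just a sketch of the
> direction for v4, details may still change:
>
>   static int hugepage_vma_revalidate(struct mm_struct *mm,
>                                      unsigned long address, bool expect_anon,
>                                      struct vm_area_struct **vmap,
>                                      struct collapse_control *cc)
>   {
>           struct vm_area_struct *vma;
>
>           if (unlikely(hpage_collapse_test_exit(mm)))
>                   return SCAN_ANY_PROCESS;
>
>           *vmap = vma = find_vma(mm, address);
>           if (!vma)
>                   return SCAN_VMA_NULL;
>           if (!transhuge_vma_suitable(vma, address))
>                   return SCAN_ADDRESS_RANGE;
>           if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
>                                   cc->is_khugepaged))
>                   return SCAN_VMA_CHECK;
>           /*
>            * Anon VMA expected: the address may have been unmapped and
>            * then remapped to a file after khugepaged reacquired mmap_lock.
>            */
>           if (expect_anon && (!vma->anon_vma || !vma_is_anonymous(vma)))
>                   return SCAN_PAGE_ANON;
>           return SCAN_SUCCEED;
>   }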
>
> > > +                                       unsigned long address,
> > > +                                       struct vm_area_struct **vmap,
> > > +                                       struct collapse_control *cc)
> > > +{
> > > +       int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
> > > +
> > > +       if (ret != SCAN_SUCCEED)
> > > +               return ret;
> > >         /*
> > >          * Anon VMA expected, the address may be unmapped then
> > >          * remapped to file after khugepaged reaquired the mmap_lock.
> > > @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >          * hugepage_vma_check may return true for qualified file
> > >          * vmas.
> > >          */
> > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > -               return SCAN_VMA_CHECK;
> > > +       if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
> > > +               return SCAN_PAGE_ANON;
> > >         return SCAN_SUCCEED;
> > >  }
> > >
> > > @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > >         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > >         barrier();
> > >  #endif
> > > -       if (!pmd_present(pmde))
> > > -               return SCAN_PMD_NULL;
> > > +       if (pmd_none(pmde))
> > > +               return SCAN_PMD_NONE;
> > >         if (pmd_trans_huge(pmde))
> > >                 return SCAN_PMD_MAPPED;
> > >         if (pmd_bad(pmde))
> > > @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >                 goto out_nolock;
> > >
> > >         mmap_read_lock(mm);
> > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > >         if (result != SCAN_SUCCEED) {
> > >                 mmap_read_unlock(mm);
> > >                 goto out_nolock;
> > > @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >          * handled by the anon_vma lock + PG_lock.
> > >          */
> > >         mmap_write_lock(mm);
> > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > >         if (result != SCAN_SUCCEED)
> > >                 goto out_up_write;
> > >         /* check if the pmd is still valid */
> > > @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > >         slot = mm_slot_lookup(mm_slots_hash, mm);
> > >         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> > >         if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> > > +               int i;
> > > +               /*
> > > +                * Multiple callers may be adding entries here.  Do a quick
> > > +                * check to see the entry hasn't already been added by someone
> > > +                * else.
> > > +                */
> > > +               for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
> > > +                       if (mm_slot->pte_mapped_thp[i] == addr)
> > > +                               goto out;
> >
> > I don't quite get why we need this. I supposed just khugepaged could
> > add the addr to the array and MADV_COLLAPSE just handles the pte-mapped
> > hugepage immediately IIRC, right? If so there is actually no change on
> > khugepaged side.
> >
>
> So you're right to say that this change isn't needed.  The "multi-add"
> sequence is:
>
> (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
>     emptying A's ->pte_mapped_thp[] array.
> (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
>     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
>     (at virtual address X) and adds an entry (for X) into mm_struct A's
>     ->pte_mapped_thp[] array.
> (3) khugepaged calls hpage_collapse_scan_file() for mm_struct A at X,
>     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry (for X)
>     into mm_struct A's ->pte_mapped_thp[] array.
>
> Which is somewhat contrived/rare - but it can occur.  If we don't have this,
> the second time we call collapse_pte_mapped_thp() for the same
> mm_struct/address, we should take the "if (result == SCAN_PMD_MAPPED) {...}"
> branch early and return before grabbing any other locks (we already have
> exclusive mmap_lock).  So, perhaps we can drop this check?
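>
> In other words, with the SCAN_PMD_MAPPED fast check in place, reprocessing a
> duplicate entry would be roughly this cheap (just a sketch reusing the names
> from this patch, not additional code I intend to add):
>
>   /* Second time around for the same mm/addr, after the first collapse won. */
>   mmap_write_lock(mm);
>   result = collapse_pte_mapped_thp(mm, addr, false);
>   /*
>    * find_pmd_or_thp_or_none() sees a huge pmd and returns SCAN_PMD_MAPPED,
>    * so we bail out before looking up the page or taking any page locks.
>    */
>   mmap_write_unlock(mm);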
>
> > >                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> > >                 ret = true;
> > >         }
> > > +out:
> > >         spin_unlock(&khugepaged_mm_lock);
> > >         return ret;
> > >  }
> > >
> > > +/* hpage must be locked, and mmap_lock must be held in write */
> > > +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > +                       pmd_t *pmdp, struct page *hpage)
> > > +{
> > > +       struct vm_fault vmf = {
> > > +               .vma = vma,
> > > +               .address = addr,
> > > +               .flags = 0,
> > > +               .pmd = pmdp,
> > > +       };
> > > +
> > > +       VM_BUG_ON(!PageTransHuge(hpage));
> > > +       mmap_assert_write_locked(vma->vm_mm);
> > > +
> > > +       if (do_set_pmd(&vmf, hpage))
> > > +               return SCAN_FAIL;
> > > +
> > > +       get_page(hpage);
> > > +       return SCAN_SUCCEED;
> > > +}
> > > +
> > >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > >                                   unsigned long addr, pmd_t *pmdp)
> > >  {
> > > @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> > >   *
> > >   * @mm: process address space where collapse happens
> > >   * @addr: THP collapse address
> > > + * @install_pmd: If a huge PMD should be installed
> > >   *
> > >   * This function checks whether all the PTEs in the PMD are pointing to the
> > >   * right THP. If so, retract the page table so the THP can refault in with
> > > - * as pmd-mapped.
> > > + * as pmd-mapped. Possibly install a huge PMD mapping the THP.
> > >   */
> > > -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > +                           bool install_pmd)
> > >  {
> > >         unsigned long haddr = addr & HPAGE_PMD_MASK;
> > >         struct vm_area_struct *vma = vma_lookup(mm, haddr);
> > > @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >
> > >         /* Fast check before locking page if already PMD-mapped  */
> > >         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > -       if (result != SCAN_SUCCEED)
> > > -               return;
> > > +       if (result == SCAN_PMD_MAPPED)
> > > +               return result;
> > >
> > >         if (!vma || !vma->vm_file ||
> > >             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> > > -               return;
> > > +               return SCAN_VMA_CHECK;
> > >
> > >         /*
> > >          * If we are here, we've succeeded in replacing all the native pages
> > > @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >          * analogously elide sysfs THP settings here.
> > >          */
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > > -               return;
> > > +               return SCAN_VMA_CHECK;
> > >
> > >         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > >         if (userfaultfd_wp(vma))
> > > -               return;
> > > +               return SCAN_PTE_UFFD_WP;
> > >
> > >         hpage = find_lock_page(vma->vm_file->f_mapping,
> > >                                linear_page_index(vma, haddr));
> > >         if (!hpage)
> > > -               return;
> > > +               return SCAN_PAGE_NULL;
> > >
> > > -       if (!PageHead(hpage))
> > > +       if (!PageHead(hpage)) {
> > > +               result = SCAN_FAIL;
> >
> > I don't think you could trust this must be a HPAGE_PMD_ORDER hugepage
> > anymore since the vma might point to a different file, so a different
> > page cache. And the current kernel does support arbitrary order of
> > large folios for the page cache. [...]
>
> Good catch! Yes, I think we need to double check HPAGE_PMD_ORDER here,
> and that applies equally to khugepaged as well.

BTW, it would be better to have a separate patch to fix this issue as
a prerequisite of this series.

>
> > [...] The below pte traverse may remove rmap for
> > the wrong page IIUC. Khugepaged should experience the same problem as
> > well.
> >
>
> Just to confirm, you mean this is only a danger if we don't check the compound
> order, correct? I.e. if compound_order < HPAGE_PMD_ORDER we'll iterate over
> ptes that map something other than our compound page and erroneously adjust rmap
> for the wrong pages.  So, adding a check for compound_order == HPAGE_PMD_ORDER above
> alleviates this possibility.
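>
> Roughly what I'm picturing (a sketch only - the exact failure code to return
> is still up in the air):
>
>   /* hpage is the locked page returned by find_lock_page() above */
>   if (!PageHead(hpage) || compound_order(hpage) != HPAGE_PMD_ORDER) {
>           result = SCAN_FAIL;
>           goto drop_hpage;
>   }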
>
> > >                 goto drop_hpage;
> > > +       }
> > >
> > > -       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> > > +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > +       switch (result) {
> > > +       case SCAN_SUCCEED:
> > > +               break;
> > > +       case SCAN_PMD_NONE:
> > > +               /*
> > > +                * In MADV_COLLAPSE path, possible race with khugepaged where
> > > +                * all pte entries have been removed and pmd cleared.  If so,
> > > +                * skip all the pte checks and just update the pmd mapping.
> > > +                */
> > > +               goto maybe_install_pmd;
> > > +       default:
> > >                 goto drop_hpage;
> > > +       }
> > >
> > >         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > > +       result = SCAN_FAIL;
> > >
> > >         /* step 1: check all mapped PTEs are to the right huge page */
> > >         for (i = 0, addr = haddr, pte = start_pte;
> > > @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >                         continue;
> > >
> > >                 /* page swapped out, abort */
> > > -               if (!pte_present(*pte))
> > > +               if (!pte_present(*pte)) {
> > > +                       result = SCAN_PTE_NON_PRESENT;
> > >                         goto abort;
> > > +               }
> > >
> > >                 page = vm_normal_page(vma, addr, *pte);
> > >                 if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > > @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > >                 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> > >         }
> > >
> > > -       /* step 4: collapse pmd */
> > > +       /* step 4: remove pte entries */
> >
> > It also collapses and flushes pmd.
> >
>
> True, will update the comment.
>
> Thanks again for your time,
> Zach
>
> > >         collapse_and_free_pmd(mm, vma, haddr, pmd);
> > > +
> > > +maybe_install_pmd:
> > > +       /* step 5: install pmd entry */
> > > +       result = install_pmd
> > > +                       ? set_huge_pmd(vma, haddr, pmd, hpage)
> > > +                       : SCAN_SUCCEED;
> > > +
> > >  drop_hpage:
> > >         unlock_page(hpage);
> > >         put_page(hpage);
> > > -       return;
> > > +       return result;
> > >
> > >  abort:
> > >         pte_unmap_unlock(start_pte, ptl);
> > > @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> > >                 goto out;
> > >
> > >         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> > > -               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
> > > +               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
> > >
> > >  out:
> > >         mm_slot->nr_pte_mapped_thp = 0;
> > >         mmap_write_unlock(mm);
> > >  }
> > >
> > > -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > +                              struct mm_struct *target_mm,
> > > +                              unsigned long target_addr, struct page *hpage,
> > > +                              struct collapse_control *cc)
> > >  {
> > >         struct vm_area_struct *vma;
> > > -       struct mm_struct *mm;
> > > -       unsigned long addr;
> > > -       pmd_t *pmd;
> > > +       int target_result = SCAN_FAIL;
> > >
> > >         i_mmap_lock_write(mapping);
> > >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > +               int result = SCAN_FAIL;
> > > +               struct mm_struct *mm = NULL;
> > > +               unsigned long addr = 0;
> > > +               pmd_t *pmd;
> > > +               bool is_target = false;
> > > +
> > >                 /*
> > >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > >                  * got written to. These VMAs are likely not worth investing
> > > @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >                  * ptl. It has higher chance to recover THP for the VMA, but
> > >                  * has higher cost too.
> > >                  */
> > > -               if (vma->anon_vma)
> > > -                       continue;
> > > +               if (vma->anon_vma) {
> > > +                       result = SCAN_PAGE_ANON;
> > > +                       goto next;
> > > +               }
> > >                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > -               if (addr & ~HPAGE_PMD_MASK)
> > > -                       continue;
> > > -               if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> > > -                       continue;
> > > +               if (addr & ~HPAGE_PMD_MASK ||
> > > +                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> > > +                       result = SCAN_VMA_CHECK;
> > > +                       goto next;
> > > +               }
> > >                 mm = vma->vm_mm;
> > > -               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > > -                       continue;
> > > +               is_target = mm == target_mm && addr == target_addr;
> > > +               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> > > +               if (result != SCAN_SUCCEED)
> > > +                       goto next;
> > >                 /*
> > >                  * We need exclusive mmap_lock to retract page table.
> > >                  *
> > >                  * We use trylock due to lock inversion: we need to acquire
> > >                  * mmap_lock while holding page lock. Fault path does it in
> > >                  * reverse order. Trylock is a way to avoid deadlock.
> > > +                *
> > > +                * Also, it's not MADV_COLLAPSE's job to collapse other
> > > +                * mappings - let khugepaged take care of them later.
> > >                  */
> > > -               if (mmap_write_trylock(mm)) {
> > > +               result = SCAN_PTE_MAPPED_HUGEPAGE;
> > > +               if ((cc->is_khugepaged || is_target) &&
> > > +                   mmap_write_trylock(mm)) {
> > >                         /*
> > >                          * When a vma is registered with uffd-wp, we can't
> > >                          * recycle the pmd pgtable because there can be pte
> > > @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >                          * it'll always mapped in small page size for uffd-wp
> > >                          * registered ranges.
> > >                          */
> > > -                       if (!hpage_collapse_test_exit(mm) &&
> > > -                           !userfaultfd_wp(vma))
> > > -                               collapse_and_free_pmd(mm, vma, addr, pmd);
> > > +                       if (hpage_collapse_test_exit(mm)) {
> > > +                               result = SCAN_ANY_PROCESS;
> > > +                               goto unlock_next;
> > > +                       }
> > > +                       if (userfaultfd_wp(vma)) {
> > > +                               result = SCAN_PTE_UFFD_WP;
> > > +                               goto unlock_next;
> > > +                       }
> > > +                       collapse_and_free_pmd(mm, vma, addr, pmd);
> > > +                       if (!cc->is_khugepaged && is_target)
> > > +                               result = set_huge_pmd(vma, addr, pmd, hpage);
> > > +                       else
> > > +                               result = SCAN_SUCCEED;
> > > +
> > > +unlock_next:
> > >                         mmap_write_unlock(mm);
> > > -               } else {
> > > -                       /* Try again later */
> > > +                       goto next;
> > > +               }
> > > +               /*
> > > +                * Calling context will handle target mm/addr. Otherwise, let
> > > +                * khugepaged try again later.
> > > +                */
> > > +               if (!is_target) {
> > >                         khugepaged_add_pte_mapped_thp(mm, addr);
> > > +                       continue;
> > >                 }
> > > +next:
> > > +               if (is_target)
> > > +                       target_result = result;
> > >         }
> > >         i_mmap_unlock_write(mapping);
> > > +       return target_result;
> > >  }
> > >
> > >  /**
> > >   * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
> > >   *
> > >   * @mm: process address space where collapse happens
> > > + * @addr: virtual collapse start address
> > >   * @file: file that collapse on
> > >   * @start: collapse start address
> > >   * @cc: collapse context and scratchpad
> > > @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >   *    + restore gaps in the page cache;
> > >   *    + unlock and free huge page;
> > >   */
> > > -static int collapse_file(struct mm_struct *mm, struct file *file,
> > > -                        pgoff_t start, struct collapse_control *cc)
> > > +static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > +                        struct file *file, pgoff_t start,
> > > +                        struct collapse_control *cc)
> > >  {
> > >         struct address_space *mapping = file->f_mapping;
> > >         struct page *hpage;
> > > @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > >                 /*
> > >                  * Remove pte page tables, so we can re-fault the page as huge.
> > >                  */
> > > -               retract_page_tables(mapping, start);
> > > +               result = retract_page_tables(mapping, start, mm, addr, hpage,
> > > +                                            cc);
> > >                 unlock_page(hpage);
> > >                 hpage = NULL;
> > >         } else {
> > > @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > >         return result;
> > >  }
> > >
> > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > -                               pgoff_t start, struct collapse_control *cc)
> > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > +                                   struct file *file, pgoff_t start,
> > > +                                   struct collapse_control *cc)
> > >  {
> > >         struct page *page = NULL;
> > >         struct address_space *mapping = file->f_mapping;
> > > @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > >                         result = SCAN_EXCEED_NONE_PTE;
> > >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > >                 } else {
> > > -                       result = collapse_file(mm, file, start, cc);
> > > +                       result = collapse_file(mm, addr, file, start, cc);
> > >                 }
> > >         }
> > >
> > > @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > >         return result;
> > >  }
> > >  #else
> > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > -                               pgoff_t start, struct collapse_control *cc)
> > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > +                                   struct file *file, pgoff_t start,
> > > +                                   struct collapse_control *cc)
> > >  {
> > >         BUILD_BUG();
> > >  }
> > > @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >                                                 khugepaged_scan.address);
> > >
> > >                                 mmap_read_unlock(mm);
> > > -                               *result = khugepaged_scan_file(mm, file, pgoff,
> > > -                                                              cc);
> > > +                               *result = hpage_collapse_scan_file(mm,
> > > +                                                                  khugepaged_scan.address,
> > > +                                                                  file, pgoff, cc);
> > >                                 mmap_locked = false;
> > >                                 fput(file);
> > >                         } else {
> > > @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >
> > >         *prev = vma;
> > >
> > > -       /* TODO: Support file/shmem */
> > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > -               return -EINVAL;
> > > -
> > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > >                 return -EINVAL;
> > >
> > > @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >                 }
> > >                 mmap_assert_locked(mm);
> > >                 memset(cc->node_load, 0, sizeof(cc->node_load));
> > > -               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > > -                                                cc);
> > > +               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> > > +                       struct file *file = get_file(vma->vm_file);
> > > +                       pgoff_t pgoff = linear_page_index(vma, addr);
> > > +
> > > +                       mmap_read_unlock(mm);
> > > +                       mmap_locked = false;
> > > +                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > > +                                                         cc);
> > > +                       fput(file);
> > > +               } else {
> > > +                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> > > +                                                        &mmap_locked, cc);
> > > +               }
> > >                 if (!mmap_locked)
> > >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > >
> > > +handle_result:
> > >                 switch (result) {
> > >                 case SCAN_SUCCEED:
> > >                 case SCAN_PMD_MAPPED:
> > >                         ++thps;
> > >                         break;
> > > +               case SCAN_PTE_MAPPED_HUGEPAGE:
> > > +                       BUG_ON(mmap_locked);
> > > +                       BUG_ON(*prev);
> > > +                       mmap_write_lock(mm);
> > > +                       result = collapse_pte_mapped_thp(mm, addr, true);
> > > +                       mmap_write_unlock(mm);
> > > +                       goto handle_result;
> > >                 /* Whitelisted set of results where continuing OK */
> > >                 case SCAN_PMD_NULL:
> > >                 case SCAN_PTE_NON_PRESENT:
> > > --
> > > 2.37.2.789.g6183377224-goog
> > >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE
  2022-09-19 18:12       ` Yang Shi
@ 2022-09-21 18:26         ` Zach O'Keefe
  0 siblings, 0 replies; 22+ messages in thread
From: Zach O'Keefe @ 2022-09-21 18:26 UTC (permalink / raw)
  To: Yang Shi
  Cc: linux-mm, Andrew Morton, linux-api, Axel Rasmussen,
	James Houghton, Hugh Dickins, Miaohe Lin, David Hildenbrand,
	David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu,
	Rongwei Wang, SeongJae Park, Song Liu, Vlastimil Babka,
	Chris Kennelly, Kirill A. Shutemov, Minchan Kim, Patrick Xia

On Mon, Sep 19, 2022 at 11:12 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Sep 19, 2022 at 8:29 AM Zach O'Keefe <zokeefe@google.com> wrote:
> >
> > On Sep 16 13:38, Yang Shi wrote:
> > > On Wed, Sep 7, 2022 at 7:45 AM Zach O'Keefe <zokeefe@google.com> wrote:
> > > >
> > > > Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
> > > > memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
> > > >
> > > > On success, the backing memory will be a hugepage.  For the memory range
> > > > and process provided, the page tables will synchronously have a huge pmd
> > > > installed, mapping the THP.  Other mappings of the file extent mapped by
> > > > the memory range may be added to a set of entries that khugepaged will
> > > > later process and attempt to update their page tables to map the THP by a pmd.
> > > >
> > > > This functionality unlocks two important uses:
> > > >
> > > > (1)     Immediately back executable text by THPs.  Current support provided
> > > >         by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
> > > >         system, which might keep services from serving at their full rated
> > > >         load after (re)starting.  Tricks like mremap(2)'ing text onto
> > > >         anonymous memory to immediately realize iTLB performance gains prevent
> > > >         page sharing and demand paging, both of which increase steady state
> > > >         memory footprint.  Now, we can have the best of both worlds: Peak
> > > >         upfront performance and lower RAM footprints.
> > > >
> > > > (2)     userfaultfd-based live migration of virtual machines satisfies UFFD
> > > >         faults by fetching native-sized pages over the network (to avoid
> > > >         latency of transferring an entire hugepage).  However, after guest
> > > >         memory has been fully copied to the new host, MADV_COLLAPSE can
> > > >         be used to immediately increase guest performance.
> > > >
> > > > Since khugepaged is single threaded, this change now introduces the
> > > > possibility of collapse contexts racing in the file collapse path.  There
> > > > are a few important places to consider:
> > > >
> > > > (1)     hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
> > > >         We could have the memory collapsed out from under us, but
> > > >         the next xas_for_each() iteration will correctly pick up the
> > > >         hugepage.  The hugepage might not be up to date (insofar as
> > > >         copying of small page contents might not have completed - the
> > > >         page still may be locked), but regardless of what small page index
> > > >         we were iterating over, we'll find the hugepage and identify it
> > > >         as a suitably aligned compound page of order HPAGE_PMD_ORDER.
> > > >
> > > >         In the khugepaged path, we locklessly check the value of the pmd,
> > > >         and only add it to the deferred collapse array if we find the pmd
> > > >         mapping a pte table. This is fine, since other values that could
> > > >         have raced in right afterwards denote failure, or that the
> > > >         memory was successfully collapsed, so we don't need further
> > > >         processing.
> > > >
> > > >         In the madvise path, we'll take mmap_lock in write to serialize
> > > >         against page table updates and will know what to do based on the
> > > >         true value of the pmd: recheck all ptes if we point to a pte table,
> > > >         directly install the pmd if the pmd has been cleared but the
> > > >         memory not yet refaulted, or do nothing at all if we find a huge pmd.
> > > >
> > > >         It's worth putting emphasis on how we treat the none pmd here.
> > > >         If khugepaged has processed this mm's page tables
> > > >         already, it will have left the pmd cleared (ready for refault by
> > > >         the process).  Depending on the VMA flags and sysfs settings, the
> > > >         amount of RAM on the machine, and the current load, this could be a
> > > >         relatively common occurrence - and as such is one we'd like to
> > > >         handle successfully in MADV_COLLAPSE.  When we see the none pmd
> > > >         in collapse_pte_mapped_thp(), we've locked mmap_lock in write
> > > >         and checked (a) hugepage_vma_check() to see if the backing
> > > >         memory is appropriate still, along with VMA sizing and
> > > >         appropriate hugepage alignment within the file, and (b) we've
> > > >         found a hugepage head of order HPAGE_PMD_ORDER at the offset
> > > >         in the file mapped by our hugepage-aligned virtual address.
> > > >         Even though the common case is likely a race with khugepaged,
> > > >         given these checks (regardless how we got here - we could be
> > > >         operating on a completely different file than originally checked
> > > >         in hpage_collapse_scan_file() for all we know) it should be safe
> > > >         to directly make the pmd a huge pmd pointing to this hugepage.
> > > >
> > > > (2)     collapse_file() is mostly serialized on the same file extent by
> > > >         lock sequence:
> > > >
> > > >                 |       lock hugepage
> > > >                 |               lock mapping->i_pages
> > > >                 |                       lock 1st page
> > > >                 |               unlock mapping->i_pages
> > > >                 |                               <page checks>
> > > >                 |               lock mapping->i_pages
> > > >                 |                               page_ref_freeze(3)
> > > >                 |                               xas_store(hugepage)
> > > >                 |               unlock mapping->i_pages
> > > >                 |                               page_ref_unfreeze(1)
> > > >                 |                       unlock 1st page
> > > >                 V       unlock hugepage
> > > >
> > > >         Once a context (who already has their fresh hugepage locked)
> > > >         locks mapping->i_pages exclusively, it will hold said lock
> > > >         until it locks the first page, and it will hold that lock until
> > > >         after the hugepage has been added to the page cache (and
> > > >         will unlock the hugepage after page table update, though that
> > > >         isn't important here).
> > > >
> > > >         A racing context that loses the race for mapping->i_pages will
> > > >         then lose the race to locking the first page.  Here - depending
> > > >         on how far the other racing context has gotten - we might find
> > > >         the new hugepage (in which case we'll exit cleanly when we
> > > >         check PageTransCompound()), or we'll find the "old" 1st small
> > > >         page (in which case we'll exit cleanly when we discover an unexpected
> > > >         refcount of 2 after isolate_lru_page()).  This is assuming we
> > > >         are able to successfully lock the page we find - in shmem path,
> > > >         we could just fail the trylock and exit cleanly anyways.
> > > >
> > > >         Failure path in collapse_file() is similar: once we hold lock
> > > >         on 1st small page, we are serialized against other collapse
> > > >         contexts.  Before the 1st small page is unlocked, we add it
> > > >         back to the pagecache and unfreeze the refcount appropriately.
> > > >         Contexts who lost the race to the 1st small page will then find
> > > >         the same 1st small page with the correct refcount and will be
> > > >         able to proceed.
> > > >
> > > > Signed-off-by: Zach O'Keefe <zokeefe@google.com>
> > > > ---
> > > >  include/linux/khugepaged.h         |  13 +-
> > > >  include/trace/events/huge_memory.h |   1 +
> > > >  kernel/events/uprobes.c            |   2 +-
> > > >  mm/khugepaged.c                    | 238 ++++++++++++++++++++++-------
> > > >  4 files changed, 194 insertions(+), 60 deletions(-)
> > > >
> > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > > index 384f034ae947..70162d707caf 100644
> > > > --- a/include/linux/khugepaged.h
> > > > +++ b/include/linux/khugepaged.h
> > > > @@ -16,11 +16,13 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma,
> > > >                                  unsigned long vm_flags);
> > > >  extern void khugepaged_min_free_kbytes_update(void);
> > > >  #ifdef CONFIG_SHMEM
> > > > -extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> > > > +extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > > +                                  bool install_pmd);
> > > >  #else
> > > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > > -                                          unsigned long addr)
> > > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > > +                                         unsigned long addr, bool install_pmd)
> > > >  {
> > > > +       return 0;
> > > >  }
> > > >  #endif
> > > >
> > > > @@ -46,9 +48,10 @@ static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
> > > >                                         unsigned long vm_flags)
> > > >  {
> > > >  }
> > > > -static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > > > -                                          unsigned long addr)
> > > > +static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
> > > > +                                         unsigned long addr, bool install_pmd)
> > > >  {
> > > > +       return 0;
> > > >  }
> > > >
> > > >  static inline void khugepaged_min_free_kbytes_update(void)
> > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > index fbbb25494d60..df33453b70fc 100644
> > > > --- a/include/trace/events/huge_memory.h
> > > > +++ b/include/trace/events/huge_memory.h
> > > > @@ -11,6 +11,7 @@
> > > >         EM( SCAN_FAIL,                  "failed")                       \
> > > >         EM( SCAN_SUCCEED,               "succeeded")                    \
> > > >         EM( SCAN_PMD_NULL,              "pmd_null")                     \
> > > > +       EM( SCAN_PMD_NONE,              "pmd_none")                     \
> > > >         EM( SCAN_PMD_MAPPED,            "page_pmd_mapped")              \
> > > >         EM( SCAN_EXCEED_NONE_PTE,       "exceed_none_pte")              \
> > > >         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index e0a9b945e7bc..d9e357b7e17c 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -555,7 +555,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > > >
> > > >         /* try collapse pmd for compound page */
> > > >         if (!ret && orig_page_huge)
> > > > -               collapse_pte_mapped_thp(mm, vaddr);
> > > > +               collapse_pte_mapped_thp(mm, vaddr, false);
> > > >
> > > >         return ret;
> > > >  }
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 31ccf49cf279..66457a06b4e7 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -29,6 +29,7 @@ enum scan_result {
> > > >         SCAN_FAIL,
> > > >         SCAN_SUCCEED,
> > > >         SCAN_PMD_NULL,
> > > > +       SCAN_PMD_NONE,
> > > >         SCAN_PMD_MAPPED,
> > > >         SCAN_EXCEED_NONE_PTE,
> > > >         SCAN_EXCEED_SWAP_PTE,
> > > > @@ -838,6 +839,18 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
> > > >                                 cc->is_khugepaged))
> > > >                 return SCAN_VMA_CHECK;
> > > > +       return SCAN_SUCCEED;
> > > > +}
> > > > +
> > > > +static int hugepage_vma_revalidate_anon(struct mm_struct *mm,
> >
> > Hey Yang,
> >
> > Thanks for taking the time to review this series - particularly this patch,
> > which I found tricky.
> >
> > >
> > > Do we really need a new function for anon vma dedicatedly? Can't we
> > > add a parameter to hugepage_vma_revalidate()?
> > >
> >
> > Good point - at some point I think I utilized it more, but you're right that
> > it's overkill now.  Have added an "expect_anon" argument to
> > hugepage_vma_revalidate().  Thanks for the suggestions.
> >
> > > > +                                       unsigned long address,
> > > > +                                       struct vm_area_struct **vmap,
> > > > +                                       struct collapse_control *cc)
> > > > +{
> > > > +       int ret = hugepage_vma_revalidate(mm, address, vmap, cc);
> > > > +
> > > > +       if (ret != SCAN_SUCCEED)
> > > > +               return ret;
> > > >         /*
> > > >          * Anon VMA expected, the address may be unmapped then
> > > >          * remapped to file after khugepaged reaquired the mmap_lock.
> > > > @@ -845,8 +858,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > >          * hugepage_vma_check may return true for qualified file
> > > >          * vmas.
> > > >          */
> > > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > -               return SCAN_VMA_CHECK;
> > > > +       if (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))
> > > > +               return SCAN_PAGE_ANON;
> > > >         return SCAN_SUCCEED;
> > > >  }
> > > >
> > > > @@ -866,8 +879,8 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > >         /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > >         barrier();
> > > >  #endif
> > > > -       if (!pmd_present(pmde))
> > > > -               return SCAN_PMD_NULL;
> > > > +       if (pmd_none(pmde))
> > > > +               return SCAN_PMD_NONE;
> > > >         if (pmd_trans_huge(pmde))
> > > >                 return SCAN_PMD_MAPPED;
> > > >         if (pmd_bad(pmde))
> > > > @@ -995,7 +1008,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > >                 goto out_nolock;
> > > >
> > > >         mmap_read_lock(mm);
> > > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > > >         if (result != SCAN_SUCCEED) {
> > > >                 mmap_read_unlock(mm);
> > > >                 goto out_nolock;
> > > > @@ -1026,7 +1039,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > >          * handled by the anon_vma lock + PG_lock.
> > > >          */
> > > >         mmap_write_lock(mm);
> > > > -       result = hugepage_vma_revalidate(mm, address, &vma, cc);
> > > > +       result = hugepage_vma_revalidate_anon(mm, address, &vma, cc);
> > > >         if (result != SCAN_SUCCEED)
> > > >                 goto out_up_write;
> > > >         /* check if the pmd is still valid */
> > > > @@ -1332,13 +1345,44 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> > > >         slot = mm_slot_lookup(mm_slots_hash, mm);
> > > >         mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
> > > >         if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
> > > > +               int i;
> > > > +               /*
> > > > +                * Multiple callers may be adding entries here.  Do a quick
> > > > +                * check to see the entry hasn't already been added by someone
> > > > +                * else.
> > > > +                */
> > > > +               for (i = 0; i < mm_slot->nr_pte_mapped_thp; ++i)
> > > > +                       if (mm_slot->pte_mapped_thp[i] == addr)
> > > > +                               goto out;
> > >
> > > I don't quite get why we need this. I supposed just khugepaged could
> > > add the addr to the array and MADV_COLLAPSE just handles the pte-mapped
> > > hugepage immediately IIRC, right? If so there is actually no change on
> > > khugepaged side.
> > >
> >
> > So you're right to say that this change isn't needed.  The "multi-add"
> > sequence is:
> >
> > (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
> >     emptying A's ->pte_mapped_thp[] array.
> > (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
> >     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
> >     (at virtual address X) and adds an entry (for X) into mm_struct A's
> >     ->pte_mapped_thp[] array.
> > (3) khugepaged calls hpage_collapse_scan_file() for mm_struct A at X,
> >     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry (for X)
> >     into mm_struct A's ->pte_mapped_thp[] array.
> >
> > Which is somewhat contrived/rare - but it can occur.  If we don't have this,
> > the second time we call collapse_pte_mapped_thp() for the same
> > mm_struct/address, we should take the "if (result == SCAN_PMD_MAPPED) {...}"
> > branch early and return before grabbing any other locks (we already have
> > exclusive mmap_lock).  So, perhaps we can drop this check?
> >
> > > >                 mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
> > > >                 ret = true;
> > > >         }
> > > > +out:
> > > >         spin_unlock(&khugepaged_mm_lock);
> > > >         return ret;
> > > >  }
> > > >
> > > > +/* hpage must be locked, and mmap_lock must be held in write */
> > > > +static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > +                       pmd_t *pmdp, struct page *hpage)
> > > > +{
> > > > +       struct vm_fault vmf = {
> > > > +               .vma = vma,
> > > > +               .address = addr,
> > > > +               .flags = 0,
> > > > +               .pmd = pmdp,
> > > > +       };
> > > > +
> > > > +       VM_BUG_ON(!PageTransHuge(hpage));
> > > > +       mmap_assert_write_locked(vma->vm_mm);
> > > > +
> > > > +       if (do_set_pmd(&vmf, hpage))
> > > > +               return SCAN_FAIL;
> > > > +
> > > > +       get_page(hpage);
> > > > +       return SCAN_SUCCEED;
> > > > +}
> > > > +
> > > >  static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> > > >                                   unsigned long addr, pmd_t *pmdp)
> > > >  {
> > > > @@ -1360,12 +1404,14 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> > > >   *
> > > >   * @mm: process address space where collapse happens
> > > >   * @addr: THP collapse address
> > > > + * @install_pmd: If a huge PMD should be installed
> > > >   *
> > > >   * This function checks whether all the PTEs in the PMD are pointing to the
> > > >   * right THP. If so, retract the page table so the THP can refault in with
> > > > - * as pmd-mapped.
> > > > + * as pmd-mapped. Possibly install a huge PMD mapping the THP.
> > > >   */
> > > > -void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > > +int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > > +                           bool install_pmd)
> > > >  {
> > > >         unsigned long haddr = addr & HPAGE_PMD_MASK;
> > > >         struct vm_area_struct *vma = vma_lookup(mm, haddr);
> > > > @@ -1380,12 +1426,12 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > >
> > > >         /* Fast check before locking page if already PMD-mapped  */
> > > >         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > > -       if (result != SCAN_SUCCEED)
> > > > -               return;
> > > > +       if (result == SCAN_PMD_MAPPED)
> > > > +               return result;
> > > >
> > > >         if (!vma || !vma->vm_file ||
> > > >             !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> > > > -               return;
> > > > +               return SCAN_VMA_CHECK;
> > > >
> > > >         /*
> > > >          * If we are here, we've succeeded in replacing all the native pages
> > > > @@ -1395,24 +1441,39 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > >          * analogously elide sysfs THP settings here.
> > > >          */
> > > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > > > -               return;
> > > > +               return SCAN_VMA_CHECK;
> > > >
> > > >         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > > >         if (userfaultfd_wp(vma))
> > > > -               return;
> > > > +               return SCAN_PTE_UFFD_WP;
> > > >
> > > >         hpage = find_lock_page(vma->vm_file->f_mapping,
> > > >                                linear_page_index(vma, haddr));
> > > >         if (!hpage)
> > > > -               return;
> > > > +               return SCAN_PAGE_NULL;
> > > >
> > > > -       if (!PageHead(hpage))
> > > > +       if (!PageHead(hpage)) {
> > > > +               result = SCAN_FAIL;
> > >
> > > I don't think you could trust this must be a HPAGE_PMD_ORDER hugepage
> > > anymore since the vma might point to a different file, so a different
> > > page cache. And the current kernel does support arbitrary order of
> > > large folios for the page cache. [...]
> >
> > Good catch! Yes, I think we need to double check HPAGE_PMD_ORDER here,
> > and that applies equally to khugepaged as well.
>
> BTW, it would be better to have a separate patch to fix this issue as
> a prerequisite of this series.

Yes, good idea. Will send that patch out hopefully by EOD.

Best,
Zach

> >
> > > [...] The below pte traverse may remove rmap for
> > > the wrong page IIUC. Khugepaged should experience the same problem as
> > > well.
> > >
> >
> > Just to confirm, you mean this is only a danger if we don't check the compound
> > order, correct? I.e. if compound_order < HPAGE_PMD_ORDER we'll iterate over
> > ptes that map something other than our compound page and erroneously adjust rmap
> > for the wrong pages.  So, adding a check for compound_order == HPAGE_PMD_ORDER above
> > alleviates this possibility.
> >
> > > >                 goto drop_hpage;
> > > > +       }
> > > >
> > > > -       if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
> > > > +       result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> > > > +       switch (result) {
> > > > +       case SCAN_SUCCEED:
> > > > +               break;
> > > > +       case SCAN_PMD_NONE:
> > > > +               /*
> > > > +                * In MADV_COLLAPSE path, possible race with khugepaged where
> > > > +                * all pte entries have been removed and pmd cleared.  If so,
> > > > +                * skip all the pte checks and just update the pmd mapping.
> > > > +                */
> > > > +               goto maybe_install_pmd;
> > > > +       default:
> > > >                 goto drop_hpage;
> > > > +       }
> > > >
> > > >         start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > > > +       result = SCAN_FAIL;
> > > >
> > > >         /* step 1: check all mapped PTEs are to the right huge page */
> > > >         for (i = 0, addr = haddr, pte = start_pte;
> > > > @@ -1424,8 +1485,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > >                         continue;
> > > >
> > > >                 /* page swapped out, abort */
> > > > -               if (!pte_present(*pte))
> > > > +               if (!pte_present(*pte)) {
> > > > +                       result = SCAN_PTE_NON_PRESENT;
> > > >                         goto abort;
> > > > +               }
> > > >
> > > >                 page = vm_normal_page(vma, addr, *pte);
> > > >                 if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > > > @@ -1460,12 +1523,19 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> > > >                 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> > > >         }
> > > >
> > > > -       /* step 4: collapse pmd */
> > > > +       /* step 4: remove pte entries */
> > >
> > > It also collapses and flushes the pmd.
> > >
> >
> > True, will update the comment.
> >
> > Thanks again for your time,
> > Zach
> >
> > > >         collapse_and_free_pmd(mm, vma, haddr, pmd);
> > > > +
> > > > +maybe_install_pmd:
> > > > +       /* step 5: install pmd entry */
> > > > +       result = install_pmd
> > > > +                       ? set_huge_pmd(vma, haddr, pmd, hpage)
> > > > +                       : SCAN_SUCCEED;
> > > > +
> > > >  drop_hpage:
> > > >         unlock_page(hpage);
> > > >         put_page(hpage);
> > > > -       return;
> > > > +       return result;
> > > >
> > > >  abort:
> > > >         pte_unmap_unlock(start_pte, ptl);
> > > > @@ -1488,22 +1558,29 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> > > >                 goto out;
> > > >
> > > >         for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
> > > > -               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
> > > > +               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
> > > >
> > > >  out:
> > > >         mm_slot->nr_pte_mapped_thp = 0;
> > > >         mmap_write_unlock(mm);
> > > >  }
> > > >
> > > > -static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > > +static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > > +                              struct mm_struct *target_mm,
> > > > +                              unsigned long target_addr, struct page *hpage,
> > > > +                              struct collapse_control *cc)
> > > >  {
> > > >         struct vm_area_struct *vma;
> > > > -       struct mm_struct *mm;
> > > > -       unsigned long addr;
> > > > -       pmd_t *pmd;
> > > > +       int target_result = SCAN_FAIL;
> > > >
> > > >         i_mmap_lock_write(mapping);
> > > >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > > +               int result = SCAN_FAIL;
> > > > +               struct mm_struct *mm = NULL;
> > > > +               unsigned long addr = 0;
> > > > +               pmd_t *pmd;
> > > > +               bool is_target = false;
> > > > +
> > > >                 /*
> > > >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > > >                  * got written to. These VMAs are likely not worth investing
> > > > @@ -1520,24 +1597,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > >                  * ptl. It has higher chance to recover THP for the VMA, but
> > > >                  * has higher cost too.
> > > >                  */
> > > > -               if (vma->anon_vma)
> > > > -                       continue;
> > > > +               if (vma->anon_vma) {
> > > > +                       result = SCAN_PAGE_ANON;
> > > > +                       goto next;
> > > > +               }
> > > >                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > > -               if (addr & ~HPAGE_PMD_MASK)
> > > > -                       continue;
> > > > -               if (vma->vm_end < addr + HPAGE_PMD_SIZE)
> > > > -                       continue;
> > > > +               if (addr & ~HPAGE_PMD_MASK ||
> > > > +                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> > > > +                       result = SCAN_VMA_CHECK;
> > > > +                       goto next;
> > > > +               }
> > > >                 mm = vma->vm_mm;
> > > > -               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > > > -                       continue;
> > > > +               is_target = mm == target_mm && addr == target_addr;
> > > > +               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> > > > +               if (result != SCAN_SUCCEED)
> > > > +                       goto next;
> > > >                 /*
> > > >                  * We need exclusive mmap_lock to retract page table.
> > > >                  *
> > > >                  * We use trylock due to lock inversion: we need to acquire
> > > >                  * mmap_lock while holding page lock. Fault path does it in
> > > >                  * reverse order. Trylock is a way to avoid deadlock.
> > > > +                *
> > > > +                * Also, it's not MADV_COLLAPSE's job to collapse other
> > > > +                * mappings - let khugepaged take care of them later.
> > > >                  */
> > > > -               if (mmap_write_trylock(mm)) {
> > > > +               result = SCAN_PTE_MAPPED_HUGEPAGE;
> > > > +               if ((cc->is_khugepaged || is_target) &&
> > > > +                   mmap_write_trylock(mm)) {
> > > >                         /*
> > > >                          * When a vma is registered with uffd-wp, we can't
> > > >                          * recycle the pmd pgtable because there can be pte
> > > > @@ -1546,22 +1633,45 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > >                          * it'll always mapped in small page size for uffd-wp
> > > >                          * registered ranges.
> > > >                          */
> > > > -                       if (!hpage_collapse_test_exit(mm) &&
> > > > -                           !userfaultfd_wp(vma))
> > > > -                               collapse_and_free_pmd(mm, vma, addr, pmd);
> > > > +                       if (hpage_collapse_test_exit(mm)) {
> > > > +                               result = SCAN_ANY_PROCESS;
> > > > +                               goto unlock_next;
> > > > +                       }
> > > > +                       if (userfaultfd_wp(vma)) {
> > > > +                               result = SCAN_PTE_UFFD_WP;
> > > > +                               goto unlock_next;
> > > > +                       }
> > > > +                       collapse_and_free_pmd(mm, vma, addr, pmd);
> > > > +                       if (!cc->is_khugepaged && is_target)
> > > > +                               result = set_huge_pmd(vma, addr, pmd, hpage);
> > > > +                       else
> > > > +                               result = SCAN_SUCCEED;
> > > > +
> > > > +unlock_next:
> > > >                         mmap_write_unlock(mm);
> > > > -               } else {
> > > > -                       /* Try again later */
> > > > +                       goto next;
> > > > +               }
> > > > +               /*
> > > > +                * Calling context will handle target mm/addr. Otherwise, let
> > > > +                * khugepaged try again later.
> > > > +                */
> > > > +               if (!is_target) {
> > > >                         khugepaged_add_pte_mapped_thp(mm, addr);
> > > > +                       continue;
> > > >                 }
> > > > +next:
> > > > +               if (is_target)
> > > > +                       target_result = result;
> > > >         }
> > > >         i_mmap_unlock_write(mapping);
> > > > +       return target_result;
> > > >  }
> > > >
> > > >  /**
> > > >   * collapse_file - collapse filemap/tmpfs/shmem pages into huge one.
> > > >   *
> > > >   * @mm: process address space where collapse happens
> > > > + * @addr: virtual collapse start address
> > > >   * @file: file that collapse on
> > > >   * @start: collapse start address
> > > >   * @cc: collapse context and scratchpad
> > > > @@ -1581,8 +1691,9 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > > >   *    + restore gaps in the page cache;
> > > >   *    + unlock and free huge page;
> > > >   */
> > > > -static int collapse_file(struct mm_struct *mm, struct file *file,
> > > > -                        pgoff_t start, struct collapse_control *cc)
> > > > +static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > +                        struct file *file, pgoff_t start,
> > > > +                        struct collapse_control *cc)
> > > >  {
> > > >         struct address_space *mapping = file->f_mapping;
> > > >         struct page *hpage;
> > > > @@ -1890,7 +2001,8 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > > >                 /*
> > > >                  * Remove pte page tables, so we can re-fault the page as huge.
> > > >                  */
> > > > -               retract_page_tables(mapping, start);
> > > > +               result = retract_page_tables(mapping, start, mm, addr, hpage,
> > > > +                                            cc);
> > > >                 unlock_page(hpage);
> > > >                 hpage = NULL;
> > > >         } else {
> > > > @@ -1946,8 +2058,9 @@ static int collapse_file(struct mm_struct *mm, struct file *file,
> > > >         return result;
> > > >  }
> > > >
> > > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > > -                               pgoff_t start, struct collapse_control *cc)
> > > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > > +                                   struct file *file, pgoff_t start,
> > > > +                                   struct collapse_control *cc)
> > > >  {
> > > >         struct page *page = NULL;
> > > >         struct address_space *mapping = file->f_mapping;
> > > > @@ -2035,7 +2148,7 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > >                         result = SCAN_EXCEED_NONE_PTE;
> > > >                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> > > >                 } else {
> > > > -                       result = collapse_file(mm, file, start, cc);
> > > > +                       result = collapse_file(mm, addr, file, start, cc);
> > > >                 }
> > > >         }
> > > >
> > > > @@ -2043,8 +2156,9 @@ static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > >         return result;
> > > >  }
> > > >  #else
> > > > -static int khugepaged_scan_file(struct mm_struct *mm, struct file *file,
> > > > -                               pgoff_t start, struct collapse_control *cc)
> > > > +static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > > > +                                   struct file *file, pgoff_t start,
> > > > +                                   struct collapse_control *cc)
> > > >  {
> > > >         BUILD_BUG();
> > > >  }
> > > > @@ -2142,8 +2256,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > > >                                                 khugepaged_scan.address);
> > > >
> > > >                                 mmap_read_unlock(mm);
> > > > -                               *result = khugepaged_scan_file(mm, file, pgoff,
> > > > -                                                              cc);
> > > > +                               *result = hpage_collapse_scan_file(mm,
> > > > +                                                                  khugepaged_scan.address,
> > > > +                                                                  file, pgoff, cc);
> > > >                                 mmap_locked = false;
> > > >                                 fput(file);
> > > >                         } else {
> > > > @@ -2449,10 +2564,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > >
> > > >         *prev = vma;
> > > >
> > > > -       /* TODO: Support file/shmem */
> > > > -       if (!vma->anon_vma || !vma_is_anonymous(vma))
> > > > -               return -EINVAL;
> > > > -
> > > >         if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> > > >                 return -EINVAL;
> > > >
> > > > @@ -2483,16 +2594,35 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > >                 }
> > > >                 mmap_assert_locked(mm);
> > > >                 memset(cc->node_load, 0, sizeof(cc->node_load));
> > > > -               result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > > > -                                                cc);
> > > > +               if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> > > > +                       struct file *file = get_file(vma->vm_file);
> > > > +                       pgoff_t pgoff = linear_page_index(vma, addr);
> > > > +
> > > > +                       mmap_read_unlock(mm);
> > > > +                       mmap_locked = false;
> > > > +                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > > > +                                                         cc);
> > > > +                       fput(file);
> > > > +               } else {
> > > > +                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> > > > +                                                        &mmap_locked, cc);
> > > > +               }
> > > >                 if (!mmap_locked)
> > > >                         *prev = NULL;  /* Tell caller we dropped mmap_lock */
> > > >
> > > > +handle_result:
> > > >                 switch (result) {
> > > >                 case SCAN_SUCCEED:
> > > >                 case SCAN_PMD_MAPPED:
> > > >                         ++thps;
> > > >                         break;
> > > > +               case SCAN_PTE_MAPPED_HUGEPAGE:
> > > > +                       BUG_ON(mmap_locked);
> > > > +                       BUG_ON(*prev);
> > > > +                       mmap_write_lock(mm);
> > > > +                       result = collapse_pte_mapped_thp(mm, addr, true);
> > > > +                       mmap_write_unlock(mm);
> > > > +                       goto handle_result;
> > > >                 /* Whitelisted set of results where continuing OK */
> > > >                 case SCAN_PMD_NULL:
> > > >                 case SCAN_PTE_NON_PRESENT:
> > > > --
> > > > 2.37.2.789.g6183377224-goog
> > > >
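
A hypothetical userspace sketch (not part of this series, and separate from
the selftests in patches 5-10) of how the new file/shmem path could be
exercised once the series is applied. MADV_COLLAPSE's numeric value and the
2M hugepage size are assumptions here, so verify them against your uapi
headers and system:

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* assumed uapi value; check mman-common.h */
	#endif

	int main(void)
	{
		const size_t hpage = 2UL << 20;	/* assumed PMD-sized hugepage */
		int fd = memfd_create("collapse-demo", 0);	/* shmem-backed file */
		void *area;
		char *p;

		if (fd < 0 || ftruncate(fd, hpage))
			return 1;
		/* Reserve 2 * hpage so we can pick a 2M-aligned start address */
		area = mmap(NULL, 2 * hpage, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (area == MAP_FAILED)
			return 1;
		p = (char *)(((uintptr_t)area + hpage - 1) & ~(hpage - 1));
		p = mmap(p, hpage, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, hpage);	/* populate the shmem page cache */
		if (madvise(p, hpage, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");
		return 0;
	}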



end of thread, other threads:[~2022-09-21 18:26 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07 14:45 [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 01/10] mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() Zach O'Keefe
2022-09-16 17:46   ` Yang Shi
2022-09-16 22:22     ` Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 02/10] mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds Zach O'Keefe
2022-09-16 18:26   ` Yang Shi
2022-09-19 15:36     ` Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 03/10] mm/madvise: add file and shmem support to MADV_COLLAPSE Zach O'Keefe
2022-09-16 20:38   ` Yang Shi
2022-09-19 15:29     ` Zach O'Keefe
2022-09-19 17:54       ` Yang Shi
2022-09-19 18:12       ` Yang Shi
2022-09-21 18:26         ` Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 04/10] mm/khugepaged: add tracepoint to hpage_collapse_scan_file() Zach O'Keefe
2022-09-16 20:41   ` Yang Shi
2022-09-16 23:05     ` Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 05/10] selftests/vm: dedup THP helpers Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 06/10] selftests/vm: modularize thp collapse memory operations Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 07/10] selftests/vm: add thp collapse file and tmpfs testing Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 08/10] selftests/vm: add thp collapse shmem testing Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 09/10] selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd Zach O'Keefe
2022-09-07 14:45 ` [PATCH mm-unstable v3 10/10] selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Zach O'Keefe
