* [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
@ 2022-03-17 23:48 Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE Yang Shi
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel


Changelog
v2: * Collected reviewed-by tags from Miaohe Lin.
    * Fixed build error for patch 4/8.

The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas.  But it is kind of "random luck" for khugepaged to ever see the
readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
since currently the vmas are only registered with khugepaged when:
  - Anon huge pmd page fault
  - VMA merge
  - MADV_HUGEPAGE
  - Shmem mmap

If none of the above conditions is met, khugepaged won't see readonly FS
vmas at all even though it is enabled.  MADV_HUGEPAGE can be specified
explicitly to tell khugepaged to collapse such an area, but when the
khugepaged mode is "always" it should scan all suitable vmas as long as
VM_NOHUGEPAGE is not set.

So make sure readonly FS vmas are registered with khugepaged to make the
behavior more consistent.

Registering the vmas in the mmap path seems preferable from a performance
point of view since the page fault path is definitely a hot path.
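
For a filesystem, the registration added by this series amounts to a
one-line call from its ->mmap hook (patch 8 does exactly this for ext4
and xfs).  A minimal sketch, using a hypothetical "examplefs" in place
of the real filesystems:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/khugepaged.h>

static const struct vm_operations_struct examplefs_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
};

static int examplefs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	file_accessed(file);
	vma->vm_ops = &examplefs_file_vm_ops;

	/* Register the mm with khugepaged if this vma looks suitable */
	khugepaged_enter_file(vma, vma->vm_flags);

	return 0;
}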


Patches 1 ~ 7 are minor bug fixes, cleanups and preparation patches.
Patch 8 converts ext4 and xfs.  We may need to convert more filesystems,
but I'd like to hear some comments before doing that.


Tested with the khugepaged test in selftests and the testcase provided by
Vlastimil Babka in https://lore.kernel.org/lkml/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/
with the MADV_HUGEPAGE call commented out.
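
For reference, a rough userspace sketch of that kind of check (this is
not the exact testcase from the report; it assumes
CONFIG_READ_ONLY_THP_FOR_FS=y and THP set to "always", and the 2MB
mapping size, the sleep interval and the argument handling are
illustration-only assumptions): map an executable file read-only
without calling MADV_HUGEPAGE, give khugepaged time to run, then look
for a non-zero FilePmdMapped entry in smaps.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char line[256];
	volatile char c;
	char *p;
	FILE *f;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY);	/* file should be at least 2MB */
	if (fd < 0)
		return 1;

	p = mmap(NULL, 2UL << 20, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	c = p[0];	/* fault the mapping in so khugepaged has work to do */
	(void)c;
	sleep(60);	/* leave khugepaged time to scan and collapse */

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "FilePmdMapped"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}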


 b/fs/ext4/file.c                 |    4 +++
 b/fs/xfs/xfs_file.c              |    4 +++
 b/include/linux/huge_mm.h        |    9 +++++++
 b/include/linux/khugepaged.h     |   69 +++++++++++++++++++++----------------------------------------
 b/include/linux/sched/coredump.h |    3 +-
 b/kernel/fork.c                  |    4 ---
 b/mm/huge_memory.c               |   15 +++----------
 b/mm/khugepaged.c                |   71 ++++++++++++++++++++++++++++++++++++++++++++-------------------
 b/mm/shmem.c                     |   14 +++---------
 9 files changed, 102 insertions(+), 91 deletions(-)



* [v2 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 2/8] mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED Yang Shi
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

MMF_VM_HUGEPAGE is set by khugepaged_enter() whenever the mm is made
available to khugepaged, not only when VM_HUGEPAGE is set on a vma.
Correct the comment to avoid confusion.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/sched/coredump.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d9e3a656875..4d0a5be28b70 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -57,7 +57,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
-#define MMF_VM_HUGEPAGE		17	/* set when VM_HUGEPAGE is set on vma */
+#define MMF_VM_HUGEPAGE		17	/* set when mm is available for
+					   khugepaged */
 /*
  * This one-shot flag is dropped due to necessity of changing exe once again
  * on NFS restore
-- 
2.26.3



* [v2 PATCH 2/8] mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 3/8] mm: khugepaged: skip DAX vma Yang Shi
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

hugepage_vma_check(), which is called by khugepaged_enter_vma_merge(),
already checks VM_NO_KHUGEPAGED.  Remove the check from the caller and
move the check in hugepage_vma_check() up.

A few more checks may now be run for VM_NO_KHUGEPAGED vmas, but
MADV_HUGEPAGE is definitely not a hot path, so the cleaner code
outweighs that.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..82c71c6da9ce 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -366,8 +366,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 * register it here without waiting a page fault that
 		 * may not happen any time soon.
 		 */
-		if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
-				khugepaged_enter_vma_merge(vma, *vm_flags))
+		if (khugepaged_enter_vma_merge(vma, *vm_flags))
 			return -ENOMEM;
 		break;
 	case MADV_NOHUGEPAGE:
@@ -446,6 +445,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 	if (!transhuge_vma_enabled(vma, vm_flags))
 		return false;
 
+	if (vm_flags & VM_NO_KHUGEPAGED)
+		return false;
+
 	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
 				vma->vm_pgoff, HPAGE_PMD_NR))
 		return false;
@@ -471,7 +473,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 		return false;
 	if (vma_is_temporary_stack(vma))
 		return false;
-	return !(vm_flags & VM_NO_KHUGEPAGED);
+
+	return true;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
-- 
2.26.3



* [v2 PATCH 3/8] mm: khugepaged: skip DAX vma
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 2/8] mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-21 12:04   ` Hyeonggon Yoo
  2022-03-17 23:48 ` [v2 PATCH 4/8] mm: thp: only regular file could be THP eligible Yang Shi
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

A DAX vma may be seen by khugepaged when the mm has other
khugepaged-suitable vmas.  Khugepaged may then try to collapse THP for
the DAX vma, but the attempt will fail the page sanity checks, for
example because the page is not on the LRU.

So it is not harmful, but it is definitely pointless to run khugepaged
against a DAX vma; skip it in the early check.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 82c71c6da9ce..a0e4fa33660e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -448,6 +448,10 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 	if (vm_flags & VM_NO_KHUGEPAGED)
 		return false;
 
+	/* Don't run khugepaged against DAX vma */
+	if (vma_is_dax(vma))
+		return false;
+
 	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
 				vma->vm_pgoff, HPAGE_PMD_NR))
 		return false;
-- 
2.26.3



* [v2 PATCH 4/8] mm: thp: only regular file could be THP eligible
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (2 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 3/8] mm: khugepaged: skip DAX vma Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function Yang Shi
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for
special files"), khugepaged only collapses THP for regular files, which
is the intended use case for readonly FS THP.  Accordingly, only show
regular files as THP eligible.

Also make file_thp_enabled() available to khugepaged in order to remove
duplicate code.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/huge_mm.h | 14 ++++++++++++++
 mm/huge_memory.c        | 11 ++---------
 mm/khugepaged.c         |  9 ++-------
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e4c18ba8d3bf..3cfa79732112 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -172,6 +172,20 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool file_thp_enabled(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+
+	inode = vma->vm_file->f_inode;
+
+	return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
+	       (vma->vm_flags & VM_EXEC) &&
+	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
+}
+
 bool transparent_hugepage_active(struct vm_area_struct *vma);
 
 #define transparent_hugepage_use_zero_page()				\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..a87b3df63209 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -64,13 +64,6 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
-static inline bool file_thp_enabled(struct vm_area_struct *vma)
-{
-	return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file &&
-	       !inode_is_open_for_write(vma->vm_file->f_inode) &&
-	       (vma->vm_flags & VM_EXEC);
-}
-
 bool transparent_hugepage_active(struct vm_area_struct *vma)
 {
 	/* The addr is used to check if the vma size fits */
@@ -82,8 +75,8 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
 		return __transparent_hugepage_enabled(vma);
 	if (vma_is_shmem(vma))
 		return shmem_huge_enabled(vma);
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
-		return file_thp_enabled(vma);
+	if (transhuge_vma_enabled(vma, vma->vm_flags) && file_thp_enabled(vma))
+		return true;
 
 	return false;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a0e4fa33660e..3dbac3e23f43 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -465,13 +465,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 		return false;
 
 	/* Only regular file is valid */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
-	    (vm_flags & VM_EXEC)) {
-		struct inode *inode = vma->vm_file->f_inode;
-
-		return !inode_is_open_for_write(inode) &&
-			S_ISREG(inode->i_mode);
-	}
+	if (file_thp_enabled(vma))
+		return true;
 
 	if (!vma->anon_vma || vma->vm_ops)
 		return false;
-- 
2.26.3



* [v2 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (3 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 4/8] mm: thp: only regular file could be THP eligible Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c Yang Shi
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

Most callers of khugepaged_enter() don't care about the return value.
Only dup_mmap(), the anonymous THP page fault and MADV_HUGEPAGE handle
the error by returning -ENOMEM.  Actually it is not harmful for them to
ignore the error case either.  It also seems overkill to fail fork() or
a page fault early due to a khugepaged_enter() error, and MADV_HUGEPAGE
sets the VM_HUGEPAGE flag regardless of the error anyway.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/khugepaged.h | 30 ++++++++++++------------------
 kernel/fork.c              |  4 +---
 mm/huge_memory.c           |  4 ++--
 mm/khugepaged.c            | 18 +++++++-----------
 4 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 2fcc01891b47..0423d3619f26 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -12,10 +12,10 @@ extern struct attribute_group khugepaged_attr_group;
 extern int khugepaged_init(void);
 extern void khugepaged_destroy(void);
 extern int start_stop_khugepaged(void);
-extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
-extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
-				      unsigned long vm_flags);
+extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+				       unsigned long vm_flags);
 extern void khugepaged_min_free_kbytes_update(void);
 #ifdef CONFIG_SHMEM
 extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
@@ -40,11 +40,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 	(transparent_hugepage_flags &				\
 	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
 
-static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
-		return __khugepaged_enter(mm);
-	return 0;
+		__khugepaged_enter(mm);
 }
 
 static inline void khugepaged_exit(struct mm_struct *mm)
@@ -53,7 +52,7 @@ static inline void khugepaged_exit(struct mm_struct *mm)
 		__khugepaged_exit(mm);
 }
 
-static inline int khugepaged_enter(struct vm_area_struct *vma,
+static inline void khugepaged_enter(struct vm_area_struct *vma,
 				   unsigned long vm_flags)
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
@@ -62,27 +61,22 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
 		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
 		    !(vm_flags & VM_NOHUGEPAGE) &&
 		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-			if (__khugepaged_enter(vma->vm_mm))
-				return -ENOMEM;
-	return 0;
+			__khugepaged_enter(vma->vm_mm);
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	return 0;
 }
 static inline void khugepaged_exit(struct mm_struct *mm)
 {
 }
-static inline int khugepaged_enter(struct vm_area_struct *vma,
-				   unsigned long vm_flags)
+static inline void khugepaged_enter(struct vm_area_struct *vma,
+				    unsigned long vm_flags)
 {
-	return 0;
 }
-static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
-					     unsigned long vm_flags)
+static inline void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+					      unsigned long vm_flags)
 {
-	return 0;
 }
 static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 					   unsigned long addr)
diff --git a/kernel/fork.c b/kernel/fork.c
index a024bf6254df..dc85418c426a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -523,9 +523,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
-	retval = khugepaged_fork(mm, oldmm);
-	if (retval)
-		goto out;
+	khugepaged_fork(mm, oldmm);
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a87b3df63209..ec2490d6af09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,8 +725,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
-	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
-		return VM_FAULT_OOM;
+	khugepaged_enter(vma, vma->vm_flags);
+
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3dbac3e23f43..b87af297e652 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -366,8 +366,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 * register it here without waiting a page fault that
 		 * may not happen any time soon.
 		 */
-		if (khugepaged_enter_vma_merge(vma, *vm_flags))
-			return -ENOMEM;
+		khugepaged_enter_vma_merge(vma, *vm_flags);
 		break;
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
@@ -476,20 +475,20 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 	return true;
 }
 
-int __khugepaged_enter(struct mm_struct *mm)
+void __khugepaged_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
 	int wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
-		return -ENOMEM;
+		return;
 
 	/* __khugepaged_exit() must not run from under us */
 	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
-		return 0;
+		return;
 	}
 
 	spin_lock(&khugepaged_mm_lock);
@@ -505,11 +504,9 @@ int __khugepaged_enter(struct mm_struct *mm)
 	mmgrab(mm);
 	if (wakeup)
 		wake_up_interruptible(&khugepaged_wait);
-
-	return 0;
 }
 
-int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 			       unsigned long vm_flags)
 {
 	unsigned long hstart, hend;
@@ -520,13 +517,12 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	 * file-private shmem THP is not supported.
 	 */
 	if (!hugepage_vma_check(vma, vm_flags))
-		return 0;
+		return;
 
 	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
 	hend = vma->vm_end & HPAGE_PMD_MASK;
 	if (hstart < hend)
-		return khugepaged_enter(vma, vm_flags);
-	return 0;
+		khugepaged_enter(vma, vm_flags);
 }
 
 void __khugepaged_exit(struct mm_struct *mm)
-- 
2.26.3



* [v2 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (4 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 7/8] mm: khugepaged: introduce khugepaged_enter_file() helper Yang Shi
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

This move also makes the following patches easier.  The following
patches will call khugepaged_enter() for regular filesystems to make
readonly FS THP collapse more consistent.  They need some macros defined
in huge_mm.h, for example HPAGE_PMD_*, but it seems undesirable to
pollute filesystem code with unnecessary header file includes.  With
this move the filesystem code just needs to include khugepaged.h, which
is quite small and quite specific, in order to call khugepaged_enter()
to hook the mm up with khugepaged.

The khugepaged_* functions are actually wrappers around some non-inline
functions, so there seems to be little benefit in keeping them inline.

This also helps to reuse hugepage_vma_check() for khugepaged_enter() so
that we can remove some duplicate checks.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/khugepaged.h | 33 ++++++---------------------------
 mm/khugepaged.c            | 20 ++++++++++++++++++++
 2 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 0423d3619f26..54e169116d49 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -16,6 +16,12 @@ extern void __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 				       unsigned long vm_flags);
+extern void khugepaged_fork(struct mm_struct *mm,
+			    struct mm_struct *oldmm);
+extern void khugepaged_exit(struct mm_struct *mm);
+extern void khugepaged_enter(struct vm_area_struct *vma,
+			     unsigned long vm_flags);
+
 extern void khugepaged_min_free_kbytes_update(void);
 #ifdef CONFIG_SHMEM
 extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
@@ -33,36 +39,9 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
 #define khugepaged_always()				\
 	(transparent_hugepage_flags &			\
 	 (1<<TRANSPARENT_HUGEPAGE_FLAG))
-#define khugepaged_req_madv()					\
-	(transparent_hugepage_flags &				\
-	 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
 #define khugepaged_defrag()					\
 	(transparent_hugepage_flags &				\
 	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
-
-static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
-{
-	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
-		__khugepaged_enter(mm);
-}
-
-static inline void khugepaged_exit(struct mm_struct *mm)
-{
-	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
-		__khugepaged_exit(mm);
-}
-
-static inline void khugepaged_enter(struct vm_area_struct *vma,
-				   unsigned long vm_flags)
-{
-	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
-		if ((khugepaged_always() ||
-		     (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) ||
-		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
-		    !(vm_flags & VM_NOHUGEPAGE) &&
-		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-			__khugepaged_enter(vma->vm_mm);
-}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b87af297e652..4cb4379ecf25 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -557,6 +557,26 @@ void __khugepaged_exit(struct mm_struct *mm)
 	}
 }
 
+void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+		__khugepaged_enter(mm);
+}
+
+void khugepaged_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+		__khugepaged_exit(mm);
+}
+
+void khugepaged_enter(struct vm_area_struct *vma, unsigned long vm_flags)
+{
+	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
+	    khugepaged_enabled())
+		if (hugepage_vma_check(vma, vm_flags))
+			__khugepaged_enter(vma->vm_mm);
+}
+
 static void release_pte_page(struct page *page)
 {
 	mod_node_page_state(page_pgdat(page),
-- 
2.26.3



* [v2 PATCH 7/8] mm: khugepaged: introduce khugepaged_enter_file() helper
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (5 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-17 23:48 ` [v2 PATCH 8/8] fs: register suitable readonly vmas for khugepaged Yang Shi
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

The following patch will have filesystem code call khugepaged_enter()
to make readonly FS THP collapse more consistent.  Extract the current
implementation used by shmem into a khugepaged_enter_file() helper so
that it can be reused by other filesystems, and export the symbol for
modules.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/khugepaged.h |  6 ++++++
 mm/khugepaged.c            | 11 +++++++++++
 mm/shmem.c                 | 14 ++++----------
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 54e169116d49..06464e9a1f91 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -21,6 +21,8 @@ extern void khugepaged_fork(struct mm_struct *mm,
 extern void khugepaged_exit(struct mm_struct *mm);
 extern void khugepaged_enter(struct vm_area_struct *vma,
 			     unsigned long vm_flags);
+extern void khugepaged_enter_file(struct vm_area_struct *vma,
+				  unsigned long vm_flags);
 
 extern void khugepaged_min_free_kbytes_update(void);
 #ifdef CONFIG_SHMEM
@@ -53,6 +55,10 @@ static inline void khugepaged_enter(struct vm_area_struct *vma,
 				    unsigned long vm_flags)
 {
 }
+static inline void khugepaged_enter_file(struct vm_area_struct *vma,
+					 unsigned long vm_flags)
+{
+}
 static inline void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 					      unsigned long vm_flags)
 {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4cb4379ecf25..93c9072983e2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -577,6 +577,17 @@ void khugepaged_enter(struct vm_area_struct *vma, unsigned long vm_flags)
 			__khugepaged_enter(vma->vm_mm);
 }
 
+void khugepaged_enter_file(struct vm_area_struct *vma, unsigned long vm_flags)
+{
+	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
+	    khugepaged_enabled() &&
+	    (((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
+	     (vma->vm_end & HPAGE_PMD_MASK)))
+		if (hugepage_vma_check(vma, vm_flags))
+			__khugepaged_enter(vma->vm_mm);
+}
+EXPORT_SYMBOL_GPL(khugepaged_enter_file);
+
 static void release_pte_page(struct page *page)
 {
 	mod_node_page_state(page_pgdat(page),
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..c2346e5d2b24 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2233,11 +2233,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-			((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
-			(vma->vm_end & HPAGE_PMD_MASK)) {
-		khugepaged_enter(vma, vma->vm_flags);
-	}
+
+	khugepaged_enter_file(vma, vma->vm_flags);
+
 	return 0;
 }
 
@@ -4132,11 +4130,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	vma->vm_file = file;
 	vma->vm_ops = &shmem_vm_ops;
 
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-			((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
-			(vma->vm_end & HPAGE_PMD_MASK)) {
-		khugepaged_enter(vma, vma->vm_flags);
-	}
+	khugepaged_enter_file(vma, vma->vm_flags);
 
 	return 0;
 }
-- 
2.26.3



* [v2 PATCH 8/8] fs: register suitable readonly vmas for khugepaged
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (6 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 7/8] mm: khugepaged: introduce khugepaged_enter_file() helper Yang Shi
@ 2022-03-17 23:48 ` Yang Shi
  2022-03-18  1:10 ` [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Song Liu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-17 23:48 UTC (permalink / raw)
  To: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong
  Cc: shy828301, linux-mm, linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas.  But it is kind of "random luck" for khugepaged to ever see the
readonly FS vmas (https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/)
since currently the vmas are only registered with khugepaged when:
  - Anon huge pmd page fault
  - VMA merge
  - MADV_HUGEPAGE
  - Shmem mmap

If none of the above conditions is met, khugepaged won't see readonly FS
vmas at all even though it is enabled.  MADV_HUGEPAGE can be specified
explicitly to tell khugepaged to collapse such an area, but when the
khugepaged mode is "always" it should scan all suitable vmas as long as
VM_NOHUGEPAGE is not set.

So make sure readonly FS vmas are registered with khugepaged to make the
behavior more consistent.

Registering the vmas in the mmap path seems preferable from a performance
point of view since the page fault path is definitely a hot path.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 fs/ext4/file.c    | 4 ++++
 fs/xfs/xfs_file.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8cc11715518a..b894cd5aff44 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -30,6 +30,7 @@
 #include <linux/uio.h>
 #include <linux/mman.h>
 #include <linux/backing-dev.h>
+#include <linux/khugepaged.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -782,6 +783,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
+
+	khugepaged_enter_file(vma, vma->vm_flags);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5bddb1e9e0b3..d94144b1fb0f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -30,6 +30,7 @@
 #include <linux/mman.h>
 #include <linux/fadvise.h>
 #include <linux/mount.h>
+#include <linux/khugepaged.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1407,6 +1408,9 @@ xfs_file_mmap(
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(inode))
 		vma->vm_flags |= VM_HUGEPAGE;
+
+	khugepaged_enter_file(vma, vma->vm_flags);
+
 	return 0;
 }
 
-- 
2.26.3



* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (7 preceding siblings ...)
  2022-03-17 23:48 ` [v2 PATCH 8/8] fs: register suitable readonly vmas for khugepaged Yang Shi
@ 2022-03-18  1:10 ` Song Liu
  2022-03-18  1:29 ` Dave Chinner
  2022-03-24  1:47 ` Theodore Ts'o
  10 siblings, 0 replies; 20+ messages in thread
From: Song Liu @ 2022-03-18  1:10 UTC (permalink / raw)
  To: Yang Shi
  Cc: Vlastimil Babka, Kirill A. Shutemov, linmiaohe, Rik van Riel,
	Matthew Wilcox, Zi Yan, Andrew Morton, tytso, adilger.kernel,
	darrick.wong, Linux Memory Management List, Linux-Fsdevel,
	linux-ext4, linux-xfs, linux-kernel



> On Mar 17, 2022, at 4:48 PM, Yang Shi <shy828301@gmail.com> wrote:
> 
> 
> Changelog
> v2: * Collected reviewed-by tags from Miaohe Lin.
>    * Fixed build error for patch 4/8.
> 
> The readonly FS THP relies on khugepaged to collapse THP for suitable
> vmas.  But it is kind of "random luck" for khugepaged to see the
> readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
>  - Anon huge pmd page fault
>  - VMA merge
>  - MADV_HUGEPAGE
>  - Shmem mmap
> 
> If the above conditions are not met, even though khugepaged is enabled
> it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
> explicitly to tell khugepaged to collapse this area, but when khugepaged
> mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
> is not set.
> 
> So make sure readonly FS vmas are registered to khugepaged to make the
> behavior more consistent.
> 
> Registering the vmas in mmap path seems more preferred from performance
> point of view since page fault path is definitely hot path.
> 
> 
> The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> but I'd like to hear some comments before doing that.
> 
> 
> Tested with khugepaged test in selftests and the testcase provided by
> Vlastimil Babka in https://lore.kernel.org/lkml/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/
> by commenting out MADV_HUGEPAGE call.

LGTM. For the series:

Acked-by: Song Liu <song@kernel.org>

> 
> 
> b/fs/ext4/file.c                 |    4 +++
> b/fs/xfs/xfs_file.c              |    4 +++
> b/include/linux/huge_mm.h        |    9 +++++++
> b/include/linux/khugepaged.h     |   69 +++++++++++++++++++++----------------------------------------
> b/include/linux/sched/coredump.h |    3 +-
> b/kernel/fork.c                  |    4 ---
> b/mm/huge_memory.c               |   15 +++----------
> b/mm/khugepaged.c                |   71 ++++++++++++++++++++++++++++++++++++++++++++-------------------
> b/mm/shmem.c                     |   14 +++---------
> 9 files changed, 102 insertions(+), 91 deletions(-)
> 



* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (8 preceding siblings ...)
  2022-03-18  1:10 ` [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Song Liu
@ 2022-03-18  1:29 ` Dave Chinner
  2022-03-18  3:38   ` Matthew Wilcox
  2022-03-18 17:31   ` Yang Shi
  2022-03-24  1:47 ` Theodore Ts'o
  10 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2022-03-18  1:29 UTC (permalink / raw)
  To: Yang Shi
  Cc: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> 
> Changelog
> v2: * Collected reviewed-by tags from Miaohe Lin.
>     * Fixed build error for patch 4/8.
> 
> The readonly FS THP relies on khugepaged to collapse THP for suitable
> vmas.  But it is kind of "random luck" for khugepaged to see the
> readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
>   - Anon huge pmd page fault
>   - VMA merge
>   - MADV_HUGEPAGE
>   - Shmem mmap
> 
> If the above conditions are not met, even though khugepaged is enabled
> it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
> explicitly to tell khugepaged to collapse this area, but when khugepaged
> mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
> is not set.
> 
> So make sure readonly FS vmas are registered to khugepaged to make the
> behavior more consistent.
> 
> Registering the vmas in mmap path seems more preferred from performance
> point of view since page fault path is definitely hot path.
> 
> 
> The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> but I'd like to hear some comments before doing that.

After reading through the patchset, I have no idea what this is even
doing or enabling. I can't comment on the last patch and its effect
on XFS because there's no high level explanation of the
functionality or feature to provide me with the context in which I
should be reviewing this patchset.

I understand this has something to do with hugepages, but there's no
explanation of exactly where huge pages are going to be used in the
filesystem, what the problems with khugepaged and filesystems are
that this apparently solves, what constraints it places on
filesystems to enable huge pages to be used, etc.

I'm guessing that the result is that we'll suddenly see huge pages
in the page cache for some undefined set of files in some undefined
set of workloads. But that doesn't help me understand any of the
impacts it may have. e.g:

- how does this relate to the folio conversion and use of large
  pages in the page cache?
- why do we want two completely separate large page mechanisms in
  the page cache?
- why is this limited to "read only VMAs" and how does the
  filesystem actually ensure that the VMAs are read only?
- what happens if we have a file that huge pages mapped into the
  page cache via read only VMAs then has write() called on it via a
  different file descriptor and so we need to dirty the page cache
  that has huge pages in it?

I've got a lot more questions, but to save me having to ask them,
how about you explain what this new functionality actually does, why
we need to support it, and why it is better than the fully writeable
huge page support via folios that we already have in the works...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-18  1:29 ` Dave Chinner
@ 2022-03-18  3:38   ` Matthew Wilcox
  2022-03-18 18:04     ` Yang Shi
  2022-03-18 17:31   ` Yang Shi
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-03-18  3:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yang Shi, vbabka, kirill.shutemov, linmiaohe, songliubraving,
	riel, ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Fri, Mar 18, 2022 at 12:29:48PM +1100, Dave Chinner wrote:
> On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> > 
> > Changelog
> > v2: * Collected reviewed-by tags from Miaohe Lin.
> >     * Fixed build error for patch 4/8.
> > 
> > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > vmas.  But it is kind of "random luck" for khugepaged to see the
> > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> > 
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when khugepaged
> > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
> > is not set.
> > 
> > So make sure readonly FS vmas are registered to khugepaged to make the
> > behavior more consistent.
> > 
> > Registering the vmas in mmap path seems more preferred from performance
> > point of view since page fault path is definitely hot path.
> > 
> > 
> > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> > The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> > but I'd like to hear some comments before doing that.
> 
> After reading through the patchset, I have no idea what this is even
> doing or enabling. I can't comment on the last patch and it's effect
> on XFS because there's no high level explanation of the
> functionality or feature to provide me with the context in which I
> should be reviewing this patchset.
> 
> I understand this has something to do with hugepages, but there's no
> explaination of exactly where huge pages are going to be used in the
> filesystem, what the problems with khugepaged and filesystems are
> that this apparently solves, what constraints it places on
> filesystems to enable huge pages to be used, etc.
> 
> I'm guessing that the result is that we'll suddenly see huge pages
> in the page cache for some undefined set of files in some undefined
> set of workloads. But that doesn't help me understand any of the
> impacts it may have. e.g:
> 
> - how does this relate to the folio conversion and use of large
>   pages in the page cache?
> - why do we want two completely separate large page mechanisms in
>   the page cache?
> - why is this limited to "read only VMAs" and how does the
>   filesystem actually ensure that the VMAs are read only?
> - what happens if we have a file that huge pages mapped into the
>   page cache via read only VMAs then has write() called on it via a
>   different file descriptor and so we need to dirty the page cache
>   that has huge pages in it?
> 
> I've got a lot more questions, but to save me having to ask them,
> how about you explain what this new functionality actually does, why
> we need to support it, and why it is better than the fully writeable
> huge page support via folios that we already have in the works...

Back in Puerto Rico when we set up the THP Cabal, we had two competing
approaches for using larger pages in the page cache; mine (which turned
into folios after I realised that THPs were the wrong model) and Song
Liu's CONFIG_READ_ONLY_THP_FOR_FS.  Song's patches were ready earlier
(2019) and were helpful in unveiling some of the problems which needed
to be fixed.  The filesystem never sees the large pages because they're
only used for read-only files, and the pages are already Uptodate at
the point they're collapsed into a THP.  So there's no changes needed
to the filesystem.

This collection of patches I'm agnostic about.  As far as I can
tell, they're a way to improve how often the ROTHP feature gets used.
That doesn't really interest me since we're so close to having proper
support for large pages/folios in filesystems.  So I'm not particularly
interested in improving a feature that we're about to delete.  But I also
don't like it that the filesystem now has to do something; the ROTHP
feature is supposed to be completely transparent from the point of view
of the filesystem.



* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-18  1:29 ` Dave Chinner
  2022-03-18  3:38   ` Matthew Wilcox
@ 2022-03-18 17:31   ` Yang Shi
  1 sibling, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-18 17:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Thu, Mar 17, 2022 at 6:29 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> >
> > Changelog
> > v2: * Collected reviewed-by tags from Miaohe Lin.
> >     * Fixed build error for patch 4/8.
> >
> > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > vmas.  But it is kind of "random luck" for khugepaged to see the
> > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
> >   - Anon huge pmd page fault
> >   - VMA merge
> >   - MADV_HUGEPAGE
> >   - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when khugepaged
> > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
> > is not set.
> >
> > So make sure readonly FS vmas are registered to khugepaged to make the
> > behavior more consistent.
> >
> > Registering the vmas in mmap path seems more preferred from performance
> > point of view since page fault path is definitely hot path.
> >
> >
> > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> > The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> > but I'd like to hear some comments before doing that.
>
> After reading through the patchset, I have no idea what this is even
> doing or enabling. I can't comment on the last patch and it's effect
> on XFS because there's no high level explanation of the
> functionality or feature to provide me with the context in which I
> should be reviewing this patchset.
>
> I understand this has something to do with hugepages, but there's no
> explaination of exactly where huge pages are going to be used in the
> filesystem, what the problems with khugepaged and filesystems are
> that this apparently solves, what constraints it places on
> filesystems to enable huge pages to be used, etc.
>
> I'm guessing that the result is that we'll suddenly see huge pages
> in the page cache for some undefined set of files in some undefined
> set of workloads. But that doesn't help me understand any of the
> impacts it may have. e.g:

Song introduced READ_ONLY_THP_FOR_FS back in 2019. It collapses huge
pages for readonly executable file mappings to speed up programs with a
huge text section. The huge page is allocated/collapsed by khugepaged
instead of in the page fault path.

Vlastimil reported that the huge pages are not collapsed consistently,
since the suitable MMs are not registered with khugepaged consistently,
as the commit log elaborates. So this patchset makes the behavior of
khugepaged (for collapsing readonly file THP) more consistent.

>
> - how does this relate to the folio conversion and use of large
>   pages in the page cache?
> - why do we want two completely separate large page mechanisms in
>   the page cache?

It has nothing to do with the folio conversion. But once file THP/huge
pages are fully supported, READ_ONLY_THP_FOR_FS may be deprecated.
However, making khugepaged collapse file THP more consistently is
applicable to full file huge page support as well, as long as we still
use khugepaged to collapse THP.

> - why is this limited to "read only VMAs" and how does the
>   filesystem actually ensure that the VMAs are read only?

It uses the below check:

(IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
               (vma->vm_flags & VM_EXEC) &&
               !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);

This condition was introduced by READ_ONLY_THP_FOR_FS in the first
place, not this patchset.

> - what happens if we have a file that huge pages mapped into the
>   page cache via read only VMAs then has write() called on it via a
>   different file descriptor and so we need to dirty the page cache
>   that has huge pages in it?

Once someone else opens the fd in write mode, the THP will be
truncated and khugepaged will back off IIUC.

>
> I've got a lot more questions, but to save me having to ask them,
> how about you explain what this new functionality actually does, why
> we need to support it, and why it is better than the fully writeable
> huge page support via folios that we already have in the works...
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com


* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-18  3:38   ` Matthew Wilcox
@ 2022-03-18 18:04     ` Yang Shi
  2022-03-18 18:48       ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-03-18 18:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, vbabka, kirill.shutemov, linmiaohe, songliubraving,
	riel, ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Thu, Mar 17, 2022 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Mar 18, 2022 at 12:29:48PM +1100, Dave Chinner wrote:
> > On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> > >
> > > Changelog
> > > v2: * Collected reviewed-by tags from Miaohe Lin.
> > >     * Fixed build error for patch 4/8.
> > >
> > > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > > vmas.  But it is kind of "random luck" for khugepaged to see the
> > > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/) since currently the vmas are registered to khugepaged when:
> > >   - Anon huge pmd page fault
> > >   - VMA merge
> > >   - MADV_HUGEPAGE
> > >   - Shmem mmap
> > >
> > > If the above conditions are not met, even though khugepaged is enabled
> > > it won't see readonly FS vmas at all.  MADV_HUGEPAGE could be specified
> > > explicitly to tell khugepaged to collapse this area, but when khugepaged
> > > mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
> > > is not set.
> > >
> > > So make sure readonly FS vmas are registered to khugepaged to make the
> > > behavior more consistent.
> > >
> > > Registering the vmas in mmap path seems more preferred from performance
> > > point of view since page fault path is definitely hot path.
> > >
> > >
> > > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> > > The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> > > but I'd like to hear some comments before doing that.
> >
> > After reading through the patchset, I have no idea what this is even
> > doing or enabling. I can't comment on the last patch and it's effect
> > on XFS because there's no high level explanation of the
> > functionality or feature to provide me with the context in which I
> > should be reviewing this patchset.
> >
> > I understand this has something to do with hugepages, but there's no
> > explaination of exactly where huge pages are going to be used in the
> > filesystem, what the problems with khugepaged and filesystems are
> > that this apparently solves, what constraints it places on
> > filesystems to enable huge pages to be used, etc.
> >
> > I'm guessing that the result is that we'll suddenly see huge pages
> > in the page cache for some undefined set of files in some undefined
> > set of workloads. But that doesn't help me understand any of the
> > impacts it may have. e.g:
> >
> > - how does this relate to the folio conversion and use of large
> >   pages in the page cache?
> > - why do we want two completely separate large page mechanisms in
> >   the page cache?
> > - why is this limited to "read only VMAs" and how does the
> >   filesystem actually ensure that the VMAs are read only?
> > - what happens if we have a file that huge pages mapped into the
> >   page cache via read only VMAs then has write() called on it via a
> >   different file descriptor and so we need to dirty the page cache
> >   that has huge pages in it?
> >
> > I've got a lot more questions, but to save me having to ask them,
> > how about you explain what this new functionality actually does, why
> > we need to support it, and why it is better than the fully writeable
> > huge page support via folios that we already have in the works...
>
> Back in Puerto Rico when we set up the THP Cabal, we had two competing
> approaches for using larger pages in the page cache; mine (which turned
> into folios after I realised that THPs were the wrong model) and Song
> Liu's CONFIG_READ_ONLY_THP_FOR_FS.  Song's patches were ready earlier
> (2019) and were helpful in unveiling some of the problems which needed
> to be fixed.  The filesystem never sees the large pages because they're
> only used for read-only files, and the pages are already Uptodate at
> the point they're collapsed into a THP.  So there's no changes needed
> to the filesystem.
>
> This collection of patches I'm agnostic about.  As far as I can
> tell, they're a way to improve how often the ROTHP feature gets used.
> That doesn't really interest me since we're so close to having proper
> support for large pages/folios in filesystems.  So I'm not particularly
> interested in improving a feature that we're about to delete.  But I also
> don't like it that the filesystem now has to do something; the ROTHP
> feature is supposed to be completely transparent from the point of view
> of the filesystem.

I agree once page cache huge page is fully supported,
READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset
makes khugepaged collapse file THP more consistently. It guarantees
the THP could be collapsed as long as file THP is supported and
configured properly and there is suitable file vmas, it is not
guaranteed by the current code. So it should be useful even though
READ_ONLY_THP_FOR_FS is gone IMHO.

>


* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-18 18:04     ` Yang Shi
@ 2022-03-18 18:48       ` Matthew Wilcox
  2022-03-18 20:19         ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-03-18 18:48 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Chinner, vbabka, kirill.shutemov, linmiaohe, songliubraving,
	riel, ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Fri, Mar 18, 2022 at 11:04:29AM -0700, Yang Shi wrote:
> I agree once page cache huge page is fully supported,
> READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset
> makes khugepaged collapse file THP more consistently. It guarantees
> the THP could be collapsed as long as file THP is supported and
> configured properly and there is suitable file vmas, it is not
> guaranteed by the current code. So it should be useful even though
> READ_ONLY_THP_FOR_FS is gone IMHO.

I don't know if it's a good thing or not.  Experiments with 64k
PAGE_SIZE on arm64 shows some benchmarks improving and others regressing.
Just because we _can_ collapse a 2MB range of pages into a single 2MB
page doesn't mean we _should_.  I suspect the right size folio for any
given file will depend on the access pattern.  For example, dirtying a
few bytes in a folio will result in the entire folio being written back.
Is that what you want?  Maybe!  It may prompt the filesystem to defragment
that range, which would be good.  On the other hand, if you're bandwidth
limited, it may decrease your performance.  And if your media has limited
write endurance, it may result in your drive wearing out more quickly.

Changing the heuristics should come with data.  Preferably from a wide
range of systems and use cases.  I know that's hard to do, but how else
can we proceed?

And I think you ignored my point that READ_ONLY_THP_FOR_FS required
no changes to filesystems.  It was completely invisible to them, by
design.  Now this patchset requires each filesystem to do something.
That's not a great step.

P.S. khugepaged currently does nothing if a range contains a compound
page.  It assumes that the page is compound because it's now a THP.
Large folios break that assumption, so khugepaged will now never
collapse a range which includes large folios.  Thanks to commit
    mm/filemap: Support VM_HUGEPAGE for file mappings
we'll always try to bring in PMD-sized pages for MADV_HUGEPAGE, so
it _probably_ doesn't matter.  But it's something we should watch
for as filesystems grow support for large folios.


* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-18 18:48       ` Matthew Wilcox
@ 2022-03-18 20:19         ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-18 20:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, vbabka, kirill.shutemov, linmiaohe, songliubraving,
	riel, ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Fri, Mar 18, 2022 at 11:48 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Mar 18, 2022 at 11:04:29AM -0700, Yang Shi wrote:
> > I agree once page cache huge page is fully supported,
> > READ_ONLY_THP_FOR_FS could be deprecated. But actually this patchset
> > makes khugepaged collapse file THP more consistently. It guarantees
> > the THP could be collapsed as long as file THP is supported and
> > configured properly and there are suitable file vmas, which is not
> > guaranteed by the current code. So it should be useful even though
> > READ_ONLY_THP_FOR_FS is gone IMHO.
>
> I don't know if it's a good thing or not.  Experiments with 64k
> PAGE_SIZE on arm64 shows some benchmarks improving and others regressing.
> Just because we _can_ collapse a 2MB range of pages into a single 2MB
> page doesn't mean we _should_.  I suspect the right size folio for any
> given file will depend on the access pattern.  For example, dirtying a
> few bytes in a folio will result in the entire folio being written back.
> Is that what you want?  Maybe!  It may prompt the filesystem to defragment
> that range, which would be good.  On the other hand, if you're bandwidth
> limited, it may decrease your performance.  And if your media has limited
> write endurance, it may result in your drive wearing out more quickly.
>
> Changing the heuristics should come with data.  Preferably from a wide
> range of systems and use cases.  I know that's hard to do, but how else
> can we proceed?

TBH I don't think this belongs to "changing the heuristics". It is the
users' decision whether their workloads could benefit from huge pages
or not; they can set THP to always/madvise/never per their workloads.
The patchset is aimed at fixing the misbehavior. The user-visible
issue is that even though users enable READ_ONLY_THP_FOR_FS, configure
THP to "always" (so khugepaged always runs) and expect their huge text
section to be backed by THP, the THP may not be collapsed.

>
> And I think you ignored my point that READ_ONLY_THP_FOR_FS required
> no changes to filesystems.  It was completely invisible to them, by
> design.  Now this patchset requires each filesystem to do something.
> That's not a great step.

I don't mean to ignore your point. I do understand it is not perfect.
I was thinking about making it FS agnostic in the first place. But I
didn't think of a perfect way to do it at that time, so I followed
what tmpfs does.

However, rethinking this, we may be able to call
khugepaged_enter_file() in filemap_fault(). I was concerned about the
overhead in the page fault path, but it may be negligible since
khugepaged_enter_file() bails out early if the mm is already
registered with khugepaged; only the first page fault needs to go
through all the checks, and the first fault is typically a major fault
so its overhead should not be noticeable compared to the overhead of
I/O. Calling khugepaged_enter() in the page fault path is the approach
used by anonymous THP too.
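
A minimal sketch of that idea (not part of this series; it assumes the
khugepaged_enter_file() helper from patch 7/8 takes the vma and its
flags, which may differ from the final signature):

	/* mm/filemap.c */
	vm_fault_t filemap_fault(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		...
		/*
		 * Hypothetical: register the vma with khugepaged on the
		 * first fault; the helper returns early if the mm is
		 * already registered, so later faults pay almost nothing.
		 */
		khugepaged_enter_file(vma, vma->vm_flags);
		...
	}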

>
> P.S. khugepaged currently does nothing if a range contains a compound
> page.  It assumes that the page is compound because it's now a THP.
> Large folios break that assumption, so khugepaged will now never
> collapse a range which includes large folios.  Thanks to commit
>     mm/filemap: Support VM_HUGEPAGE for file mappings
> we'll always try to bring in PMD-sized pages for MADV_HUGEPAGE, so
> it _probably_ doesn't matter.  But it's something we should watch
> for as filesystems grow support for large folios.

Yeah, I agree, thanks for the reminder. In addition, I think the users
of READ_ONLY_THP_FOR_FS would still expect PMD-sized THP to be
collapsed for their use case even with full page cache THP support,
since their benefit comes from reduced TLB misses.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [v2 PATCH 3/8] mm: khugepaged: skip DAX vma
  2022-03-17 23:48 ` [v2 PATCH 3/8] mm: khugepaged: skip DAX vma Yang Shi
@ 2022-03-21 12:04   ` Hyeonggon Yoo
  2022-03-21 20:59     ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Hyeonggon Yoo @ 2022-03-21 12:04 UTC (permalink / raw)
  To: Yang Shi
  Cc: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, tytso, adilger.kernel, darrick.wong, linux-mm,
	linux-fsdevel, linux-ext4, linux-xfs, linux-kernel

On Thu, Mar 17, 2022 at 04:48:22PM -0700, Yang Shi wrote:
> The DAX vma may be seen by khugepaged when the mm has other khugepaged
> suitable vmas.  So khugepaged may try to collapse THP for DAX vma, but
> it will fail due to page sanity check, for example, page is not
> on LRU.
> 
> So it is not harmful, but it is definitely pointless to run khugepaged
> against DAX vma, so skip it in early check.

in fs/xfs/xfs_file.c:
1391 STATIC int
1392 xfs_file_mmap(
1393         struct file             *file,
1394         struct vm_area_struct   *vma)
1395 {
1396         struct inode            *inode = file_inode(file);
1397         struct xfs_buftarg      *target = xfs_inode_buftarg(XFS_I(inode));
1398 
1399         /*
1400          * We don't support synchronous mappings for non-DAX files and
1401          * for DAX files if underneath dax_device is not synchronous.
1402          */
1403         if (!daxdev_mapping_supported(vma, target->bt_daxdev))
1404                 return -EOPNOTSUPP;
1405 
1406         file_accessed(file);
1407         vma->vm_ops = &xfs_file_vm_ops;
1408         if (IS_DAX(inode))
1409                 vma->vm_flags |= VM_HUGEPAGE;

Are xfs and other filesystems setting the VM_HUGEPAGE flag even if it
can never be collapsed?

1410         return 0;
1411 }


> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/khugepaged.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 82c71c6da9ce..a0e4fa33660e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -448,6 +448,10 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>  	if (vm_flags & VM_NO_KHUGEPAGED)
>  		return false;
>  
> +	/* Don't run khugepaged against DAX vma */
> +	if (vma_is_dax(vma))
> +		return false;
> +
>  	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
>  				vma->vm_pgoff, HPAGE_PMD_NR))
>  		return false;
> -- 
> 2.26.3
> 
> 

-- 
Thank you, You are awesome!
Hyeonggon :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [v2 PATCH 3/8] mm: khugepaged: skip DAX vma
  2022-03-21 12:04   ` Hyeonggon Yoo
@ 2022-03-21 20:59     ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-21 20:59 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Vlastimil Babka, Kirill A. Shutemov, Miaohe Lin, Song Liu,
	Rik van Riel, Matthew Wilcox, Zi Yan, Andrew Morton,
	Theodore Ts'o, Andreas Dilger, darrick.wong, Linux MM,
	Linux FS-devel Mailing List, linux-ext4, linux-xfs,
	Linux Kernel Mailing List

On Mon, Mar 21, 2022 at 5:04 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Thu, Mar 17, 2022 at 04:48:22PM -0700, Yang Shi wrote:
> > The DAX vma may be seen by khugepaged when the mm has other khugepaged
> > suitable vmas.  So khugepaged may try to collapse THP for DAX vma, but
> > it will fail due to page sanity check, for example, page is not
> > on LRU.
> >
> > So it is not harmful, but it is definitely pointless to run khugepaged
> > against DAX vma, so skip it in early check.
>
> in fs/xfs/xfs_file.c:
> 1391 STATIC int
> 1392 xfs_file_mmap(
> 1393         struct file             *file,
> 1394         struct vm_area_struct   *vma)
> 1395 {
> 1396         struct inode            *inode = file_inode(file);
> 1397         struct xfs_buftarg      *target = xfs_inode_buftarg(XFS_I(inode));
> 1398
> 1399         /*
> 1400          * We don't support synchronous mappings for non-DAX files and
> 1401          * for DAX files if underneath dax_device is not synchronous.
> 1402          */
> 1403         if (!daxdev_mapping_supported(vma, target->bt_daxdev))
> 1404                 return -EOPNOTSUPP;
> 1405
> 1406         file_accessed(file);
> 1407         vma->vm_ops = &xfs_file_vm_ops;
> 1408         if (IS_DAX(inode))
> 1409                 vma->vm_flags |= VM_HUGEPAGE;
>
> Are xfs and other filesystems setting VM_HUGEPAGE flag even if it can
> never be collapsed?

DAX is available on, and intended for, some special devices, for
example persistent memory. Collapsing huge pages on such devices is
not the intended use case of khugepaged.
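
(For context: DAX mappings become huge at fault time rather than via
collapse; the vma's ->huge_fault handler services PMD-sized faults
directly, so there is nothing left for khugepaged to do.  An
approximate excerpt from fs/xfs/xfs_file.c of that era:

	static const struct vm_operations_struct xfs_file_vm_ops = {
		.fault		= xfs_filemap_fault,
		.huge_fault	= xfs_filemap_huge_fault, /* DAX PMD faults */
		.map_pages	= xfs_filemap_map_pages,
		.page_mkwrite	= xfs_filemap_page_mkwrite,
		.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
	};

So on a DAX vma, VM_HUGEPAGE is effectively a hint consumed by the
fault path rather than a request for khugepaged.)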

>
> 1410         return 0;
> 1411 }
>
>
> > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/khugepaged.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 82c71c6da9ce..a0e4fa33660e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -448,6 +448,10 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
> >       if (vm_flags & VM_NO_KHUGEPAGED)
> >               return false;
> >
> > +     /* Don't run khugepaged against DAX vma */
> > +     if (vma_is_dax(vma))
> > +             return false;
> > +
> >       if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
> >                               vma->vm_pgoff, HPAGE_PMD_NR))
> >               return false;
> > --
> > 2.26.3
> >
> >
>
> --
> Thank you, You are awesome!
> Hyeonggon :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
                   ` (9 preceding siblings ...)
  2022-03-18  1:29 ` Dave Chinner
@ 2022-03-24  1:47 ` Theodore Ts'o
  2022-03-24  2:46   ` Yang Shi
  10 siblings, 1 reply; 20+ messages in thread
From: Theodore Ts'o @ 2022-03-24  1:47 UTC (permalink / raw)
  To: Yang Shi
  Cc: vbabka, kirill.shutemov, linmiaohe, songliubraving, riel, willy,
	ziy, akpm, adilger.kernel, darrick.wong, linux-mm, linux-fsdevel,
	linux-ext4, linux-xfs, linux-kernel

On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> 
> The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> but I'd like to hear some comments before doing that.

Adding a hard-coded call to khugepaged_enter_file() in ext4 and xfs,
and potentially each file system, seems kludgy as all heck.  Is there
any reason not to simply call it in the mm code which calls f_op->mmap()?

    	       	  	      	 - Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent
  2022-03-24  1:47 ` Theodore Ts'o
@ 2022-03-24  2:46   ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-03-24  2:46 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vlastimil Babka, Kirill A. Shutemov, Miaohe Lin, Song Liu,
	Rik van Riel, Matthew Wilcox, Zi Yan, Andrew Morton,
	Andreas Dilger, darrick.wong, Linux MM,
	Linux FS-devel Mailing List, linux-ext4, linux-xfs,
	Linux Kernel Mailing List

On Wed, Mar 23, 2022 at 6:48 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, Mar 17, 2022 at 04:48:19PM -0700, Yang Shi wrote:
> >
> > The patch 1 ~ 7 are minor bug fixes, clean up and preparation patches.
> > The patch 8 converts ext4 and xfs.  We may need convert more filesystems,
> > but I'd like to hear some comments before doing that.
>
> Adding a hard-coded call to khugepaged_enter_file() in ext4 and xfs,
> and potentially each file system, seems kludgy as all heck.  Is there
> any reason not to simply call it in the mm code which calls f_op->mmap()?

Thanks, Ted. Very good point. I just didn't think of it. I think it is
doable. We may be able to clean up the code further.

>
>                                  - Ted
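
A rough sketch of what that could look like (hypothetical, not part of
the posted series; khugepaged_enter_file()'s exact signature here is
an assumption):

	/* mm/mmap.c, inside mmap_region() */
	if (file) {
		...
		vma->vm_file = get_file(file);
		error = call_mmap(file, vma);
		if (error)
			goto unmap_and_free_vma;
		/*
		 * Hypothetical: register suitable file-backed vmas with
		 * khugepaged once, here, instead of in every filesystem's
		 * ->mmap() implementation.
		 */
		khugepaged_enter_file(vma, vma->vm_flags);
		...
	}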

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-03-24  2:47 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-17 23:48 [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Yang Shi
2022-03-17 23:48 ` [v2 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE Yang Shi
2022-03-17 23:48 ` [v2 PATCH 2/8] mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED Yang Shi
2022-03-17 23:48 ` [v2 PATCH 3/8] mm: khugepaged: skip DAX vma Yang Shi
2022-03-21 12:04   ` Hyeonggon Yoo
2022-03-21 20:59     ` Yang Shi
2022-03-17 23:48 ` [v2 PATCH 4/8] mm: thp: only regular file could be THP eligible Yang Shi
2022-03-17 23:48 ` [v2 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function Yang Shi
2022-03-17 23:48 ` [v2 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c Yang Shi
2022-03-17 23:48 ` [v2 PATCH 7/8] mm: khugepaged: introduce khugepaged_enter_file() helper Yang Shi
2022-03-17 23:48 ` [v2 PATCH 8/8] fs: register suitable readonly vmas for khugepaged Yang Shi
2022-03-18  1:10 ` [v2 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent Song Liu
2022-03-18  1:29 ` Dave Chinner
2022-03-18  3:38   ` Matthew Wilcox
2022-03-18 18:04     ` Yang Shi
2022-03-18 18:48       ` Matthew Wilcox
2022-03-18 20:19         ` Yang Shi
2022-03-18 17:31   ` Yang Shi
2022-03-24  1:47 ` Theodore Ts'o
2022-03-24  2:46   ` Yang Shi
