* [PATCH v12 0/6] THP aware uprobe
@ 2019-08-07 23:37 Song Liu
  2019-08-07 23:37 ` [PATCH v12 1/6] mm: move memcmp_pages() and pages_identical() Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

This set makes uprobe aware of THPs.

Currently, when a uprobe is attached to text that lives on a THP, the page is
split by FOLL_SPLIT. As a result, attaching a uprobe eliminates the
performance benefit of THP.

This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

After all uprobes within the THP are removed, the PTE-mapped pages are
regrouped into a huge PMD.
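
For reference, a rough sketch of the resulting flow, using the function and
flag names from the patches in this set (simplified, not the exact kernel
code):

    /* install_breakpoint() -> uprobe_write_opcode(): split only the PMD;
     * the underlying compound page stays intact (patches 3 and 4)
     */
    ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                FOLL_FORCE | FOLL_SPLIT_PMD,
                                &old_page, &vma, NULL);

    /* on removing the last uprobe: the original file page is mapped back
     * (patch 2) and khugepaged is asked to re-collapse the PMD (patches 5
     * and 6)
     */
    if (!ret && orig_page_huge)
            collapse_pte_mapped_thp(mm, vaddr);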

This set (plus a few THP patches) is also available at

   https://github.com/liu-song-6/linux/tree/uprobe-thp


Changes v11.4 => v12
1. Combine the first 4 patches with the remaining 2 patches again in the same
   set.
2. Improve checks for the page in collapse_pte_mapped_thp() (Oleg).
3. Fix a build error without CONFIG_SHMEM.

v11.1 through v11.4 contained only the last two patches.

Changes v11.3 => v11.4:
1. Simplify locking for pte_mapped_thp (Oleg).
2. Improve checks for the page in collapse_pte_mapped_thp() (Oleg).
3. Move HPAGE_PMD_MASK to collapse_pte_mapped_thp() (kbuild test robot).

Changes v11.2 => v11.3:
1. Update vma/pmd check in collapse_pte_mapped_thp() (Oleg).
2. Add Acked-by from Kirill.

Changes v11.1 => v11.2:
1. Call collapse_pte_mapped_thp() directly from uprobe_write_opcode();
2. Add VM_BUG_ON() for addr alignment in khugepaged_add_pte_mapped_thp()
   and collapse_pte_mapped_thp().

Changes v9 => v10:
1. Patch 2/4: incorporate a suggestion by Oleg Nesterov.
2. Reword the change log of patch 4/4.

Changes v8 => v9:
1. To replace old_page with orig_page, only unmap old_page and let orig_page
   fault in (Oleg Nesterov).

Changes v7 => v8:
1. Check PageUptodate() for orig_page (Oleg Nesterov).

Changes v6 => v7:
1. Include Acked-by from Kirill A. Shutemov for the first 4 patches;
2. Keep only the first 4 patches (while I work on improving the last 2).

Changes v5 => v6:
1. Enable khugepaged to collapse pmd for pte-mapped THP
   (Kirill A. Shutemov).
2. uprobe asks khugepaged to collapse the pmd (Kirill A. Shutemov).

Note: The last two patches in the v6 set apply _after_ v7 of the set "Enable THP
      for text section of non-shmem files".

Changes v4 => v5:
1. Propagate pte_alloc() error out of follow_pmd_mask().

Changes since v3:
1. Simplify FOLL_SPLIT_PMD case in follow_pmd_mask(), (Kirill A. Shutemov)
2. Fix try_collapse_huge_pmd() to match change in follow_pmd_mask().

Changes since v2:
1. For FOLL_SPLIT_PMD, populate the page table in follow_pmd_mask().
2. Simplify logic in uprobe_write_opcode. (Oleg Nesterov)
3. Fix page refcount handling with FOLL_SPLIT_PMD.
4. Much more testing, together with THP on ext4 and btrfs (sending in
   separate set).
5. Rebased.

Changes since v1:
1. Introduce FOLL_SPLIT_PMD, instead of modifying split_huge_pmd*();
2. Reuse pages_identical() from ksm.c;
3. Rewrite most of try_collapse_huge_pmd().

Song Liu (6):
  mm: move memcmp_pages() and pages_identical()
  uprobe: use original page when all uprobes are removed
  mm, thp: introduce FOLL_SPLIT_PMD
  uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
  khugepaged: enable collapse pmd for pte-mapped THP
  uprobe: collapse THP pmd after removing all uprobes

 include/linux/khugepaged.h |  12 ++++
 include/linux/mm.h         |   8 +++
 kernel/events/uprobes.c    |  81 ++++++++++++++++-----
 mm/gup.c                   |   8 ++-
 mm/khugepaged.c            | 140 ++++++++++++++++++++++++++++++++++++-
 mm/ksm.c                   |  18 -----
 mm/util.c                  |  13 ++++
 7 files changed, 240 insertions(+), 40 deletions(-)

--
2.17.1


* [PATCH v12 1/6] mm: move memcmp_pages() and pages_identical()
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
@ 2019-08-07 23:37 ` Song Liu
  2019-08-07 23:37 ` [PATCH v12 2/6] uprobe: use original page when all uprobes are removed Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

This patch moves memcmp_pages() to mm/util.c and pages_identical() to
mm.h, so that we can use them in other files.
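
With the helpers exported, callers outside ksm.c can compare page contents
directly. An illustrative (hypothetical) helper, following the pattern that
patch 2 of this set uses:

    #include <linux/mm.h>

    /* Reuse the original file page only when its content already matches
     * the modified copy. pages_identical() is !memcmp_pages(), i.e. a
     * PAGE_SIZE byte compare of the two mapped pages.
     */
    static bool can_restore_orig_page(struct page *new_page,
                                      struct page *orig_page)
    {
            return PageUptodate(orig_page) &&
                   pages_identical(new_page, orig_page);
    }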

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h |  7 +++++++
 mm/ksm.c           | 18 ------------------
 mm/util.c          | 13 +++++++++++++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..f189176dabed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,5 +2891,12 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+extern int memcmp_pages(struct page *page1, struct page *page2);
+
+static inline int pages_identical(struct page *page1, struct page *page2)
+{
+	return !memcmp_pages(page1, page2);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dc4346411e4..dbee2eb4dd05 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1029,24 +1029,6 @@ static u32 calc_checksum(struct page *page)
 	return checksum;
 }
 
-static int memcmp_pages(struct page *page1, struct page *page2)
-{
-	char *addr1, *addr2;
-	int ret;
-
-	addr1 = kmap_atomic(page1);
-	addr2 = kmap_atomic(page2);
-	ret = memcmp(addr1, addr2, PAGE_SIZE);
-	kunmap_atomic(addr2);
-	kunmap_atomic(addr1);
-	return ret;
-}
-
-static inline int pages_identical(struct page *page1, struct page *page2)
-{
-	return !memcmp_pages(page1, page2);
-}
-
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
diff --git a/mm/util.c b/mm/util.c
index e6351a80f248..0d5e2f425612 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -783,3 +783,16 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
 out:
 	return res;
 }
+
+int memcmp_pages(struct page *page1, struct page *page2)
+{
+	char *addr1, *addr2;
+	int ret;
+
+	addr1 = kmap_atomic(page1);
+	addr2 = kmap_atomic(page2);
+	ret = memcmp(addr1, addr2, PAGE_SIZE);
+	kunmap_atomic(addr2);
+	kunmap_atomic(addr1);
+	return ret;
+}
-- 
2.17.1



* [PATCH v12 2/6] uprobe: use original page when all uprobes are removed
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
  2019-08-07 23:37 ` [PATCH v12 1/6] mm: move memcmp_pages() and pages_identical() Song Liu
@ 2019-08-07 23:37 ` Song Liu
  2019-08-07 23:37 ` [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

Currently, uprobe swaps the target page with an anonymous page in both
install_breakpoint() and remove_breakpoint(). As a result, when all uprobes
on a page are removed, the given mm is still using an anonymous page (not
the original page).

This patch allows uprobe to use the original page when possible, i.e. when
all uprobes on the page have been removed and the original page is in the
page cache and uptodate.

As suggested by Oleg, we unmap the old_page and let the original page
fault in.
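
In other words, __replace_page() now has two modes (a sketch of the calling
convention, matching the diff below):

    /* register, or unregister while the page still differs from the
     * original: install the modified anonymous copy in place of old_page
     */
    ret = __replace_page(vma, vaddr, old_page, new_page);

    /* last uprobe removed and the pristine file page is in the page cache:
     * only unmap old_page; the original page will fault back in on the
     * next access
     */
    ret = __replace_page(vma, vaddr, old_page, NULL);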

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 66 +++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..648f47553bff 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -143,10 +143,12 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
  *
  * @vma:      vma that holds the pte pointing to page
  * @addr:     address the old @page is mapped at
- * @page:     the cowed page we are replacing by kpage
- * @kpage:    the modified page we replace page by
+ * @old_page: the page we are replacing by new_page
+ * @new_page: the modified page we replace page by
  *
- * Returns 0 on success, -EFAULT on failure.
+ * If @new_page is NULL, only unmap @old_page.
+ *
+ * Returns 0 on success, negative error code otherwise.
  */
 static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 				struct page *old_page, struct page *new_page)
@@ -166,10 +168,12 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page);
 
-	err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL, &memcg,
-			false);
-	if (err)
-		return err;
+	if (new_page) {
+		err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+					    &memcg, false);
+		if (err)
+			return err;
+	}
 
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(old_page);
@@ -177,15 +181,20 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	mmu_notifier_invalidate_range_start(&range);
 	err = -EAGAIN;
 	if (!page_vma_mapped_walk(&pvmw)) {
-		mem_cgroup_cancel_charge(new_page, memcg, false);
+		if (new_page)
+			mem_cgroup_cancel_charge(new_page, memcg, false);
 		goto unlock;
 	}
 	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
 
-	get_page(new_page);
-	page_add_new_anon_rmap(new_page, vma, addr, false);
-	mem_cgroup_commit_charge(new_page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(new_page, vma);
+	if (new_page) {
+		get_page(new_page);
+		page_add_new_anon_rmap(new_page, vma, addr, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
+	} else
+		/* no new page, just dec_mm_counter for old_page */
+		dec_mm_counter(mm, MM_ANONPAGES);
 
 	if (!PageAnon(old_page)) {
 		dec_mm_counter(mm, mm_counter_file(old_page));
@@ -194,8 +203,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
-	set_pte_at_notify(mm, addr, pvmw.pte,
-			mk_pte(new_page, vma->vm_page_prot));
+	if (new_page)
+		set_pte_at_notify(mm, addr, pvmw.pte,
+				  mk_pte(new_page, vma->vm_page_prot));
 
 	page_remove_rmap(old_page, false);
 	if (!page_mapped(old_page))
@@ -488,6 +498,10 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 		ref_ctr_updated = 1;
 	}
 
+	ret = 0;
+	if (!is_register && !PageAnon(old_page))
+		goto put_old;
+
 	ret = anon_vma_prepare(vma);
 	if (ret)
 		goto put_old;
@@ -501,8 +515,30 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	copy_highpage(new_page, old_page);
 	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
+	if (!is_register) {
+		struct page *orig_page;
+		pgoff_t index;
+
+		VM_BUG_ON_PAGE(!PageAnon(old_page), old_page);
+
+		index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT;
+		orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
+					  index);
+
+		if (orig_page) {
+			if (PageUptodate(orig_page) &&
+			    pages_identical(new_page, orig_page)) {
+				/* let go new_page */
+				put_page(new_page);
+				new_page = NULL;
+			}
+			put_page(orig_page);
+		}
+	}
+
 	ret = __replace_page(vma, vaddr, old_page, new_page);
-	put_page(new_page);
+	if (new_page)
+		put_page(new_page);
 put_old:
 	put_page(old_page);
 
-- 
2.17.1



* [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
  2019-08-07 23:37 ` [PATCH v12 1/6] mm: move memcmp_pages() and pages_identical() Song Liu
  2019-08-07 23:37 ` [PATCH v12 2/6] uprobe: use original page when all uprobes are removed Song Liu
@ 2019-08-07 23:37 ` Song Liu
  2019-08-08 16:37   ` Oleg Nesterov
  2019-08-07 23:37 ` [PATCH v12 4/6] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

This patch introduces a new FOLL flag: FOLL_SPLIT_PMD. As the name suggests,
FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct; the underlying
huge page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need regular pages for a while,
but would like to switch back to the huge page and huge pmd later on. One
such example is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
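
The difference between the two flags, in terms of what follow_pmd_mask() ends
up doing for a PMD-mapped file THP (a condensed sketch of the diff below, not
new code):

    if (flags & FOLL_SPLIT) {
            /* split the compound page itself; the THP is gone */
            split_huge_page(page);
    } else if (flags & FOLL_SPLIT_PMD) {
            /* split only the mapping: the PMD becomes a PTE table whose
             * entries still point into the intact compound page; the
             * explicit pte_alloc() populates the page table in case
             * split_huge_pmd() left the PMD empty
             */
            split_huge_pmd(vma, pmd, address);
            ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
    }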

Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h | 1 +
 mm/gup.c           | 8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f189176dabed..74db879711eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2614,6 +2614,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..c20afe800b3f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT) {
+	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
 		page = pmd_page(*pmd);
 		if (is_huge_zero_page(page)) {
@@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			split_huge_pmd(vma, pmd, address);
 			if (pmd_trans_unstable(pmd))
 				ret = -EBUSY;
-		} else {
+		} else if (flags & FOLL_SPLIT) {
 			if (unlikely(!try_get_page(page))) {
 				spin_unlock(ptl);
 				return ERR_PTR(-ENOMEM);
@@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			put_page(page);
 			if (pmd_none(*pmd))
 				return no_page_table(vma, flags);
+		} else {  /* flags & FOLL_SPLIT_PMD */
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, address);
+			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
 		}
 
 		return ret ? ERR_PTR(ret) :
-- 
2.17.1



* [PATCH v12 4/6] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
                   ` (2 preceding siblings ...)
  2019-08-07 23:37 ` [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
@ 2019-08-07 23:37 ` Song Liu
  2019-08-07 23:37 ` [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP Song Liu
  2019-08-07 23:37 ` [PATCH v12 6/6] uprobe: collapse THP pmd after removing all uprobes Song Liu
  5 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

This patch uses the newly added FOLL_SPLIT_PMD in uprobe. This preserves the
huge page when the uprobe is enabled. When the uprobe is disabled, newer
instances of the same application can still benefit from the huge page.

As the next step, we will enable khugepaged to regroup the pmd, so that
existing instances of the application can also benefit from the huge page
after the uprobe is disabled.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 648f47553bff..27b596f14463 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -155,7 +155,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page_vma_mapped_walk pvmw = {
-		.page = old_page,
+		.page = compound_head(old_page),
 		.vma = vma,
 		.address = addr,
 	};
@@ -166,8 +166,6 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
 
-	VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page);
-
 	if (new_page) {
 		err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
 					    &memcg, false);
@@ -481,7 +479,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 retry:
 	/* Read the page with vaddr into memory */
 	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-			FOLL_FORCE | FOLL_SPLIT, &old_page, &vma, NULL);
+			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
 	if (ret <= 0)
 		return ret;
 
-- 
2.17.1



* [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
                   ` (3 preceding siblings ...)
  2019-08-07 23:37 ` [PATCH v12 4/6] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
@ 2019-08-07 23:37 ` Song Liu
  2019-08-08 16:33   ` Oleg Nesterov
  2019-08-07 23:37 ` [PATCH v12 6/6] uprobe: collapse THP pmd after removing all uprobes Song Liu
  5 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

khugepaged needs the exclusive mmap_sem to access the page table. When it
fails to lock mmap_sem, the page will fault in as a pte-mapped THP. As the
page is already a THP, khugepaged will not handle this pmd again.

This patch enables khugepaged to retry collapsing the page table.

struct mm_slot (in khugepaged.c) is extended with an array containing the
addresses of pte-mapped THPs. We use an array here for simplicity; it can
easily be replaced with a more advanced data structure when needed.

In khugepaged_scan_mm_slot(), if the mm contains a pte-mapped THP, we try to
collapse the page table.

Since the collapse may happen at a later time, some pages may have already
faulted in. collapse_pte_mapped_thp() is added to properly handle these
pages. collapse_pte_mapped_thp() also double checks whether all ptes in this
pmd map to the same THP. This is necessary because some subpages of the THP
may have been replaced, for example by uprobe. In such cases, it is not
possible to collapse the pmd.
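
A rough picture of how the pieces below fit together (a sketch, not the
exact code):

    /* retract_page_tables() could not take mmap_sem for write:
     * remember the address so that khugepaged can retry later
     */
    khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);

    /* later, from khugepaged_scan_mm_slot(): takes mmap_sem for write and
     * calls collapse_pte_mapped_thp(mm, addr) for each recorded address -
     * check that every mapped PTE points into the same THP, adjust rmap
     * and the mm counters, then collapse the pmd
     */
    khugepaged_collapse_pte_mapped_thps(mm_slot);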

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/khugepaged.h |  12 ++++
 mm/khugepaged.c            | 140 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 082d1d2a5216..bc45ea1efbf7 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,14 @@ extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 				      unsigned long vm_flags);
+#ifdef CONFIG_SHMEM
+extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
+#else
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+					   unsigned long addr)
+{
+}
+#endif
 
 #define khugepaged_enabled()					       \
 	(transparent_hugepage_flags &				       \
@@ -73,6 +81,10 @@ static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 {
 	return 0;
 }
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+					   unsigned long addr)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40c25ddf29e4..208ea1ce204a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -77,6 +77,8 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
+#define MAX_PTE_MAPPED_THP 8
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -87,6 +89,10 @@ struct mm_slot {
 	struct hlist_node hash;
 	struct list_head mm_node;
 	struct mm_struct *mm;
+
+	/* pte-mapped THP in this mm */
+	int nr_pte_mapped_thp;
+	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1254,6 +1260,131 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 }
 
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
+/*
+ * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
+ * khugepaged should try to collapse the page table.
+ */
+static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+					 unsigned long addr)
+{
+	struct mm_slot *mm_slot;
+
+	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
+		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
+	spin_unlock(&khugepaged_mm_lock);
+	return 0;
+}
+
+/**
+ * Try to collapse a pte-mapped THP for mm at address haddr.
+ *
+ * This function checks whether all the PTEs in the PMD are pointing to the
+ * right THP. If so, retract the page table so the THP can refault in with
+ * as pmd-mapped.
+ */
+void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	struct vm_area_struct *vma = find_vma(mm, haddr);
+	struct page *hpage = NULL;
+	pmd_t *pmd, _pmd;
+	spinlock_t *ptl;
+	int count = 0;
+	int i;
+
+	if (!vma || !vma->vm_file ||
+	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
+		return;
+
+	/*
+	 * This vm_flags may not have VM_HUGEPAGE if the page was not
+	 * collapsed by this mm. But we can still collapse if the page is
+	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
+	 * will not fail the vma for missing VM_HUGEPAGE
+	 */
+	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
+		return;
+
+	pmd = mm_find_pmd(mm, haddr);
+	if (!pmd)
+		return;
+
+	/* step 1: check all mapped PTEs are to the right huge page */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+		struct page *page;
+
+		if (pte_none(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+
+		if (!page || !PageCompound(page))
+			return;
+
+		if (!hpage) {
+			hpage = compound_head(page);
+			if (hpage->mapping != vma->vm_file->f_mapping)
+				return;
+		}
+
+		if (hpage + i != page)
+			return;
+		count++;
+	}
+
+	/* step 2: adjust rmap */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+		struct page *page;
+
+		if (pte_none(*pte))
+			continue;
+		page = vm_normal_page(vma, addr, *pte);
+		page_remove_rmap(page, false);
+	}
+
+	/* step 3: set proper refcount and mm_counters. */
+	if (hpage) {
+		page_ref_sub(hpage, count);
+		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	}
+
+	/* step 4: collapse pmd */
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	_pmd = pmdp_collapse_flush(vma, addr, pmd);
+	spin_unlock(ptl);
+	mm_dec_nr_ptes(mm);
+	pte_free(mm, pmd_pgtable(_pmd));
+}
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+	int i;
+
+	if (likely(mm_slot->nr_pte_mapped_thp == 0))
+		return 0;
+
+	if (!down_write_trylock(&mm->mmap_sem))
+		return -EBUSY;
+
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
+		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
+
+out:
+	mm_slot->nr_pte_mapped_thp = 0;
+	up_write(&mm->mmap_sem);
+	return 0;
+}
+
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -1287,7 +1418,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			up_write(&vma->vm_mm->mmap_sem);
 			mm_dec_nr_ptes(vma->vm_mm);
 			pte_free(vma->vm_mm, pmd_pgtable(_pmd));
-		}
+		} else
+			khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
 	}
 	i_mmap_unlock_write(mapping);
 }
@@ -1709,6 +1841,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 {
 	BUILD_BUG();
 }
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+	return 0;
+}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
@@ -1733,6 +1870,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
+	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = mm_slot->mm;
 	/*
-- 
2.17.1



* [PATCH v12 6/6] uprobe: collapse THP pmd after removing all uprobes
  2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
                   ` (4 preceding siblings ...)
  2019-08-07 23:37 ` [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP Song Liu
@ 2019-08-07 23:37 ` Song Liu
  5 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-07 23:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, akpm
  Cc: matthew.wilcox, kirill.shutemov, oleg, kernel-team,
	william.kucharski, srikar, Song Liu

After all uprobes are removed from the huge page (which is now mapped with a
PTE page table), it is possible to collapse the pmd and benefit from THP
again. This patch does the collapse by calling collapse_pte_mapped_thp().

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 27b596f14463..94d38a39d72e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/task_work.h>
 #include <linux/shmem_fs.h>
+#include <linux/khugepaged.h>
 
 #include <linux/uprobes.h>
 
@@ -472,6 +473,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	struct page *old_page, *new_page;
 	struct vm_area_struct *vma;
 	int ret, is_register, ref_ctr_updated = 0;
+	bool orig_page_huge = false;
 
 	is_register = is_swbp_insn(&opcode);
 	uprobe = container_of(auprobe, struct uprobe, arch);
@@ -529,6 +531,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 				/* let go new_page */
 				put_page(new_page);
 				new_page = NULL;
+
+				if (PageCompound(orig_page))
+					orig_page_huge = true;
 			}
 			put_page(orig_page);
 		}
@@ -547,6 +552,10 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret && is_register && ref_ctr_updated)
 		update_ref_ctr(uprobe, mm, -1);
 
+	/* try collapse pmd for compound page */
+	if (!ret && orig_page_huge)
+		collapse_pte_mapped_thp(mm, vaddr);
+
 	return ret;
 }
 
-- 
2.17.1



* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-07 23:37 ` [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP Song Liu
@ 2019-08-08 16:33   ` Oleg Nesterov
  2019-08-08 17:05     ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-08 16:33 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-mm, akpm, matthew.wilcox, kirill.shutemov,
	kernel-team, william.kucharski, srikar

On 08/07, Song Liu wrote:
>
> +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned long haddr = addr & HPAGE_PMD_MASK;
> +	struct vm_area_struct *vma = find_vma(mm, haddr);
> +	struct page *hpage = NULL;
> +	pmd_t *pmd, _pmd;
> +	spinlock_t *ptl;
> +	int count = 0;
> +	int i;
> +
> +	if (!vma || !vma->vm_file ||
> +	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
> +		return;
> +
> +	/*
> +	 * This vm_flags may not have VM_HUGEPAGE if the page was not
> +	 * collapsed by this mm. But we can still collapse if the page is
> +	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
> +	 * will not fail the vma for missing VM_HUGEPAGE
> +	 */
> +	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> +		return;
> +
> +	pmd = mm_find_pmd(mm, haddr);

OK, I do not see anything really wrong...

a couple of questions below.

> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> +		pte_t *pte = pte_offset_map(pmd, addr);
> +		struct page *page;
> +
> +		if (pte_none(*pte))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *pte);
> +
> +		if (!page || !PageCompound(page))
> +			return;
> +
> +		if (!hpage) {
> +			hpage = compound_head(page);

OK,

> +			if (hpage->mapping != vma->vm_file->f_mapping)
> +				return;

is it really possible? May be WARN_ON(hpage->mapping != vm_file->f_mapping)
makes more sense ?

> +		if (hpage + i != page)
> +			return;

ditto.

Oleg.



* Re: [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-08-07 23:37 ` [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
@ 2019-08-08 16:37   ` Oleg Nesterov
  2019-08-08 17:16     ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-08 16:37 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-mm, akpm, matthew.wilcox, kirill.shutemov,
	kernel-team, william.kucharski, srikar

On 08/07, Song Liu wrote:
>
> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>  		int ret;
>  		page = pmd_page(*pmd);
>  		if (is_huge_zero_page(page)) {
> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			split_huge_pmd(vma, pmd, address);
>  			if (pmd_trans_unstable(pmd))
>  				ret = -EBUSY;
> -		} else {
> +		} else if (flags & FOLL_SPLIT) {
>  			if (unlikely(!try_get_page(page))) {
>  				spin_unlock(ptl);
>  				return ERR_PTR(-ENOMEM);
> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			put_page(page);
>  			if (pmd_none(*pmd))
>  				return no_page_table(vma, flags);
> +		} else {  /* flags & FOLL_SPLIT_PMD */
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, address);
> +			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
>  		}

Can't resist, let me repeat that I do not like this patch because imo
it complicates this code for no reason.

But I can't insist and of course I could miss something.

Oleg.



* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-08 16:33   ` Oleg Nesterov
@ 2019-08-08 17:05     ` Song Liu
  2019-08-09 15:24       ` Oleg Nesterov
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-08 17:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar



> On Aug 8, 2019, at 9:33 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 08/07, Song Liu wrote:
>> 
>> +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>> +{
>> +	unsigned long haddr = addr & HPAGE_PMD_MASK;
>> +	struct vm_area_struct *vma = find_vma(mm, haddr);
>> +	struct page *hpage = NULL;
>> +	pmd_t *pmd, _pmd;
>> +	spinlock_t *ptl;
>> +	int count = 0;
>> +	int i;
>> +
>> +	if (!vma || !vma->vm_file ||
>> +	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
>> +		return;
>> +
>> +	/*
>> +	 * This vm_flags may not have VM_HUGEPAGE if the page was not
>> +	 * collapsed by this mm. But we can still collapse if the page is
>> +	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
>> +	 * will not fail the vma for missing VM_HUGEPAGE
>> +	 */
>> +	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
>> +		return;
>> +
>> +	pmd = mm_find_pmd(mm, haddr);
> 
> OK, I do not see anything really wrong...
> 
> a couple of questions below.
> 
>> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>> +		pte_t *pte = pte_offset_map(pmd, addr);
>> +		struct page *page;
>> +
>> +		if (pte_none(*pte))
>> +			continue;
>> +
>> +		page = vm_normal_page(vma, addr, *pte);
>> +
>> +		if (!page || !PageCompound(page))
>> +			return;
>> +
>> +		if (!hpage) {
>> +			hpage = compound_head(page);
> 
> OK,
> 
>> +			if (hpage->mapping != vma->vm_file->f_mapping)
>> +				return;
> 
> is it really possible? May be WARN_ON(hpage->mapping != vm_file->f_mapping)
> makes more sense ?

I haven't found code paths lead to this, but this is technically possible. 
This pmd could contain subpages from different THPs. The __replace_page() 
function in uprobes.c creates similar pmd. 

Current uprobe code won't really create this problem, because 
!PageCompound() check above is sufficient. But it won't be difficult to 
modify uprobe code to break this. For this code to be accurate and safe, 
I think both this check and the one below are necessary. Also, this code 
is not on any critical path, so the overhead should be negligible. 

Does this make sense?

Thanks,
Song


* Re: [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-08-08 16:37   ` Oleg Nesterov
@ 2019-08-08 17:16     ` Song Liu
  2019-08-09 16:35       ` Oleg Nesterov
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-08 17:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, akpm, matthew.wilcox,
	kirill.shutemov, Kernel Team, william.kucharski, srikar



> On Aug 8, 2019, at 9:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 08/07, Song Liu wrote:
>> 
>> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 		spin_unlock(ptl);
>> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>> 	}
>> -	if (flags & FOLL_SPLIT) {
>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>> 		int ret;
>> 		page = pmd_page(*pmd);
>> 		if (is_huge_zero_page(page)) {
>> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			split_huge_pmd(vma, pmd, address);
>> 			if (pmd_trans_unstable(pmd))
>> 				ret = -EBUSY;
>> -		} else {
>> +		} else if (flags & FOLL_SPLIT) {
>> 			if (unlikely(!try_get_page(page))) {
>> 				spin_unlock(ptl);
>> 				return ERR_PTR(-ENOMEM);
>> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			put_page(page);
>> 			if (pmd_none(*pmd))
>> 				return no_page_table(vma, flags);
>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, address);
>> +			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
>> 		}
> 
> Can't resist, let me repeat that I do not like this patch because imo
> it complicates this code for no reason.

Personally, I don't think this is more complicated than your version. 
This patch is safe as it doesn't change any code for is_huge_zero_page() 
case. 

Also, if some code calls follow_pmd_mask() with flags contains both 
FOLL_SPLIT and FOLL_SPLIT_PMD, we should honor FOLL_SPLIT and split the
huge page. Of course, there is no code that sets both flags.

Does this resolve your concern here?

Thanks,
Song



* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-08 17:05     ` Song Liu
@ 2019-08-09 15:24       ` Oleg Nesterov
  2019-08-09 16:30         ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-09 15:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/08, Song Liu wrote:
>
> > On Aug 8, 2019, at 9:33 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> >> +		pte_t *pte = pte_offset_map(pmd, addr);
> >> +		struct page *page;
> >> +
> >> +		if (pte_none(*pte))
> >> +			continue;
> >> +
> >> +		page = vm_normal_page(vma, addr, *pte);

just noticed... shouldn't you also check pte_present() before
vm_normal_page() ?

> >> +		if (!page || !PageCompound(page))
> >> +			return;
> >> +
> >> +		if (!hpage) {
> >> +			hpage = compound_head(page);
> >
> > OK,
> >
> >> +			if (hpage->mapping != vma->vm_file->f_mapping)
> >> +				return;
> >
> > is it really possible? May be WARN_ON(hpage->mapping != vm_file->f_mapping)
> > makes more sense ?
>
> I haven't found code paths lead to this,

Neither me, that is why I asked. I think this should not be possible,
but again this is not my area.

> but this is technically possible.
> This pmd could contain subpages from different THPs.

Then please explain how this can happen ?

> The __replace_page()
> function in uprobes.c creates similar pmd.

No it doesn't,

> Current uprobe code won't really create this problem, because
> !PageCompound() check above is sufficient. But it won't be difficult to
> modify uprobe code to break this.

I bet it will be a) difficult and b) the very idea to do this would be wrong.

> For this code to be accurate and safe,
> I think both this check and the one below are necessary.

I didn't suggest to remove these checks.

> Also, this code
> is not on any critical path, so the overhead should be negligible.

I do not care about overhead. But I do care about a poor reader like me
who will try to understand this code.

If you too do not understand how a THP page can have a different mapping
then use VM_WARN or at least add a comment to explain that this is not
supposed to happen!

> Does this make sense?

Not to me :/

Oleg.



* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-09 15:24       ` Oleg Nesterov
@ 2019-08-09 16:30         ` Song Liu
  2019-08-09 18:01           ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-09 16:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar



> On Aug 9, 2019, at 8:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 08/08, Song Liu wrote:
>> 
>>> On Aug 8, 2019, at 9:33 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>>> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>>>> +		pte_t *pte = pte_offset_map(pmd, addr);
>>>> +		struct page *page;
>>>> +
>>>> +		if (pte_none(*pte))
>>>> +			continue;
>>>> +
>>>> +		page = vm_normal_page(vma, addr, *pte);
> 
> just noticed... shouldn't you also check pte_present() before
> vm_normal_page() ?

Good catch! Let me fix this. 

> 
>>>> +		if (!page || !PageCompound(page))
>>>> +			return;
>>>> +
>>>> +		if (!hpage) {
>>>> +			hpage = compound_head(page);
>>> 
>>> OK,
>>> 
>>>> +			if (hpage->mapping != vma->vm_file->f_mapping)
>>>> +				return;
>>> 
>>> is it really possible? May be WARN_ON(hpage->mapping != vm_file->f_mapping)
>>> makes more sense ?
>> 
>> I haven't found code paths lead to this,
> 
> Neither me, that is why I asked. I think this should not be possible,
> but again this is not my area.
> 
>> but this is technically possible.
>> This pmd could contain subpages from different THPs.
> 
> Then please explain how this can happen ?
> 
>> The __replace_page()
>> function in uprobes.c creates similar pmd.
> 
> No it doesn't,
> 
>> Current uprobe code won't really create this problem, because
>> !PageCompound() check above is sufficient. But it won't be difficult to
>> modify uprobe code to break this.
> 
> I bet it will be a) difficult and b) the very idea to do this would be wrong.
> 
>> For this code to be accurate and safe,
>> I think both this check and the one below are necessary.
> 
> I didn't suggest to remove these checks.
> 
>> Also, this code
>> is not on any critical path, so the overhead should be negligible.
> 
> I do not care about overhead. But I do care about a poor reader like me
> who will try to understand this code.
> 
> If you too do not understand how a THP page can have a different mapping
> then use VM_WARN or at least add a comment to explain that this is not
> supposed to happen!

Fair enough. I will add WARN and more comments. 

Thanks,
Song



* Re: [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-08-08 17:16     ` Song Liu
@ 2019-08-09 16:35       ` Oleg Nesterov
  2019-08-09 16:50         ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-09 16:35 UTC (permalink / raw)
  To: Song Liu
  Cc: Linux Kernel Mailing List, Linux MM, akpm, matthew.wilcox,
	kirill.shutemov, Kernel Team, william.kucharski, srikar

On 08/08, Song Liu wrote:
>
> > On Aug 8, 2019, at 9:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/07, Song Liu wrote:
> >>
> >> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 		spin_unlock(ptl);
> >> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >> 	}
> >> -	if (flags & FOLL_SPLIT) {
> >> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> >> 		int ret;
> >> 		page = pmd_page(*pmd);
> >> 		if (is_huge_zero_page(page)) {
> >> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 			split_huge_pmd(vma, pmd, address);
> >> 			if (pmd_trans_unstable(pmd))
> >> 				ret = -EBUSY;
> >> -		} else {
> >> +		} else if (flags & FOLL_SPLIT) {
> >> 			if (unlikely(!try_get_page(page))) {
> >> 				spin_unlock(ptl);
> >> 				return ERR_PTR(-ENOMEM);
> >> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >> 			put_page(page);
> >> 			if (pmd_none(*pmd))
> >> 				return no_page_table(vma, flags);
> >> +		} else {  /* flags & FOLL_SPLIT_PMD */
> >> +			spin_unlock(ptl);
> >> +			split_huge_pmd(vma, pmd, address);
> >> +			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> >> 		}
> >
> > Can't resist, let me repeat that I do not like this patch because imo
> > it complicates this code for no reason.
>
> Personally, I don't think this is more complicated than your version.

I do, but of course this is subjective.

> Also, if some code calls follow_pmd_mask() with flags contains both
> FOLL_SPLIT and FOLL_SPLIT_PMD, we should honor FOLL_SPLIT and split the
> huge page.

Heh. why not other way around?

> Of course, there is no code that sets both flags.

and of course, nobody should ever pass both FOLL_SPLIT and FOLL_SPLIT_PMD,
perhaps this deserves a warning.

Not to mention that it would be nice to kill FOLL_SPLIT which has a single
user, but this is another story.

Oleg.



* Re: [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-08-09 16:35       ` Oleg Nesterov
@ 2019-08-09 16:50         ` Song Liu
  0 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-09 16:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, akpm, matthew.wilcox,
	kirill.shutemov, Kernel Team, william.kucharski, srikar



> On Aug 9, 2019, at 9:35 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 08/08, Song Liu wrote:
>> 
>>> On Aug 8, 2019, at 9:37 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> 
>>> On 08/07, Song Liu wrote:
>>>> 
>>>> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 		spin_unlock(ptl);
>>>> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>>>> 	}
>>>> -	if (flags & FOLL_SPLIT) {
>>>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>>>> 		int ret;
>>>> 		page = pmd_page(*pmd);
>>>> 		if (is_huge_zero_page(page)) {
>>>> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 			split_huge_pmd(vma, pmd, address);
>>>> 			if (pmd_trans_unstable(pmd))
>>>> 				ret = -EBUSY;
>>>> -		} else {
>>>> +		} else if (flags & FOLL_SPLIT) {
>>>> 			if (unlikely(!try_get_page(page))) {
>>>> 				spin_unlock(ptl);
>>>> 				return ERR_PTR(-ENOMEM);
>>>> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>> 			put_page(page);
>>>> 			if (pmd_none(*pmd))
>>>> 				return no_page_table(vma, flags);
>>>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>>>> +			spin_unlock(ptl);
>>>> +			split_huge_pmd(vma, pmd, address);
>>>> +			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
>>>> 		}
>>> 
>>> Can't resist, let me repeat that I do not like this patch because imo
>>> it complicates this code for no reason.
>> 
>> Personally, I don't think this is more complicated than your version.
> 
> I do, but of course this is subjective.
> 
>> Also, if some code calls follow_pmd_mask() with flags contains both
>> FOLL_SPLIT and FOLL_SPLIT_PMD, we should honor FOLL_SPLIT and split the
>> huge page.
> 
> Heh. why not other way around?

Because FOLL_SPLIT splits both the page and the pmd. FOLL_SPLIT_PMD 
only splits the pmd, so it is a subset of FOLL_SPLIT. When the user
sets both, we should split both the page and the pmd. 

Thanks,
Song



* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-09 16:30         ` Song Liu
@ 2019-08-09 18:01           ` Song Liu
  2019-08-12 12:11             ` Kirill A. Shutemov
  2019-08-12 13:06             ` Oleg Nesterov
  0 siblings, 2 replies; 30+ messages in thread
From: Song Liu @ 2019-08-09 18:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar



> On Aug 9, 2019, at 9:30 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Aug 9, 2019, at 8:24 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> 
>> On 08/08, Song Liu wrote:
>>> 
>>>> On Aug 8, 2019, at 9:33 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>>> 
>>>>> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>>>>> +		pte_t *pte = pte_offset_map(pmd, addr);
>>>>> +		struct page *page;
>>>>> +
>>>>> +		if (pte_none(*pte))
>>>>> +			continue;
>>>>> +
>>>>> +		page = vm_normal_page(vma, addr, *pte);
>> 
>> just noticed... shouldn't you also check pte_present() before
>> vm_normal_page() ?
> 
> Good catch! Let me fix this. 
> 
>> 
>>>>> +		if (!page || !PageCompound(page))
>>>>> +			return;
>>>>> +
>>>>> +		if (!hpage) {
>>>>> +			hpage = compound_head(page);
>>>> 
>>>> OK,
>>>> 
>>>>> +			if (hpage->mapping != vma->vm_file->f_mapping)
>>>>> +				return;
>>>> 
>>>> is it really possible? May be WARN_ON(hpage->mapping != vm_file->f_mapping)
>>>> makes more sense ?
>>> 
>>> I haven't found code paths lead to this,
>> 
>> Neither me, that is why I asked. I think this should not be possible,
>> but again this is not my area.
>> 
>>> but this is technically possible.
>>> This pmd could contain subpages from different THPs.
>> 
>> Then please explain how this can happen ?
>> 
>>> The __replace_page()
>>> function in uprobes.c creates similar pmd.
>> 
>> No it doesn't,
>> 
>>> Current uprobe code won't really create this problem, because
>>> !PageCompound() check above is sufficient. But it won't be difficult to
>>> modify uprobe code to break this.
>> 
>> I bet it will be a) difficult and b) the very idea to do this would be wrong.
>> 
>>> For this code to be accurate and safe,
>>> I think both this check and the one below are necessary.
>> 
>> I didn't suggest to remove these checks.
>> 
>>> Also, this code
>>> is not on any critical path, so the overhead should be negligible.
>> 
>> I do not care about overhead. But I do care about a poor reader like me
>> who will try to understand this code.
>> 
>> If you too do not understand how a THP page can have a different mapping
>> then use VM_WARN or at least add a comment to explain that this is not
>> supposed to happen!
> 
> Fair enough. I will add WARN and more comments. 
> 
> Thanks,
> Song

To reduce spamming, I attached the updated 5/6 here.

Thanks,
Song

====================== 8< =============================

From 3fb735e03b149bf8a90918dd383a3a31b3f9008a Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Sun, 28 Jul 2019 03:43:48 -0700
Subject: [PATCH v13 5/6] khugepaged: enable collapse pmd for pte-mapped THP

khugepaged needs the exclusive mmap_sem to access the page table. When it
fails to lock mmap_sem, the page will fault in as a pte-mapped THP. As the
page is already a THP, khugepaged will not handle this pmd again.

This patch enables khugepaged to retry collapsing the page table.

struct mm_slot (in khugepaged.c) is extended with an array containing the
addresses of pte-mapped THPs. We use an array here for simplicity; it can
easily be replaced with a more advanced data structure when needed.

In khugepaged_scan_mm_slot(), if the mm contains a pte-mapped THP, we try to
collapse the page table.

Since the collapse may happen at a later time, some pages may have already
faulted in. collapse_pte_mapped_thp() is added to properly handle these
pages. collapse_pte_mapped_thp() also double checks whether all ptes in this
pmd map to the same THP. This is necessary because some subpages of the THP
may have been replaced, for example by uprobe. In such cases, it is not
possible to collapse the pmd.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/khugepaged.h |  12 +++
 mm/khugepaged.c            | 154 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 165 insertions(+), 1 deletion(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 082d1d2a5216..bc45ea1efbf7 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,14 @@ extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 				      unsigned long vm_flags);
+#ifdef CONFIG_SHMEM
+extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
+#else
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+					   unsigned long addr)
+{
+}
+#endif
 
 #define khugepaged_enabled()					       \
 	(transparent_hugepage_flags &				       \
@@ -73,6 +81,10 @@ static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 {
 	return 0;
 }
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+					   unsigned long addr)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40c25ddf29e4..3e722065e909 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -77,6 +77,8 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
+#define MAX_PTE_MAPPED_THP 8
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -87,6 +89,10 @@ struct mm_slot {
 	struct hlist_node hash;
 	struct list_head mm_node;
 	struct mm_struct *mm;
+
+	/* pte-mapped THP in this mm */
+	int nr_pte_mapped_thp;
+	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1254,6 +1260,145 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 }
 
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
+/*
+ * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
+ * khugepaged should try to collapse the page table.
+ */
+static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+					 unsigned long addr)
+{
+	struct mm_slot *mm_slot;
+
+	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
+		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
+	spin_unlock(&khugepaged_mm_lock);
+	return 0;
+}
+
+/**
+ * Try to collapse a pte-mapped THP for mm at address haddr.
+ *
+ * This function checks whether all the PTEs in the PMD are pointing to the
+ * right THP. If so, retract the page table so the THP can refault in with
+ * as pmd-mapped.
+ */
+void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	struct vm_area_struct *vma = find_vma(mm, haddr);
+	struct page *hpage = NULL;
+	pmd_t *pmd, _pmd;
+	spinlock_t *ptl;
+	int count = 0;
+	int i;
+
+	if (!vma || !vma->vm_file ||
+	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
+		return;
+
+	/*
+	 * This vm_flags may not have VM_HUGEPAGE if the page was not
+	 * collapsed by this mm. But we can still collapse if the page is
+	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
+	 * will not fail the vma for missing VM_HUGEPAGE
+	 */
+	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
+		return;
+
+	pmd = mm_find_pmd(mm, haddr);
+	if (!pmd)
+		return;
+
+	/* step 1: check all mapped PTEs are to the right huge page */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+		struct page *page;
+
+		if (pte_none(*pte) || !pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+
+		if (!page || !PageCompound(page))
+			return;
+
+		if (!hpage) {
+			hpage = compound_head(page);
+			/*
+			 * The mapping of the THP should not change.
+			 *
+			 * Note that uprobe may change the page table, but
+			 * the new page installed by uprobe will not pass
+			 * PageCompound() check.
+			 */
+			if (VM_WARN_ON(hpage->mapping != vma->vm_file->f_mapping))
+				return;
+		}
+
+		/*
+		 * Confirm the page maps to the correct subpage.
+		 *
+		 * Note that uprobe may change the page table, but the new
+		 * page installed by uprobe will not pass PageCompound()
+		 * check.
+		 */
+		if (VM_WARN_ON(hpage + i != page))
+			return;
+		count++;
+	}
+
+	/* step 2: adjust rmap */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+		struct page *page;
+
+		if (pte_none(*pte))
+			continue;
+		page = vm_normal_page(vma, addr, *pte);
+		page_remove_rmap(page, false);
+	}
+
+	/* step 3: set proper refcount and mm_counters. */
+	if (hpage) {
+		page_ref_sub(hpage, count);
+		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	}
+
+	/* step 4: collapse pmd */
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	_pmd = pmdp_collapse_flush(vma, addr, pmd);
+	spin_unlock(ptl);
+	mm_dec_nr_ptes(mm);
+	pte_free(mm, pmd_pgtable(_pmd));
+}
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+	int i;
+
+	if (likely(mm_slot->nr_pte_mapped_thp == 0))
+		return 0;
+
+	if (!down_write_trylock(&mm->mmap_sem))
+		return -EBUSY;
+
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
+		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
+
+out:
+	mm_slot->nr_pte_mapped_thp = 0;
+	up_write(&mm->mmap_sem);
+	return 0;
+}
+
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -1287,7 +1432,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			up_write(&vma->vm_mm->mmap_sem);
 			mm_dec_nr_ptes(vma->vm_mm);
 			pte_free(vma->vm_mm, pmd_pgtable(_pmd));
-		}
+		} else
+			khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
 	}
 	i_mmap_unlock_write(mapping);
 }
@@ -1709,6 +1855,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 {
 	BUILD_BUG();
 }
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+	return 0;
+}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
@@ -1733,6 +1884,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
+	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = mm_slot->mm;
 	/*
-- 
2.17.1





^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-09 18:01           ` Song Liu
@ 2019-08-12 12:11             ` Kirill A. Shutemov
  2019-08-12 13:22               ` Oleg Nesterov
  2019-08-12 13:06             ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-08-12 12:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Oleg Nesterov, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> +		if (pte_none(*pte) || !pte_present(*pte))
> +			continue;

You don't need to check both. Present is never none.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-09 18:01           ` Song Liu
  2019-08-12 12:11             ` Kirill A. Shutemov
@ 2019-08-12 13:06             ` Oleg Nesterov
  2019-08-12 14:36               ` Song Liu
  1 sibling, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-12 13:06 UTC (permalink / raw)
  To: Song Liu
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/09, Song Liu wrote:
>
> +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned long haddr = addr & HPAGE_PMD_MASK;
> +	struct vm_area_struct *vma = find_vma(mm, haddr);
> +	struct page *hpage = NULL;
> +	pmd_t *pmd, _pmd;
> +	spinlock_t *ptl;
> +	int count = 0;
> +	int i;
> +
> +	if (!vma || !vma->vm_file ||
> +	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
> +		return;
> +
> +	/*
> +	 * This vm_flags may not have VM_HUGEPAGE if the page was not
> +	 * collapsed by this mm. But we can still collapse if the page is
> +	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
> +	 * will not fail the vma for missing VM_HUGEPAGE
> +	 */
> +	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
> +		return;
> +
> +	pmd = mm_find_pmd(mm, haddr);
> +	if (!pmd)
> +		return;
> +
> +	/* step 1: check all mapped PTEs are to the right huge page */
> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> +		pte_t *pte = pte_offset_map(pmd, addr);
> +		struct page *page;
> +
> +		if (pte_none(*pte) || !pte_present(*pte))
> +			continue;

		if (!pte_present(*pte))
			return;

you can't simply flush pmd if this page is swapped out.

> +
> +		page = vm_normal_page(vma, addr, *pte);
> +
> +		if (!page || !PageCompound(page))
> +			return;
> +
> +		if (!hpage) {
> +			hpage = compound_head(page);
> +			/*
> +			 * The mapping of the THP should not change.
> +			 *
> +			 * Note that uprobe may change the page table,

Not only uprobe can cow the page. A debugger can do it too, or mmap(PROT_WRITE, MAP_PRIVATE).

uprobe() is "special" because a) it works with a foreign mm and b)
it can't stop the process which uses this mm. Otherwise it could simply
update the page returned by get_user_pages_remote(FOLL_FORCE); we would
just need to add FOLL_WRITE, and if we did that we would not even need
SPLIT, which is why, say, __access_remote_vm() works without SPLIT.

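For illustration, a minimal sketch of the FOLL_FORCE | FOLL_WRITE
remote-write path described above, loosely modeled on __access_remote_vm().
The helper name is made up, error handling is simplified, and it is not
part of the posted series:

/* hypothetical sketch, not in any posted patch */
static int write_insn_via_gup(struct mm_struct *mm, unsigned long vaddr,
			      const void *insn, int len)
{
	struct vm_area_struct *vma;
	struct page *page;
	void *kaddr;
	int ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
				    FOLL_FORCE | FOLL_WRITE,
				    &page, &vma, NULL);
	if (ret <= 0) {
		up_read(&mm->mmap_sem);
		return ret ? ret : -EFAULT;
	}

	kaddr = kmap(page);
	/* write the new bytes into the (possibly COWed) page */
	copy_to_user_page(vma, page, vaddr,
			  kaddr + (vaddr & ~PAGE_MASK), insn, len);
	set_page_dirty_lock(page);
	kunmap(page);
	put_page(page);
	up_read(&mm->mmap_sem);
	return 0;
}
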
Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 12:11             ` Kirill A. Shutemov
@ 2019-08-12 13:22               ` Oleg Nesterov
  2019-08-12 14:40                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-12 13:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/12, Kirill A. Shutemov wrote:
>
> On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> > +		if (pte_none(*pte) || !pte_present(*pte))
> > +			continue;
>
> You don't need to check both. Present is never none.

Agreed.

Kirill, while you are here, shouldn't retract_page_tables() check
vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?

Can't it race with, say, do_cow_fault?

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 13:06             ` Oleg Nesterov
@ 2019-08-12 14:36               ` Song Liu
  0 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-12 14:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar



> On Aug 12, 2019, at 6:06 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> On 08/09, Song Liu wrote:
>> 
>> +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>> +{
>> +	unsigned long haddr = addr & HPAGE_PMD_MASK;
>> +	struct vm_area_struct *vma = find_vma(mm, haddr);
>> +	struct page *hpage = NULL;
>> +	pmd_t *pmd, _pmd;
>> +	spinlock_t *ptl;
>> +	int count = 0;
>> +	int i;
>> +
>> +	if (!vma || !vma->vm_file ||
>> +	    vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
>> +		return;
>> +
>> +	/*
>> +	 * This vm_flags may not have VM_HUGEPAGE if the page was not
>> +	 * collapsed by this mm. But we can still collapse if the page is
>> +	 * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
>> +	 * will not fail the vma for missing VM_HUGEPAGE
>> +	 */
>> +	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
>> +		return;
>> +
>> +	pmd = mm_find_pmd(mm, haddr);
>> +	if (!pmd)
>> +		return;
>> +
>> +	/* step 1: check all mapped PTEs are to the right huge page */
>> +	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>> +		pte_t *pte = pte_offset_map(pmd, addr);
>> +		struct page *page;
>> +
>> +		if (pte_none(*pte) || !pte_present(*pte))
>> +			continue;
> 
> 		if (!pte_present(*pte))
> 			return;
> 
> you can't simply flush pmd if this page is swapped out.

hmm... how about
		if (pte_none(*pte))
			continue;

		if (!pte_present(*pte))
			return;

If the page hasn't faulted in for this mm, i.e. pte_none(), we
can flush the pmd. 

> 
>> +
>> +		page = vm_normal_page(vma, addr, *pte);
>> +
>> +		if (!page || !PageCompound(page))
>> +			return;
>> +
>> +		if (!hpage) {
>> +			hpage = compound_head(page);
>> +			/*
>> +			 * The mapping of the THP should not change.
>> +			 *
>> +			 * Note that uprobe may change the page table,
> 
> Not only uprobe can cow the page. A debugger can do it too, or mmap(PROT_WRITE, MAP_PRIVATE).
> 
> uprobe() is "special" because a) it works with a foreign mm and b)
> it can't stop the process which uses this mm. Otherwise it could simply
> update the page returned by get_user_pages_remote(FOLL_FORCE); we would
> just need to add FOLL_WRITE, and if we did that we would not even need
> SPLIT, which is why, say, __access_remote_vm() works without SPLIT.

Will update the comment in next version. 

Thanks!
Song



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 13:22               ` Oleg Nesterov
@ 2019-08-12 14:40                 ` Kirill A. Shutemov
  2019-08-12 21:04                   ` Song Liu
  2019-08-13 13:30                   ` Oleg Nesterov
  0 siblings, 2 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-08-12 14:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
> On 08/12, Kirill A. Shutemov wrote:
> >
> > On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> > > +		if (pte_none(*pte) || !pte_present(*pte))
> > > +			continue;
> >
> > You don't need to check both. Present is never none.
> 
> Agreed.
> 
> Kirill, while you are here, shouldn't retract_page_tables() check
> vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
> 
> Can't it race with, say, do_cow_fault?

vma->anon_vma can race, but it doesn't matter. False-negative is fine.
It's attempt to avoid taking mmap_sem where it can be not productive.

mm_find_pmd() cannot race with do_cow_fault() since the page is locked.
__do_fault() has to return locked page before we touch page tables.
It is somewhat subtle, but I wanted to avoid taking mmap_sem where it is
possible.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 14:40                 ` Kirill A. Shutemov
@ 2019-08-12 21:04                   ` Song Liu
  2019-08-13 14:44                     ` Song Liu
  2019-08-13 13:30                   ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-12 21:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Oleg Nesterov, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar, Johannes Weiner



> On Aug 12, 2019, at 7:40 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
>> On 08/12, Kirill A. Shutemov wrote:
>>> 
>>> On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
>>>> +		if (pte_none(*pte) || !pte_present(*pte))
>>>> +			continue;
>>> 
>>> You don't need to check both. Present is never none.
>> 
>> Agreed.
>> 
>> Kirill, while you are here, shouldn't retract_page_tables() check
>> vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
>> 
>> Can't it race with, say, do_cow_fault?
> 
> vma->anon_vma can race, but it doesn't matter. False-negative is fine.
> It's attempt to avoid taking mmap_sem where it can be not productive.
> 
> mm_find_pmd() cannot race with do_cow_fault() since the page is locked.
> __do_fault() has to return locked page before we touch page tables.
> It is somewhat subtle, but I wanted to avoid taking mmap_sem where it is
> possible.
> 
> -- 
> Kirill A. Shutemov

Updated version attached. 


Besides feedback from Oleg and Kirill, I also revised the locking in 
collapse_pte_mapped_thp(): use pte_offset_map_lock() for the two loops 
to cover highmem. zap_pte_range() has similar use of the lock. 

This change is suggested by Johannes. 

Thanks,
Song

================ 8< ======================
From 3d931bc4780abb6109fe478a4b1a0004ce81efe1 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Sun, 28 Jul 2019 03:43:48 -0700
Subject: [PATCH 5/6] khugepaged: enable collapse pmd for pte-mapped THP

khugepaged needs exclusive mmap_sem to access the page table. When it
fails to lock mmap_sem, the page will fault in as a pte-mapped THP. As
the page is already a THP, khugepaged will not handle this pmd again.

This patch enables khugepaged to retry collapsing the page table.

struct mm_slot (in khugepaged.c) is extended with an array containing
addresses of pte-mapped THPs. We use an array here for simplicity; it
can easily be replaced with a more advanced data structure when needed.

In khugepaged_scan_mm_slot(), if the mm contains a pte-mapped THP, we
try to collapse the page table.

Since the collapse may happen at a later time, some pages may already
have faulted in. collapse_pte_mapped_thp() is added to properly handle
these pages. It also double-checks that all the ptes in this pmd map to
the same THP. This is necessary because a subpage of the THP may have
been replaced, for example by uprobe. In such cases, it is not possible
to collapse the pmd.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/khugepaged.h |  12 +++
 mm/khugepaged.c            | 168 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 082d1d2a5216..bc45ea1efbf7 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,14 @@ extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
                                      unsigned long vm_flags);
+#ifdef CONFIG_SHMEM
+extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
+#else
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+                                          unsigned long addr)
+{
+}
+#endif

 #define khugepaged_enabled()                                          \
        (transparent_hugepage_flags &                                  \
@@ -73,6 +81,10 @@ static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 {
        return 0;
 }
+static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
+                                          unsigned long addr)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40c25ddf29e4..cea0fbf2d7b9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -77,6 +77,8 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

 static struct kmem_cache *mm_slot_cache __read_mostly;

+#define MAX_PTE_MAPPED_THP 8
+
 /**
  * struct mm_slot - hash lookup from mm to mm_slot
  * @hash: hash collision list
@@ -87,6 +89,10 @@ struct mm_slot {
        struct hlist_node hash;
        struct list_head mm_node;
        struct mm_struct *mm;
+
+       /* pte-mapped THP in this mm */
+       int nr_pte_mapped_thp;
+       unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };

 /**
@@ -1254,6 +1260,159 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 }

 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
+/*
+ * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
+ * khugepaged should try to collapse the page table.
+ */
+static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
+                                        unsigned long addr)
+{
+       struct mm_slot *mm_slot;
+
+       VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+
+       spin_lock(&khugepaged_mm_lock);
+       mm_slot = get_mm_slot(mm);
+       if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP))
+               mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
+       spin_unlock(&khugepaged_mm_lock);
+       return 0;
+}
+
+/**
+ * Try to collapse a pte-mapped THP for mm at address haddr.
+ *
+ * This function checks whether all the PTEs in the PMD are pointing to the
+ * right THP. If so, retract the page table so the THP can refault in
+ * as pmd-mapped.
+ */
+void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
+{
+       unsigned long haddr = addr & HPAGE_PMD_MASK;
+       struct vm_area_struct *vma = find_vma(mm, haddr);
+       struct page *hpage = NULL;
+       pte_t *start_pte, *pte;
+       pmd_t *pmd, _pmd;
+       spinlock_t *ptl;
+       int count = 0;
+       int i;
+
+       if (!vma || !vma->vm_file ||
+           vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE)
+               return;
+
+       /*
+        * This vm_flags may not have VM_HUGEPAGE if the page was not
+        * collapsed by this mm. But we can still collapse if the page is
+        * the valid THP. Add extra VM_HUGEPAGE so hugepage_vma_check()
+        * will not fail the vma for missing VM_HUGEPAGE
+        */
+       if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
+               return;
+
+       pmd = mm_find_pmd(mm, haddr);
+       if (!pmd)
+               return;
+
+       start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+
+       /* step 1: check all mapped PTEs are to the right huge page */
+       for (i = 0, addr = haddr, pte = start_pte;
+            i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+               struct page *page;
+
+               /* empty pte, skip */
+               if (pte_none(*pte))
+                       continue;
+
+               /* page swapped out, abort */
+               if (!pte_present(*pte))
+                       goto abort;
+
+               page = vm_normal_page(vma, addr, *pte);
+
+               if (!page || !PageCompound(page))
+                       goto abort;
+
+               if (!hpage) {
+                       hpage = compound_head(page);
+                       /*
+                        * The mapping of the THP should not change.
+                        *
+                        * Note that uprobe, debugger, or MAP_PRIVATE may
+                        * change the page table, but the new page will
+                        * not pass PageCompound() check.
+                        */
+                       if (WARN_ON(hpage->mapping != vma->vm_file->f_mapping))
+                               goto abort;
+               }
+
+               /*
+                * Confirm the page maps to the correct subpage.
+                *
+                * Note that uprobe, debugger, or MAP_PRIVATE may change
+                * the page table, but the new page will not pass
+                * PageCompound() check.
+                */
+               if (WARN_ON(hpage + i != page))
+                       goto abort;
+               count++;
+       }
+
+       /* step 2: adjust rmap */
+       for (i = 0, addr = haddr, pte = start_pte;
+            i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+               struct page *page;
+
+               if (pte_none(*pte))
+                       continue;
+               page = vm_normal_page(vma, addr, *pte);
+               page_remove_rmap(page, false);
+       }
+
+       pte_unmap_unlock(start_pte, ptl);
+
+       /* step 3: set proper refcount and mm_counters. */
+       if (hpage) {
+               page_ref_sub(hpage, count);
+               add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+       }
+
+       /* step 4: collapse pmd */
+       ptl = pmd_lock(vma->vm_mm, pmd);
+       _pmd = pmdp_collapse_flush(vma, haddr, pmd);
+       spin_unlock(ptl);
+       mm_dec_nr_ptes(mm);
+       pte_free(mm, pmd_pgtable(_pmd));
+       return;
+
+abort:
+       pte_unmap_unlock(start_pte, ptl);
+}
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+       struct mm_struct *mm = mm_slot->mm;
+       int i;
+
+       if (likely(mm_slot->nr_pte_mapped_thp == 0))
+               return 0;
+
+       if (!down_write_trylock(&mm->mmap_sem))
+               return -EBUSY;
+
+       if (unlikely(khugepaged_test_exit(mm)))
+               goto out;
+
+       for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
+               collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i]);
+
+out:
+       mm_slot->nr_pte_mapped_thp = 0;
+       up_write(&mm->mmap_sem);
+       return 0;
+}
+
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
        struct vm_area_struct *vma;
@@ -1287,7 +1446,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
                        up_write(&vma->vm_mm->mmap_sem);
                        mm_dec_nr_ptes(vma->vm_mm);
                        pte_free(vma->vm_mm, pmd_pgtable(_pmd));
-               }
+               } else
+                       khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
        }
        i_mmap_unlock_write(mapping);
 }
@@ -1709,6 +1869,11 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 {
        BUILD_BUG();
 }
+
+static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
+{
+       return 0;
+}
 #endif

 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
@@ -1733,6 +1898,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
                khugepaged_scan.mm_slot = mm_slot;
        }
        spin_unlock(&khugepaged_mm_lock);
+       khugepaged_collapse_pte_mapped_thps(mm_slot);

        mm = mm_slot->mm;
        /*
--
2.17.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 14:40                 ` Kirill A. Shutemov
  2019-08-12 21:04                   ` Song Liu
@ 2019-08-13 13:30                   ` Oleg Nesterov
  2019-08-13 14:05                     ` Oleg Nesterov
  1 sibling, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-13 13:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/12, Kirill A. Shutemov wrote:
>
> On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
> > On 08/12, Kirill A. Shutemov wrote:
> > >
> > > On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> > > > +		if (pte_none(*pte) || !pte_present(*pte))
> > > > +			continue;
> > >
> > > You don't need to check both. Present is never none.
> >
> > Agreed.
> >
> > Kirill, while you are here, shouldn't retract_page_tables() check
> > vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
> >
> > Can't it race with, say, do_cow_fault?
>
> vma->anon_vma can race, but it doesn't matter. False-negative is fine.
> It's attempt to avoid taking mmap_sem where it can be not productive.

I guess I misunderstood the purpose of this check or your answer...

Let me reword my question. Why can retract_page_tables() safely do
pmdp_collapse_flush(vma) without additional checks similar to what
collapse_pte_mapped_thp() does?

I thought that retract_page_tables() checks vma->anon_vma to ensure that
this vma doesn't have a cow'ed PageAnon() page. And I still can't understand
why can't it race with __handle_mm_fault() paths.

Suppose that shmem_file was mmaped with PROT_READ|WRITE, MAP_PRIVATE.
To simplify, suppose that a non-THP page was already faulted in,
pte_present() == T.

Userspace writes to this page.

Why __handle_mm_fault()->handle_pte_fault()->do_wp_page()->wp_page_copy()
can not cow this page and update pte after the vma->anon_vma check and
before down_write_trylock(mmap_sem) ?

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-13 13:30                   ` Oleg Nesterov
@ 2019-08-13 14:05                     ` Oleg Nesterov
  2019-08-13 15:05                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-13 14:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/13, Oleg Nesterov wrote:
>
> On 08/12, Kirill A. Shutemov wrote:
> >
> > On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
> > > On 08/12, Kirill A. Shutemov wrote:
> > > >
> > > > On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> > > > > +		if (pte_none(*pte) || !pte_present(*pte))
> > > > > +			continue;
> > > >
> > > > You don't need to check both. Present is never none.
> > >
> > > Agreed.
> > >
> > > Kirill, while you are here, shouldn't retract_page_tables() check
> > > vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
> > >
> > > Can't it race with, say, do_cow_fault?
> >
> > vma->anon_vma can race, but it doesn't matter. False-negative is fine.
> > It's attempt to avoid taking mmap_sem where it can be not productive.
>
> I guess I misunderstood the purpose of this check or your answer...
>
> Let me reword my question. Why can retract_page_tables() safely do
> pmdp_collapse_flush(vma) without additional checks similar to what
> collapse_pte_mapped_thp() does?
>
> I thought that retract_page_tables() checks vma->anon_vma to ensure that
> this vma doesn't have a cow'ed PageAnon() page. And I still can't understand
> why can't it race with __handle_mm_fault() paths.
>
> Suppose that shmem_file was mmaped with PROT_READ|WRITE, MAP_PRIVATE.
> To simplify, suppose that a non-THP page was already faulted in,
> pte_present() == T.
>
> Userspace writes to this page.
>
> Why __handle_mm_fault()->handle_pte_fault()->do_wp_page()->wp_page_copy()
> can not cow this page and update pte after the vma->anon_vma check and
> before down_write_trylock(mmap_sem) ?

OK, probably this is impossible, collapse_shmem() does unmap_mapping_pages(),
so handle_pte_fault() will call shmem_fault() which iiuc should block in
find_lock_entry() because new_page is locked, and thus down_write_trylock()
can't succeed.

Nevermind, I am sure I missed something. Perhaps you can update the comments
to make this more clear.

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-12 21:04                   ` Song Liu
@ 2019-08-13 14:44                     ` Song Liu
  2019-08-15 10:16                       ` Oleg Nesterov
  0 siblings, 1 reply; 30+ messages in thread
From: Song Liu @ 2019-08-13 14:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar, Johannes Weiner, Kirill A. Shutemov

Hi Oleg,

> On Aug 12, 2019, at 2:04 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Aug 12, 2019, at 7:40 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> 
>> On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
>>> On 08/12, Kirill A. Shutemov wrote:
>>>> 
>>>> On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
>>>>> +		if (pte_none(*pte) || !pte_present(*pte))
>>>>> +			continue;
>>>> 
>>>> You don't need to check both. Present is never none.
>>> 
>>> Agreed.
>>> 
>>> Kirill, while you are here, shouldn't retract_page_tables() check
>>> vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
>>> 
>>> Can't it race with, say, do_cow_fault?
>> 
>> vma->anon_vma can race, but it doesn't matter. False-negative is fine.
>> It's attempt to avoid taking mmap_sem where it can be not productive.
>> 
>> mm_find_pmd() cannot race with do_cow_fault() since the page is locked.
>> __do_fault() has to return locked page before we touch page tables.
>> It is somewhat subtle, but I wanted to avoid taking mmap_sem where it is
>> possible.
>> 
>> -- 
>> Kirill A. Shutemov
> 
> Updated version attached. 
> 
> 
> Besides feedback from Oleg and Kirill, I also revised the locking in 
> collapse_pte_mapped_thp(): use pte_offset_map_lock() for the two loops 
> to cover highmem. zap_pte_range() has similar use of the lock. 
> 
> This change is suggested by Johannes. 
> 

Do you have further comments for the version below? If not, could you
please reply with your Acked-by or Reviewed-by?

Thanks,
Song


> 
> ================ 8< ======================
> From 3d931bc4780abb6109fe478a4b1a0004ce81efe1 Mon Sep 17 00:00:00 2001
> From: Song Liu <songliubraving@fb.com>
> Date: Sun, 28 Jul 2019 03:43:48 -0700
> Subject: [PATCH 5/6] khugepaged: enable collapse pmd for pte-mapped THP
> 

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-13 14:05                     ` Oleg Nesterov
@ 2019-08-13 15:05                       ` Kirill A. Shutemov
  2019-08-13 16:24                         ` Oleg Nesterov
  0 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-08-13 15:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On Tue, Aug 13, 2019 at 04:05:53PM +0200, Oleg Nesterov wrote:
> On 08/13, Oleg Nesterov wrote:
> >
> > On 08/12, Kirill A. Shutemov wrote:
> > >
> > > On Mon, Aug 12, 2019 at 03:22:58PM +0200, Oleg Nesterov wrote:
> > > > On 08/12, Kirill A. Shutemov wrote:
> > > > >
> > > > > On Fri, Aug 09, 2019 at 06:01:18PM +0000, Song Liu wrote:
> > > > > > +		if (pte_none(*pte) || !pte_present(*pte))
> > > > > > +			continue;
> > > > >
> > > > > You don't need to check both. Present is never none.
> > > >
> > > > Agreed.
> > > >
> > > > Kirill, while you are here, shouldn't retract_page_tables() check
> > > > vma->anon_vma (and probably do mm_find_pmd) under vm_mm->mmap_sem?
> > > >
> > > > Can't it race with, say, do_cow_fault?
> > >
> > > vma->anon_vma can race, but it doesn't matter. False-negative is fine.
> > > It's attempt to avoid taking mmap_sem where it can be not productive.
> >
> > I guess I misunderstood the purpose of this check or your answer...
> >
> > Let me reword my question. Why can retract_page_tables() safely do
> > pmdp_collapse_flush(vma) without additional checks similar to what
> > collapse_pte_mapped_thp() does?
> >
> > I thought that retract_page_tables() checks vma->anon_vma to ensure that
> > this vma doesn't have a cow'ed PageAnon() page. And I still can't understand
> > why can't it race with __handle_mm_fault() paths.

vma->anon_vma check is a cheap way to exclude MAP_PRIVATE mappings that
got written from userspace. My thinking was that these VMAs are not worth
investing a down_write(mmap_sem) in, as the PMD-mapping is likely to be
split later. (It's totally made-up reasoning, I don't have numbers to back
it up.)

vma->anon_vma can be set up after the check but before taking mmap_sem.
But page lock would prevent establishing any new ptes of the page, so we
are safe.

An alternative would be to drop the check, but then check that the page
table is clear before calling pmdp_collapse_flush() under ptl. It has a
higher chance of recovering the THP for the VMA, but has a higher cost too.

I don't know which way is better, so I've chosen the one that is easier to
implement.

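A rough sketch of that alternative, for illustration only (the helper is
hypothetical, not part of the posted series, and assumes it is called from
retract_page_tables() with mmap_sem held for write, right before
pmdp_collapse_flush()):

/* hypothetical helper: collapse_shmem() should have zapped all these ptes */
static bool pmd_ptes_are_clear(pmd_t *pmd, unsigned long haddr)
{
	pte_t *start_pte = pte_offset_map(pmd, haddr);
	bool clear = true;
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		if (!pte_none(start_pte[i])) {
			clear = false;
			break;
		}
	}
	pte_unmap(start_pte);
	return clear;
}
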
> >
> > Suppose that shmem_file was mmaped with PROT_READ|WRITE, MAP_PRIVATE.
> > To simplify, suppose that a non-THP page was already faulted in,
> > pte_present() == T.
> >
> > Userspace writes to this page.
> >
> > Why __handle_mm_fault()->handle_pte_fault()->do_wp_page()->wp_page_copy()
> > can not cow this page and update pte after the vma->anon_vma check and
> > before down_write_trylock(mmap_sem) ?
> 
> OK, probably this is impossible, collapse_shmem() does unmap_mapping_pages(),
> so handle_pte_fault() will call shmem_fault() which iiuc should block in
> find_lock_entry() because new_page is locked, and thus down_write_trylock()
> can't succeed.

You've got it right.

> Nevermind, I am sure I missed something. Perhaps you can update the comments
> to make this more clear.

Let me see first that my explanation makes sense :P

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-13 15:05                       ` Kirill A. Shutemov
@ 2019-08-13 16:24                         ` Oleg Nesterov
  2019-08-16 14:54                           ` Kirill A. Shutemov
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-13 16:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On 08/13, Kirill A. Shutemov wrote:
>
> On Tue, Aug 13, 2019 at 04:05:53PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that retract_page_tables() checks vma->anon_vma to ensure that
> > > this vma doesn't have a cow'ed PageAnon() page. And I still can't understand
> > > why can't it race with __handle_mm_fault() paths.
>
> vma->anon_vma check is a cheap way to exclude MAP_PRIVATE mappings that
> got written from userspace.

Yes, and this is how I understood it from the very beginning, but then
I was confused.

> vma->anon_vma can be set up after the check but before taking mmap_sem.
> But page lock would prevent establishing any new ptes of the page, so we
> are safe.

And this is what was not clear to me until I noticed unmap_mapping_pages()
in collapse_shmem().

Plus I was confused by the comment above down_write_trylock(mmap_sem).
To me it looks as if we _could_ do down_write(), but we do not want to
disturb the system.

But iiuc we simply can't do down_write(), exactly because handle_mm_fault()
can wait for page lock with mmap_sem held.

> > > Suppose that shmem_file was mmaped with PROT_READ|WRITE, MAP_PRIVATE.
> > > To simplify, suppose that a non-THP page was already faulted in,
> > > pte_present() == T.
> > >
> > > Userspace writes to this page.
> > >
> > > Why __handle_mm_fault()->handle_pte_fault()->do_wp_page()->wp_page_copy()
> > > can not cow this page and update pte after the vma->anon_vma check and
> > > before down_write_trylock(mmap_sem) ?
> >
> > OK, probably this is impossible, collapse_shmem() does unmap_mapping_pages(),
> > so handle_pte_fault() will call shmem_fault() which iiuc should block in
> > find_lock_entry() because new_page is locked, and thus down_write_trylock()
> > can't succeed.
>
> You've got it right.

Great, thanks.

> > Nevermind, I am sure I missed something. Perhaps you can update the comments
> > to make this more clear.
>
> Let me see first that my explanation makes sense :P

It does ;)

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-13 14:44                     ` Song Liu
@ 2019-08-15 10:16                       ` Oleg Nesterov
  2019-08-15 16:27                         ` Song Liu
  0 siblings, 1 reply; 30+ messages in thread
From: Oleg Nesterov @ 2019-08-15 10:16 UTC (permalink / raw)
  To: Song Liu
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar, Johannes Weiner, Kirill A. Shutemov

Hi Song,

sorry, I forgot to reply to this email,

On 08/13, Song Liu wrote:
>
> Do you have further comments for the version below? If not, could you
> please reply with your Acked-by or Reviewed-by?

I see nothing wrong in the last series, no objections from me.

I don't think I can ack the changes in this area, but feel free to
add my Reviewed-by.

Oleg.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-15 10:16                       ` Oleg Nesterov
@ 2019-08-15 16:27                         ` Song Liu
  0 siblings, 0 replies; 30+ messages in thread
From: Song Liu @ 2019-08-15 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar, Johannes Weiner, Kirill A. Shutemov



> On Aug 15, 2019, at 3:16 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> 
> Hi Song,
> 
> sorry, I forgot to reply to this email,
> 
> On 08/13, Song Liu wrote:
>> 
>> Do you have further comments for the version below? If not, could you
>> please reply with your Acked-by or Reviewed-by?
> 
> I see nothing wrong in the last series, no objections from me.
> 
> I don't think I can ack the changes in this area, but feel free to
> add my Reviewed-by.
> 
> Oleg.
> 

Thanks Oleg!

Will resend latest version shortly. 

Song


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP
  2019-08-13 16:24                         ` Oleg Nesterov
@ 2019-08-16 14:54                           ` Kirill A. Shutemov
  0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-08-16 14:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Song Liu, Linux Kernel Mailing List, Linux MM, Andrew Morton,
	Matthew Wilcox, Kirill A. Shutemov, Kernel Team,
	William Kucharski, srikar

On Tue, Aug 13, 2019 at 06:24:51PM +0200, Oleg Nesterov wrote:
> > Let me see first that my explanation makes sense :P
> 
> It does ;)

Does it look fine to you? It's on top of Song's patchset.

From 58834d6c1e63321af742b208558a6b5cb86fc7ec Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 16 Aug 2019 17:50:41 +0300
Subject: [PATCH] khugepaged: Add comments for retract_page_tables()

Oleg Nesterov pointed out that the logic behind the checks in
retract_page_tables() is not obvious.

Add comments to clarify the reasoning for the checks and why they are
safe.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/khugepaged.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5c0a5f0826b2..00cec6a127aa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1421,7 +1421,22 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		/* probably overkill */
+		/*
+		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
+		 * got written to. These VMAs are likely not worth investing
+		 * a down_write(mmap_sem) in, as the PMD-mapping is likely
+		 * to be split later anyway.
+		 *
+		 * Note that the vma->anon_vma check is racy: it can be set
+		 * up by the fault path after the check but before we take
+		 * mmap_sem. But the page lock would prevent establishing
+		 * any new ptes of the page, so we are safe.
+		 *
+		 * An alternative would be to drop the check, but then check
+		 * that the page table is clear before calling
+		 * pmdp_collapse_flush() under ptl. It has a higher chance
+		 * to recover the THP for the VMA, but has a higher cost too.
+		 */
 		if (vma->anon_vma)
 			continue;
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
@@ -1434,9 +1449,10 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			continue;
 		/*
 		 * We need exclusive mmap_sem to retract page table.
-		 * If trylock fails we would end up with pte-mapped THP after
-		 * re-fault. Not ideal, but it's more important to not disturb
-		 * the system too much.
+		 *
+		 * We use trylock due to lock inversion: we need to acquire
+		 * mmap_sem while holding page lock. Fault path does it in
+		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
 			spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
@@ -1446,8 +1462,10 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			up_write(&vma->vm_mm->mmap_sem);
 			mm_dec_nr_ptes(vma->vm_mm);
 			pte_free(vma->vm_mm, pmd_pgtable(_pmd));
-		} else
+		} else {
+			/* Try again later */
 			khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
+		}
 	}
 	i_mmap_unlock_write(mapping);
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-08-16 14:54 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-07 23:37 [PATCH v12 0/6] THP aware uprobe Song Liu
2019-08-07 23:37 ` [PATCH v12 1/6] mm: move memcmp_pages() and pages_identical() Song Liu
2019-08-07 23:37 ` [PATCH v12 2/6] uprobe: use original page when all uprobes are removed Song Liu
2019-08-07 23:37 ` [PATCH v12 3/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
2019-08-08 16:37   ` Oleg Nesterov
2019-08-08 17:16     ` Song Liu
2019-08-09 16:35       ` Oleg Nesterov
2019-08-09 16:50         ` Song Liu
2019-08-07 23:37 ` [PATCH v12 4/6] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
2019-08-07 23:37 ` [PATCH v12 5/6] khugepaged: enable collapse pmd for pte-mapped THP Song Liu
2019-08-08 16:33   ` Oleg Nesterov
2019-08-08 17:05     ` Song Liu
2019-08-09 15:24       ` Oleg Nesterov
2019-08-09 16:30         ` Song Liu
2019-08-09 18:01           ` Song Liu
2019-08-12 12:11             ` Kirill A. Shutemov
2019-08-12 13:22               ` Oleg Nesterov
2019-08-12 14:40                 ` Kirill A. Shutemov
2019-08-12 21:04                   ` Song Liu
2019-08-13 14:44                     ` Song Liu
2019-08-15 10:16                       ` Oleg Nesterov
2019-08-15 16:27                         ` Song Liu
2019-08-13 13:30                   ` Oleg Nesterov
2019-08-13 14:05                     ` Oleg Nesterov
2019-08-13 15:05                       ` Kirill A. Shutemov
2019-08-13 16:24                         ` Oleg Nesterov
2019-08-16 14:54                           ` Kirill A. Shutemov
2019-08-12 13:06             ` Oleg Nesterov
2019-08-12 14:36               ` Song Liu
2019-08-07 23:37 ` [PATCH v12 6/6] uprobe: collapse THP pmd after removing all uprobes Song Liu
