linux-mm.kvack.org archive mirror
* [PATCH v3 0/5] THP aware uprobe
@ 2019-06-12 22:03 Song Liu
  2019-06-12 22:03 ` [PATCH v3 1/5] mm: move memcmp_pages() and pages_identical() Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

This set makes uprobe aware of THPs.

Currently, when a uprobe is attached to text on a THP, the page is split
by FOLL_SPLIT. As a result, attaching a uprobe eliminates the performance
benefit of the THP.

This set makes uprobe THP-aware. Instead of FOLL_SPLIT, it introduces
FOLL_SPLIT_PMD, which only splits the PMD for the uprobe while the
underlying huge page stays intact. After all uprobes within the THP are
removed, the PTEs are regrouped into a huge PMD.

Note that, with uprobes attached, the process runs with PTEs for the huge
page. The performance benefit of THP is recovered _after_ all uprobes on
the huge page are detached.

This set (plus a few THP patches) is also available at

   https://github.com/liu-song-6/linux/tree/uprobe-thp

Changes since v2:
1. For FOLL_SPLIT_PMD, populate the page table in follow_pmd_mask().
2. Simplify the logic in uprobe_write_opcode(). (Oleg Nesterov)
3. Fix page refcount handling with FOLL_SPLIT_PMD.
4. Much more testing, together with THP on ext4 and btrfs (to be sent in
   a separate set).
5. Rebased onto Linus's tree:
   commit 35110e38e6c5 ("Merge tag 'media/v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media")

Changes since v1:
1. introduce FOLL_SPLIT_PMD instead of modifying split_huge_pmd*();
2. reuse pages_identical() from ksm.c;
3. rewrite most of try_collapse_huge_pmd().

Song Liu (5):
  mm: move memcmp_pages() and pages_identical()
  uprobe: use original page when all uprobes are removed
  mm, thp: introduce FOLL_SPLIT_PMD
  uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
  uprobe: collapse THP pmd after removing all uprobes

 include/linux/huge_mm.h |  7 +++++
 include/linux/mm.h      |  8 ++++++
 kernel/events/uprobes.c | 54 ++++++++++++++++++++++++++--------
 mm/gup.c                | 38 ++++++++++++++++++++++--
 mm/huge_memory.c        | 64 +++++++++++++++++++++++++++++++++++++++++
 mm/ksm.c                | 18 ------------
 mm/util.c               | 13 +++++++++
 7 files changed, 169 insertions(+), 33 deletions(-)

--
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/5] mm: move memcmp_pages() and pages_identical()
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-12 22:03 ` [PATCH v3 2/5] uprobe: use original page when all uprobes are removed Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

This patch moves memcmp_pages() to mm/util.c and pages_identical() to
include/linux/mm.h, so that other files can use them.
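
As a quick illustration (not part of the patch): with the helpers exported
via <linux/mm.h> and mm/util.c, a hypothetical caller outside of KSM can
compare two pages without open-coding the kmap_atomic()/memcmp() sequence.
The function name below is made up for the example:

#include <linux/mm.h>

static bool text_page_unmodified(struct page *cur, struct page *orig)
{
        /* pages_identical() is non-zero when all PAGE_SIZE bytes match */
        return pages_identical(cur, orig);
}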

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h |  7 +++++++
 mm/ksm.c           | 18 ------------------
 mm/util.c          | 13 +++++++++++++
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd0b5f4e1e45..0ab8c7d84cd0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,5 +2891,12 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+extern int memcmp_pages(struct page *page1, struct page *page2);
+
+static inline int pages_identical(struct page *page1, struct page *page2)
+{
+	return !memcmp_pages(page1, page2);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 81c20ed57bf6..6f153f976c4c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1030,24 +1030,6 @@ static u32 calc_checksum(struct page *page)
 	return checksum;
 }
 
-static int memcmp_pages(struct page *page1, struct page *page2)
-{
-	char *addr1, *addr2;
-	int ret;
-
-	addr1 = kmap_atomic(page1);
-	addr2 = kmap_atomic(page2);
-	ret = memcmp(addr1, addr2, PAGE_SIZE);
-	kunmap_atomic(addr2);
-	kunmap_atomic(addr1);
-	return ret;
-}
-
-static inline int pages_identical(struct page *page1, struct page *page2)
-{
-	return !memcmp_pages(page1, page2);
-}
-
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
diff --git a/mm/util.c b/mm/util.c
index 9834c4ab7d8e..750e586d50bc 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -755,3 +755,16 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
 out:
 	return res;
 }
+
+int memcmp_pages(struct page *page1, struct page *page2)
+{
+	char *addr1, *addr2;
+	int ret;
+
+	addr1 = kmap_atomic(page1);
+	addr2 = kmap_atomic(page2);
+	ret = memcmp(addr1, addr2, PAGE_SIZE);
+	kunmap_atomic(addr2);
+	kunmap_atomic(addr1);
+	return ret;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/5] uprobe: use original page when all uprobes are removed
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
  2019-06-12 22:03 ` [PATCH v3 1/5] mm: move memcmp_pages() and pages_identical() Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-12 22:03 ` [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

Currently, uprobe swaps the target page with an anonymous page in both
install_breakpoint() and remove_breakpoint(). When all uprobes on a page
are removed, the given mm is still using an anonymous page (not the
original file-backed page).

This patch allows uprobe to use the original page when possible, i.e.
when all uprobes on the page have been removed.
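
The core decision, distilled from the diff below into a standalone helper
for readability. The helper name is hypothetical and the authoritative
change is the uprobe_write_opcode() hunk; it reuses the existing
vaddr_to_offset() helper from uprobes.c:

/* Return the page __replace_page() should install when unregistering. */
static struct page *pick_orig_or_copy(struct vm_area_struct *vma,
                                      unsigned long vaddr,
                                      struct page *new_page)
{
        pgoff_t index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT;
        struct page *orig_page;

        orig_page = find_get_page(vma->vm_file->f_inode->i_mapping, index);
        if (!orig_page)
                return new_page;

        if (pages_identical(new_page, orig_page)) {
                /* all breakpoints are gone: map the original file page again */
                put_page(new_page);
                return orig_page;
        }
        put_page(orig_page);
        return new_page;
}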

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 45 +++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 78f61bfc6b79..f7c61a1ef720 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -160,16 +160,19 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	int err;
 	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
+	bool orig = new_page->mapping != NULL;  /* new_page == orig_page */
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
 
 	VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page);
 
-	err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL, &memcg,
-			false);
-	if (err)
-		return err;
+	if (!orig) {
+		err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+					    &memcg, false);
+		if (err)
+			return err;
+	}
 
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(old_page);
@@ -177,15 +180,24 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	mmu_notifier_invalidate_range_start(&range);
 	err = -EAGAIN;
 	if (!page_vma_mapped_walk(&pvmw)) {
-		mem_cgroup_cancel_charge(new_page, memcg, false);
+		if (!orig)
+			mem_cgroup_cancel_charge(new_page, memcg, false);
 		goto unlock;
 	}
 	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
 
 	get_page(new_page);
-	page_add_new_anon_rmap(new_page, vma, addr, false);
-	mem_cgroup_commit_charge(new_page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(new_page, vma);
+	if (orig) {
+		lock_page(new_page);  /* for page_add_file_rmap() */
+		page_add_file_rmap(new_page, false);
+		unlock_page(new_page);
+		inc_mm_counter(mm, mm_counter_file(new_page));
+		dec_mm_counter(mm, MM_ANONPAGES);
+	} else {
+		page_add_new_anon_rmap(new_page, vma, addr, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
+	}
 
 	if (!PageAnon(old_page)) {
 		dec_mm_counter(mm, mm_counter_file(old_page));
@@ -501,6 +513,23 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	copy_highpage(new_page, old_page);
 	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
+	if (!is_register) {
+		struct page *orig_page;
+		pgoff_t index;
+
+		index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT;
+		orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
+					  index);
+
+		if (orig_page) {
+			if (pages_identical(new_page, orig_page)) {
+				put_page(new_page);
+				new_page = orig_page;
+			} else
+				put_page(orig_page);
+		}
+	}
+
 	ret = __replace_page(vma, vaddr, old_page, new_page);
 	put_page(new_page);
 put_old:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
  2019-06-12 22:03 ` [PATCH v3 1/5] mm: move memcmp_pages() and pages_identical() Song Liu
  2019-06-12 22:03 ` [PATCH v3 2/5] uprobe: use original page when all uprobes are removed Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-13 12:57   ` Kirill A. Shutemov
  2019-06-12 22:03 ` [PATCH v3 4/5] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

This patch introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says,
FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct, while the
underlying huge page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
but would like to switch back to the huge page and huge pmd later. One
such example is uprobe. The following patches use FOLL_SPLIT_PMD in
uprobe.
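
A minimal sketch of a prospective caller (this mirrors what the next patch
does in uprobe); it assumes mmap_sem is held for read, abbreviates error
handling, and the function name is made up:

static int split_pmd_keep_page(struct mm_struct *mm, unsigned long vaddr)
{
        struct vm_area_struct *vma;
        struct page *page;
        int ret;

        ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                    FOLL_FORCE | FOLL_SPLIT_PMD,
                                    &page, &vma, NULL);
        if (ret <= 0)
                return ret;

        /*
         * The PMD covering vaddr in this mm is now PTE-mapped, but the
         * underlying compound page is left intact.
         */
        put_page(page);
        return 0;
}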

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ab8c7d84cd0..e605acc4fc81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2642,6 +2642,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf9e4..3d05bddb56c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT) {
+	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
 		page = pmd_page(*pmd);
 		if (is_huge_zero_page(page)) {
@@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			split_huge_pmd(vma, pmd, address);
 			if (pmd_trans_unstable(pmd))
 				ret = -EBUSY;
-		} else {
+		} else if (flags & FOLL_SPLIT) {
 			if (unlikely(!try_get_page(page))) {
 				spin_unlock(ptl);
 				return ERR_PTR(-ENOMEM);
@@ -419,8 +419,40 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			put_page(page);
 			if (pmd_none(*pmd))
 				return no_page_table(vma, flags);
-		}
+		} else {  /* flags & FOLL_SPLIT_PMD */
+			unsigned long addr;
+			pgprot_t prot;
+			pte_t *pte;
+			int i;
+
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, address);
+			lock_page(page);
+			pte = get_locked_pte(mm, address, &ptl);
+			if (!pte) {
+				unlock_page(page);
+				return no_page_table(vma, flags);
+			}
 
+			/* get refcount for every small page */
+			page_ref_add(page, HPAGE_PMD_NR);
+
+			prot = READ_ONCE(vma->vm_page_prot);
+			for (i = 0, addr = address & PMD_MASK;
+			     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+				struct page *p = page + i;
+
+				pte = pte_offset_map(pmd, addr);
+				VM_BUG_ON(!pte_none(*pte));
+				set_pte_at(mm, addr, pte, mk_pte(p, prot));
+				page_add_file_rmap(p, false);
+			}
+
+			spin_unlock(ptl);
+			unlock_page(page);
+			add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
+			ret = 0;
+		}
 		return ret ? ERR_PTR(ret) :
 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 4/5] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
                   ` (2 preceding siblings ...)
  2019-06-12 22:03 ` [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-12 22:03 ` [PATCH v3 5/5] uprobe: collapse THP pmd after removing all uprobes Song Liu
  2019-06-12 22:03 ` [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
  5 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

This patch uses the newly added FOLL_SPLIT_PMD in uprobe. This makes it
easy to regroup the huge pmd after the uprobe is disabled (in the next
patch).

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f7c61a1ef720..a20d7b43a056 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -153,7 +153,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page_vma_mapped_walk pvmw = {
-		.page = old_page,
+		.page = compound_head(old_page),
 		.vma = vma,
 		.address = addr,
 	};
@@ -165,8 +165,6 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
 
-	VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page);
-
 	if (!orig) {
 		err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
 					    &memcg, false);
@@ -483,7 +481,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 retry:
 	/* Read the page with vaddr into memory */
 	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
-			FOLL_FORCE | FOLL_SPLIT, &old_page, &vma, NULL);
+			FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL);
 	if (ret <= 0)
 		return ret;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 5/5] uprobe: collapse THP pmd after removing all uprobes
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
                   ` (3 preceding siblings ...)
  2019-06-12 22:03 ` [PATCH v3 4/5] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-12 22:03 ` [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
  5 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

After all uprobes are removed from the huge page (which is now mapped
with regular PTEs), it is possible to collapse the pmd and benefit from
THP again. This patch does the collapse.

An issue in an earlier version was discovered by the kbuild test robot.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/huge_mm.h |  7 +++++
 kernel/events/uprobes.c |  5 +++-
 mm/huge_memory.c        | 64 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..30669e9a9340 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,6 +250,9 @@ static inline bool thp_migration_supported(void)
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
+extern void try_collapse_huge_pmd(struct vm_area_struct *vma,
+				  struct page *page);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -368,6 +371,10 @@ static inline bool thp_migration_supported(void)
 {
 	return false;
 }
+
+static inline void try_collapse_huge_pmd(struct vm_area_struct *vma,
+					 struct page *page) {}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a20d7b43a056..9bec602bf79e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -474,6 +474,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	struct page *old_page, *new_page;
 	struct vm_area_struct *vma;
 	int ret, is_register, ref_ctr_updated = 0;
+	struct page *orig_page = NULL;
 
 	is_register = is_swbp_insn(&opcode);
 	uprobe = container_of(auprobe, struct uprobe, arch);
@@ -512,7 +513,6 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
 	if (!is_register) {
-		struct page *orig_page;
 		pgoff_t index;
 
 		index = vaddr_to_offset(vma, vaddr & PAGE_MASK) >> PAGE_SHIFT;
@@ -540,6 +540,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret && is_register && ref_ctr_updated)
 		update_ref_ctr(uprobe, mm, -1);
 
+	if (!ret && orig_page && PageTransCompound(orig_page))
+		try_collapse_huge_pmd(vma, orig_page);
+
 	return ret;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..48e951550988 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2886,6 +2886,70 @@ static struct shrinker deferred_split_shrinker = {
 	.flags = SHRINKER_NUMA_AWARE,
 };
 
+/**
+ * try_collapse_huge_pmd - try collapse pmd for a pte mapped huge page
+ * @vma: vma containing the huge page
+ * @page: any sub page of the huge page
+ */
+void try_collapse_huge_pmd(struct vm_area_struct *vma,
+			   struct page *page)
+{
+	struct page *hpage = compound_head(page);
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range;
+	unsigned long haddr;
+	unsigned long addr;
+	pmd_t *pmd, _pmd;
+	spinlock_t *ptl;
+	int i;
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	haddr = page_address_in_vma(hpage, vma);
+	pmd = mm_find_pmd(mm, haddr);
+	if (!pmd)
+		return;
+
+	lock_page(hpage);
+	ptl = pmd_lock(mm, pmd);
+
+	/* step 1: check all mapped PTEs */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+
+		if (hpage + i != vm_normal_page(vma, addr, *pte)) {
+			spin_unlock(ptl);
+			unlock_page(hpage);
+			return;
+		}
+	}
+
+	/* step 2: adjust rmap */
+	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+		pte_t *pte = pte_offset_map(pmd, addr);
+		struct page *p;
+
+		p = vm_normal_page(vma, addr, *pte);
+		page_remove_rmap(p, false);
+	}
+
+	/* step 3: flip page table */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(&range);
+
+	/* step 4: free pgtable, set refcount, mm_counters, etc. */
+	page_ref_sub(page, HPAGE_PMD_NR);
+	unlock_page(hpage);
+	mm_dec_nr_ptes(mm);
+	pte_free(mm, pmd_pgtable(_pmd));
+	add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int split_huge_pages_set(void *data, u64 val)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
                   ` (4 preceding siblings ...)
  2019-06-12 22:03 ` [PATCH v3 5/5] uprobe: collapse THP pmd after removing all uprobes Song Liu
@ 2019-06-12 22:03 ` Song Liu
  2019-06-12 22:16   ` Song Liu
  5 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: namit, peterz, oleg, rostedt, mhiramat, matthew.wilcox,
	kirill.shutemov, kernel-team, Song Liu

This patch introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says,
FOLL_SPLIT_PMD splits the huge pmd for the given mm_struct, while the
underlying huge page stays as-is.

FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
but would like to switch back to the huge page and huge pmd later. One
such example is uprobe. The following patches use FOLL_SPLIT_PMD in
uprobe.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ab8c7d84cd0..e605acc4fc81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2642,6 +2642,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf9e4..3d05bddb56c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT) {
+	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
 		page = pmd_page(*pmd);
 		if (is_huge_zero_page(page)) {
@@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			split_huge_pmd(vma, pmd, address);
 			if (pmd_trans_unstable(pmd))
 				ret = -EBUSY;
-		} else {
+		} else if (flags & FOLL_SPLIT) {
 			if (unlikely(!try_get_page(page))) {
 				spin_unlock(ptl);
 				return ERR_PTR(-ENOMEM);
@@ -419,8 +419,40 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			put_page(page);
 			if (pmd_none(*pmd))
 				return no_page_table(vma, flags);
-		}
+		} else {  /* flags & FOLL_SPLIT_PMD */
+			unsigned long addr;
+			pgprot_t prot;
+			pte_t *pte;
+			int i;
+
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, address);
+			lock_page(page);
+			pte = get_locked_pte(mm, address, &ptl);
+			if (!pte) {
+				unlock_page(page);
+				return no_page_table(vma, flags);
+			}
 
+			/* get refcount for every small page */
+			page_ref_add(page, HPAGE_PMD_NR);
+
+			prot = READ_ONCE(vma->vm_page_prot);
+			for (i = 0, addr = address & PMD_MASK;
+			     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+				struct page *p = page + i;
+
+				pte = pte_offset_map(pmd, addr);
+				VM_BUG_ON(!pte_none(*pte));
+				set_pte_at(mm, addr, pte, mk_pte(p, prot));
+				page_add_file_rmap(p, false);
+			}
+
+			spin_unlock(ptl);
+			unlock_page(page);
+			add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
+			ret = 0;
+		}
 		return ret ? ERR_PTR(ret) :
 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-12 22:03 ` [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
@ 2019-06-12 22:16   ` Song Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-12 22:16 UTC (permalink / raw)
  To: LKML, linux-mm
  Cc: namit, Peter Zijlstra, Oleg Nesterov, Steven Rostedt, mhiramat,
	matthew.wilcox, kirill.shutemov, Kernel Team



> On Jun 12, 2019, at 3:03 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> This patches introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says
> FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
> page stays as-is.
> 
> FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
> but would switch back to huge page and huge pmd on. One of such example
> is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>

Please ignore this one. It is a duplicate of 3/5, sent by accident.

Sorry for the noise. 

Song

> ---
> include/linux/mm.h |  1 +
> mm/gup.c           | 38 +++++++++++++++++++++++++++++++++++---
> 2 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0ab8c7d84cd0..e605acc4fc81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2642,6 +2642,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> #define FOLL_COW	0x4000	/* internal GUP flag */
> #define FOLL_ANON	0x8000	/* don't do file mappings */
> #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> 
> /*
>  * NOTE on FOLL_LONGTERM:
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..3d05bddb56c9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> 		spin_unlock(ptl);
> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> 		int ret;
> 		page = pmd_page(*pmd);
> 		if (is_huge_zero_page(page)) {
> @@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> 			split_huge_pmd(vma, pmd, address);
> 			if (pmd_trans_unstable(pmd))
> 				ret = -EBUSY;
> -		} else {
> +		} else if (flags & FOLL_SPLIT) {
> 			if (unlikely(!try_get_page(page))) {
> 				spin_unlock(ptl);
> 				return ERR_PTR(-ENOMEM);
> @@ -419,8 +419,40 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> 			put_page(page);
> 			if (pmd_none(*pmd))
> 				return no_page_table(vma, flags);
> -		}
> +		} else {  /* flags & FOLL_SPLIT_PMD */
> +			unsigned long addr;
> +			pgprot_t prot;
> +			pte_t *pte;
> +			int i;
> +
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, address);
> +			lock_page(page);
> +			pte = get_locked_pte(mm, address, &ptl);
> +			if (!pte) {
> +				unlock_page(page);
> +				return no_page_table(vma, flags);
> +			}
> 
> +			/* get refcount for every small page */
> +			page_ref_add(page, HPAGE_PMD_NR);
> +
> +			prot = READ_ONCE(vma->vm_page_prot);
> +			for (i = 0, addr = address & PMD_MASK;
> +			     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> +				struct page *p = page + i;
> +
> +				pte = pte_offset_map(pmd, addr);
> +				VM_BUG_ON(!pte_none(*pte));
> +				set_pte_at(mm, addr, pte, mk_pte(p, prot));
> +				page_add_file_rmap(p, false);
> +			}
> +
> +			spin_unlock(ptl);
> +			unlock_page(page);
> +			add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
> +			ret = 0;
> +		}
> 		return ret ? ERR_PTR(ret) :
> 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 	}
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-12 22:03 ` [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
@ 2019-06-13 12:57   ` Kirill A. Shutemov
  2019-06-13 13:57     ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2019-06-13 12:57 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-mm, namit, peterz, oleg, rostedt, mhiramat,
	matthew.wilcox, kirill.shutemov, kernel-team

On Wed, Jun 12, 2019 at 03:03:17PM -0700, Song Liu wrote:
> This patches introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says
> FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
> page stays as-is.
> 
> FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
> but would switch back to huge page and huge pmd on. One of such example
> is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 38 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0ab8c7d84cd0..e605acc4fc81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2642,6 +2642,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_COW	0x4000	/* internal GUP flag */
>  #define FOLL_ANON	0x8000	/* don't do file mappings */
>  #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
>  
>  /*
>   * NOTE on FOLL_LONGTERM:
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..3d05bddb56c9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>  		int ret;
>  		page = pmd_page(*pmd);
>  		if (is_huge_zero_page(page)) {
> @@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			split_huge_pmd(vma, pmd, address);
>  			if (pmd_trans_unstable(pmd))
>  				ret = -EBUSY;
> -		} else {
> +		} else if (flags & FOLL_SPLIT) {
>  			if (unlikely(!try_get_page(page))) {
>  				spin_unlock(ptl);
>  				return ERR_PTR(-ENOMEM);
> @@ -419,8 +419,40 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  			put_page(page);
>  			if (pmd_none(*pmd))
>  				return no_page_table(vma, flags);
> -		}
> +		} else {  /* flags & FOLL_SPLIT_PMD */
> +			unsigned long addr;
> +			pgprot_t prot;
> +			pte_t *pte;
> +			int i;
> +
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, address);

All the code below is only relevant for file-backed THP. It will break for
anon-THP.

And I'm not convinced that it belongs here at all. User requested PMD
split and it is done after split_huge_pmd(). The rest can be handled by
the caller as needed.

> +			lock_page(page);
> +			pte = get_locked_pte(mm, address, &ptl);
> +			if (!pte) {
> +				unlock_page(page);
> +				return no_page_table(vma, flags);

Or should it be -ENOMEM?

> +			}
>  
> +			/* get refcount for every small page */
> +			page_ref_add(page, HPAGE_PMD_NR);
> +
> +			prot = READ_ONCE(vma->vm_page_prot);
> +			for (i = 0, addr = address & PMD_MASK;
> +			     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> +				struct page *p = page + i;
> +
> +				pte = pte_offset_map(pmd, addr);
> +				VM_BUG_ON(!pte_none(*pte));
> +				set_pte_at(mm, addr, pte, mk_pte(p, prot));
> +				page_add_file_rmap(p, false);
> +			}
> +
> +			spin_unlock(ptl);
> +			unlock_page(page);
> +			add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
> +			ret = 0;
> +		}
>  		return ret ? ERR_PTR(ret) :
>  			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -- 
> 2.17.1
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 12:57   ` Kirill A. Shutemov
@ 2019-06-13 13:57     ` Song Liu
  2019-06-13 14:16       ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-13 13:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: LKML, linux-mm, namit, peterz, oleg, rostedt, mhiramat,
	matthew.wilcox, kirill.shutemov, Kernel Team



> On Jun 13, 2019, at 5:57 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Wed, Jun 12, 2019 at 03:03:17PM -0700, Song Liu wrote:
>> This patches introduces a new foll_flag: FOLL_SPLIT_PMD. As the name says
>> FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
>> page stays as-is.
>> 
>> FOLL_SPLIT_PMD is useful for cases where we need to use regular pages,
>> but would switch back to huge page and huge pmd on. One of such example
>> is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> include/linux/mm.h |  1 +
>> mm/gup.c           | 38 +++++++++++++++++++++++++++++++++++---
>> 2 files changed, 36 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 0ab8c7d84cd0..e605acc4fc81 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2642,6 +2642,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>> #define FOLL_COW	0x4000	/* internal GUP flag */
>> #define FOLL_ANON	0x8000	/* don't do file mappings */
>> #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>> +#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
>> 
>> /*
>>  * NOTE on FOLL_LONGTERM:
>> diff --git a/mm/gup.c b/mm/gup.c
>> index ddde097cf9e4..3d05bddb56c9 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 		spin_unlock(ptl);
>> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>> 	}
>> -	if (flags & FOLL_SPLIT) {
>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>> 		int ret;
>> 		page = pmd_page(*pmd);
>> 		if (is_huge_zero_page(page)) {
>> @@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			split_huge_pmd(vma, pmd, address);
>> 			if (pmd_trans_unstable(pmd))
>> 				ret = -EBUSY;
>> -		} else {
>> +		} else if (flags & FOLL_SPLIT) {
>> 			if (unlikely(!try_get_page(page))) {
>> 				spin_unlock(ptl);
>> 				return ERR_PTR(-ENOMEM);
>> @@ -419,8 +419,40 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>> 			put_page(page);
>> 			if (pmd_none(*pmd))
>> 				return no_page_table(vma, flags);
>> -		}
>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>> +			unsigned long addr;
>> +			pgprot_t prot;
>> +			pte_t *pte;
>> +			int i;
>> +
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, address);
> 
> All the code below is only relevant for file-backed THP. It will break for
> anon-THP.

Oh, yes, that makes sense. 

> 
> And I'm not convinced that it belongs here at all. User requested PMD
> split and it is done after split_huge_pmd(). The rest can be handled by
> the caller as needed.

I put this part here because, for file-backed THP, the split is not
really complete after split_huge_pmd() returns. And I would like it to be
complete before calling follow_page_pte() below. Maybe we can still do
this here, just for file-backed THPs?

If we were to move it, should we move it to the callers of
follow_page_mask()? In that case, we would probably end up with similar
code in two places: __get_user_pages() and follow_page().

Did I get this right?

> 
>> +			lock_page(page);
>> +			pte = get_locked_pte(mm, address, &ptl);
>> +			if (!pte) {
>> +				unlock_page(page);
>> +				return no_page_table(vma, flags);
> 
> Or should it be -ENOMEM?

Yeah, ENOMEM is more accurate. 

Thanks,
Song

> 
>> +			}
>> 
>> +			/* get refcount for every small page */
>> +			page_ref_add(page, HPAGE_PMD_NR);
>> +
>> +			prot = READ_ONCE(vma->vm_page_prot);
>> +			for (i = 0, addr = address & PMD_MASK;
>> +			     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>> +				struct page *p = page + i;
>> +
>> +				pte = pte_offset_map(pmd, addr);
>> +				VM_BUG_ON(!pte_none(*pte));
>> +				set_pte_at(mm, addr, pte, mk_pte(p, prot));
>> +				page_add_file_rmap(p, false);
>> +			}
>> +
>> +			spin_unlock(ptl);
>> +			unlock_page(page);
>> +			add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
>> +			ret = 0;
>> +		}
>> 		return ret ? ERR_PTR(ret) :
>> 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>> 	}
>> -- 
>> 2.17.1
>> 
> 
> -- 
> Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 13:57     ` Song Liu
@ 2019-06-13 14:16       ` Kirill A. Shutemov
  2019-06-13 15:03         ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2019-06-13 14:16 UTC (permalink / raw)
  To: Song Liu
  Cc: LKML, linux-mm, namit, peterz, oleg, rostedt, mhiramat,
	matthew.wilcox, kirill.shutemov, Kernel Team

On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
> > And I'm not convinced that it belongs here at all. User requested PMD
> > split and it is done after split_huge_pmd(). The rest can be handled by
> > the caller as needed.
> 
> I put this part here because split_huge_pmd() for file-backed THP is
> not really done after split_huge_pmd(). And I would like it done before
> calling follow_page_pte() below. Maybe we can still do them here, just 
> for file-backed THPs?
> 
> If we would move it, shall we move to callers of follow_page_mask()? 
> In that case, we will probably end up with similar code in two places:
> __get_user_pages() and follow_page(). 
> 
> Did I get this right?

Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
with pte_alloc_map_lock()?

This will leave a bunch of unpopulated PTE entries, but that is fine: they
will be populated on the next access.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 14:16       ` Kirill A. Shutemov
@ 2019-06-13 15:03         ` Song Liu
  2019-06-13 15:14           ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-13 15:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: LKML, linux-mm, namit, peterz, oleg, rostedt, mhiramat,
	matthew.wilcox, kirill.shutemov, Kernel Team



> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>> And I'm not convinced that it belongs here at all. User requested PMD
>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>> the caller as needed.
>> 
>> I put this part here because split_huge_pmd() for file-backed THP is
>> not really done after split_huge_pmd(). And I would like it done before
>> calling follow_page_pte() below. Maybe we can still do them here, just 
>> for file-backed THPs?
>> 
>> If we would move it, shall we move to callers of follow_page_mask()? 
>> In that case, we will probably end up with similar code in two places:
>> __get_user_pages() and follow_page(). 
>> 
>> Did I get this right?
> 
> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
> with pte_alloc_map_lock()?

This is similar to my previous version:

+		} else {  /* flags & FOLL_SPLIT_PMD */
+			pte_t *pte;
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, address);
+			pte = get_locked_pte(mm, address, &ptl);
+			if (!pte)
+				return no_page_table(vma, flags);
+			spin_unlock(ptl);
+			ret = 0;
+		}

I think this is cleaner than using pte_alloc_map_lock() in follow_page_pte().
What are your thoughts on these two versions (^^^ vs. pte_alloc_map_lock)?


> This will leave bunch not populated PTE entries, but it is fine: they will
> be populated on the next access to them.

We would need to handle a page fault on the next access, right? Since we
have already allocated everything, we can just populate the PTE entries
and save a lot of page faults (assuming we will access them later).

Thanks,
Song

> 
> -- 
> Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 15:03         ` Song Liu
@ 2019-06-13 15:14           ` Kirill A. Shutemov
  2019-06-13 15:24             ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2019-06-13 15:14 UTC (permalink / raw)
  To: Song Liu
  Cc: Kirill A. Shutemov, LKML, linux-mm, namit, peterz, oleg, rostedt,
	mhiramat, matthew.wilcox, Kernel Team

On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
> 
> 
> > On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
> >>> And I'm not convinced that it belongs here at all. User requested PMD
> >>> split and it is done after split_huge_pmd(). The rest can be handled by
> >>> the caller as needed.
> >> 
> >> I put this part here because split_huge_pmd() for file-backed THP is
> >> not really done after split_huge_pmd(). And I would like it done before
> >> calling follow_page_pte() below. Maybe we can still do them here, just 
> >> for file-backed THPs?
> >> 
> >> If we would move it, shall we move to callers of follow_page_mask()? 
> >> In that case, we will probably end up with similar code in two places:
> >> __get_user_pages() and follow_page(). 
> >> 
> >> Did I get this right?
> > 
> > Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
> > with pte_alloc_map_lock()?
> 
> This is similar to my previous version:
> 
> +		} else {  /* flags & FOLL_SPLIT_PMD */
> +			pte_t *pte;
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, address);
> +			pte = get_locked_pte(mm, address, &ptl);
> +			if (!pte)
> +				return no_page_table(vma, flags);
> +			spin_unlock(ptl);
> +			ret = 0;
> +		}
> 
> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte(). 
> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?

It's an additional lock-unlock cycle and a few more lines of code...

> > This will leave bunch not populated PTE entries, but it is fine: they will
> > be populated on the next access to them.
> 
> We need to handle page fault during next access, right? Since we already
> allocated everything, we can just populate the PTE entries and saves a
> lot of page faults (assuming we will access them later). 

Not a lot, due to faultaround, and they may never happen; but you need to
tear down the mapping anyway.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 15:14           ` Kirill A. Shutemov
@ 2019-06-13 15:24             ` Song Liu
  2019-06-13 16:47               ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-13 15:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, LKML, linux-mm, namit, peterz, oleg, rostedt,
	mhiramat, matthew.wilcox, Kernel Team



> On Jun 13, 2019, at 8:14 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> 
> On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>> 
>>> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>>>> And I'm not convinced that it belongs here at all. User requested PMD
>>>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>>>> the caller as needed.
>>>> 
>>>> I put this part here because split_huge_pmd() for file-backed THP is
>>>> not really done after split_huge_pmd(). And I would like it done before
>>>> calling follow_page_pte() below. Maybe we can still do them here, just 
>>>> for file-backed THPs?
>>>> 
>>>> If we would move it, shall we move to callers of follow_page_mask()? 
>>>> In that case, we will probably end up with similar code in two places:
>>>> __get_user_pages() and follow_page(). 
>>>> 
>>>> Did I get this right?
>>> 
>>> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
>>> with pte_alloc_map_lock()?
>> 
>> This is similar to my previous version:
>> 
>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>> +			pte_t *pte;
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, address);
>> +			pte = get_locked_pte(mm, address, &ptl);
>> +			if (!pte)
>> +				return no_page_table(vma, flags);
>> +			spin_unlock(ptl);
>> +			ret = 0;
>> +		}
>> 
>> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte(). 
>> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?
> 
> It's additional lock-unlock cycle and few more lines of code...
> 
>>> This will leave bunch not populated PTE entries, but it is fine: they will
>>> be populated on the next access to them.
>> 
>> We need to handle page fault during next access, right? Since we already
>> allocated everything, we can just populate the PTE entries and saves a
>> lot of page faults (assuming we will access them later). 
> 
> Not a lot due to faultaround and they may never happen, but you need to
> tear down the mapping any way.

I see. Let me try this way. 

Thanks,
Song

> 
> -- 
> Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 15:24             ` Song Liu
@ 2019-06-13 16:47               ` Song Liu
  2019-06-13 17:42                 ` Song Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Song Liu @ 2019-06-13 16:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, LKML, linux-mm, namit, peterz, oleg, rostedt,
	mhiramat, matthew.wilcox, Kernel Team

Hi Kirill,

> On Jun 13, 2019, at 8:24 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Jun 13, 2019, at 8:14 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
>> 
>> On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
>>> 
>>> 
>>>> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>> 
>>>> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>>>>> And I'm not convinced that it belongs here at all. User requested PMD
>>>>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>>>>> the caller as needed.
>>>>> 
>>>>> I put this part here because split_huge_pmd() for file-backed THP is
>>>>> not really done after split_huge_pmd(). And I would like it done before
>>>>> calling follow_page_pte() below. Maybe we can still do them here, just 
>>>>> for file-backed THPs?
>>>>> 
>>>>> If we would move it, shall we move to callers of follow_page_mask()? 
>>>>> In that case, we will probably end up with similar code in two places:
>>>>> __get_user_pages() and follow_page(). 
>>>>> 
>>>>> Did I get this right?
>>>> 
>>>> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
>>>> with pte_alloc_map_lock()?
>>> 
>>> This is similar to my previous version:
>>> 
>>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>>> +			pte_t *pte;
>>> +			spin_unlock(ptl);
>>> +			split_huge_pmd(vma, pmd, address);
>>> +			pte = get_locked_pte(mm, address, &ptl);
>>> +			if (!pte)
>>> +				return no_page_table(vma, flags);
>>> +			spin_unlock(ptl);
>>> +			ret = 0;
>>> +		}
>>> 
>>> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte(). 
>>> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?
>> 
>> It's additional lock-unlock cycle and few more lines of code...
>> 
>>>> This will leave bunch not populated PTE entries, but it is fine: they will
>>>> be populated on the next access to them.
>>> 
>>> We need to handle page fault during next access, right? Since we already
>>> allocated everything, we can just populate the PTE entries and saves a
>>> lot of page faults (assuming we will access them later). 
>> 
>> Not a lot due to faultaround and they may never happen, but you need to
>> tear down the mapping any way.
> 
> I see. Let me try this way. 
> 
> Thanks,
> Song

To make sure I understand your suggestion, here is what I got:

diff --git c/mm/gup.c w/mm/gup.c
index ddde097cf9e4..85e6f46fd925 100644
--- c/mm/gup.c
+++ w/mm/gup.c
@@ -197,7 +197,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
        if (unlikely(pmd_bad(*pmd)))
                return no_page_table(vma, flags);

-       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+       ptep = pte_alloc_map_lock(mm, pmd, address, &ptl);
+       if (!ptep)
+               return ERR_PTR(-ENOMEM);
+
        pte = *ptep;
        if (!pte_present(pte)) {
                swp_entry_t entry;
@@ -398,7 +401,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                spin_unlock(ptl);
                return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        }
-       if (flags & FOLL_SPLIT) {
+       if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
                int ret;
                page = pmd_page(*pmd);
                if (is_huge_zero_page(page)) {
@@ -407,7 +410,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        split_huge_pmd(vma, pmd, address);
                        if (pmd_trans_unstable(pmd))
                                ret = -EBUSY;
-               } else {
+               } else if (flags & FOLL_SPLIT) {
                        if (unlikely(!try_get_page(page))) {
                                spin_unlock(ptl);
                                return ERR_PTR(-ENOMEM);
@@ -419,6 +422,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        put_page(page);
                        if (pmd_none(*pmd))
                                return no_page_table(vma, flags);
+               } else {  /* flags & FOLL_SPLIT_PMD */
+                       spin_unlock(ptl);
+                       split_huge_pmd(vma, pmd, address);
+                       ret = 0;
                }

                return ret ? ERR_PTR(ret) :
                        follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);


This version doesn't work as-is, because it returns at the first check:

        if (unlikely(pmd_bad(*pmd)))
                return no_page_table(vma, flags);

Did I misunderstand anything here?

Thanks,
Song


> 
>> 
>> -- 
>> Kirill A. Shutemov


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD
  2019-06-13 16:47               ` Song Liu
@ 2019-06-13 17:42                 ` Song Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Song Liu @ 2019-06-13 17:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, LKML, linux-mm, namit, peterz, oleg, rostedt,
	mhiramat, matthew.wilcox, Kernel Team



> On Jun 13, 2019, at 9:47 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Kirill,
> 
>> On Jun 13, 2019, at 8:24 AM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jun 13, 2019, at 8:14 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
>>> 
>>> On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>>> 
>>>>> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>>>>>> And I'm not convinced that it belongs here at all. User requested PMD
>>>>>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>>>>>> the caller as needed.
>>>>>> 
>>>>>> I put this part here because split_huge_pmd() for file-backed THP is
>>>>>> not really done after split_huge_pmd(). And I would like it done before
>>>>>> calling follow_page_pte() below. Maybe we can still do them here, just 
>>>>>> for file-backed THPs?
>>>>>> 
>>>>>> If we would move it, shall we move to callers of follow_page_mask()? 
>>>>>> In that case, we will probably end up with similar code in two places:
>>>>>> __get_user_pages() and follow_page(). 
>>>>>> 
>>>>>> Did I get this right?
>>>>> 
>>>>> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
>>>>> with pte_alloc_map_lock()?
>>>> 
>>>> This is similar to my previous version:
>>>> 
>>>> +		} else {  /* flags & FOLL_SPLIT_PMD */
>>>> +			pte_t *pte;
>>>> +			spin_unlock(ptl);
>>>> +			split_huge_pmd(vma, pmd, address);
>>>> +			pte = get_locked_pte(mm, address, &ptl);
>>>> +			if (!pte)
>>>> +				return no_page_table(vma, flags);
>>>> +			spin_unlock(ptl);
>>>> +			ret = 0;
>>>> +		}
>>>> 
>>>> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte(). 
>>>> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?
>>> 
>>> It's additional lock-unlock cycle and few more lines of code...
>>> 
>>>>> This will leave bunch not populated PTE entries, but it is fine: they will
>>>>> be populated on the next access to them.
>>>> 
>>>> We need to handle page fault during next access, right? Since we already
>>>> allocated everything, we can just populate the PTE entries and saves a
>>>> lot of page faults (assuming we will access them later). 
>>> 
>>> Not a lot due to faultaround and they may never happen, but you need to
>>> tear down the mapping any way.
>> 
>> I see. Let me try this way. 
>> 
>> Thanks,
>> Song
> 
> To make sure I understand your suggestions. Here is what I got:
> 
> diff --git c/mm/gup.c w/mm/gup.c
> index ddde097cf9e4..85e6f46fd925 100644
> --- c/mm/gup.c
> +++ w/mm/gup.c
> @@ -197,7 +197,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>        if (unlikely(pmd_bad(*pmd)))
>                return no_page_table(vma, flags);
> 
> -       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> +       ptep = pte_alloc_map_lock(mm, pmd, address, &ptl);
> +       if (!ptep)
> +               return ERR_PTR(-ENOMEM);
> +
>        pte = *ptep;
>        if (!pte_present(pte)) {
>                swp_entry_t entry;
> @@ -398,7 +401,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                spin_unlock(ptl);
>                return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>        }
> -       if (flags & FOLL_SPLIT) {
> +       if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>                int ret;
>                page = pmd_page(*pmd);
>                if (is_huge_zero_page(page)) {
> @@ -407,7 +410,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                        split_huge_pmd(vma, pmd, address);
>                        if (pmd_trans_unstable(pmd))
>                                ret = -EBUSY;
> -               } else {
> +               } else if (flags & FOLL_SPLIT) {
>                        if (unlikely(!try_get_page(page))) {
>                                spin_unlock(ptl);
>                                return ERR_PTR(-ENOMEM);
> @@ -419,6 +422,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                        put_page(page);
>                        if (pmd_none(*pmd))
>                                return no_page_table(vma, flags);
> +               } else {  /* flags & FOLL_SPLIT_PMD */
> +                       spin_unlock(ptl);
> +                       split_huge_pmd(vma, pmd, address);
> +                       ret = 0;
>                }
> 
>                return ret ? ERR_PTR(ret) :
>                        follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 
> 
> This version doesn't work as-is, because it returns at the first check:
> 
>        if (unlikely(pmd_bad(*pmd)))
>                return no_page_table(vma, flags);
> 
> Did I misunderstand anything here?
> 
> Thanks,
> Song

I guess this would be the best. It _is_ a lot simpler. 

diff --git c/mm/gup.c w/mm/gup.c
index ddde097cf9e4..0cd3ce599f41 100644
--- c/mm/gup.c
+++ w/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                spin_unlock(ptl);
                return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        }
-       if (flags & FOLL_SPLIT) {
+       if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
                int ret;
                page = pmd_page(*pmd);
                if (is_huge_zero_page(page)) {
@@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        split_huge_pmd(vma, pmd, address);
                        if (pmd_trans_unstable(pmd))
                                ret = -EBUSY;
-               } else {
+               } else if (flags & FOLL_SPLIT) {
                        if (unlikely(!try_get_page(page))) {
                                spin_unlock(ptl);
                                return ERR_PTR(-ENOMEM);
@@ -419,6 +419,11 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        put_page(page);
                        if (pmd_none(*pmd))
                                return no_page_table(vma, flags);
+               } else {  /* flags & FOLL_SPLIT_PMD */
+                       spin_unlock(ptl);
+                       ret = 0;
+                       split_huge_pmd(vma, pmd, address);
+                       pte_alloc(mm, pmd);
                }

Thanks again for the suggestions. I will send v4 soon. 

Song


> 
> 
>> 
>>> 
>>> -- 
>>> Kirill A. Shutemov


^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-06-13 17:43 UTC | newest]

Thread overview: 16+ messages
2019-06-12 22:03 [PATCH v3 0/5] THP aware uprobe Song Liu
2019-06-12 22:03 ` [PATCH v3 1/5] mm: move memcmp_pages() and pages_identical() Song Liu
2019-06-12 22:03 ` [PATCH v3 2/5] uprobe: use original page when all uprobes are removed Song Liu
2019-06-12 22:03 ` [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
2019-06-13 12:57   ` Kirill A. Shutemov
2019-06-13 13:57     ` Song Liu
2019-06-13 14:16       ` Kirill A. Shutemov
2019-06-13 15:03         ` Song Liu
2019-06-13 15:14           ` Kirill A. Shutemov
2019-06-13 15:24             ` Song Liu
2019-06-13 16:47               ` Song Liu
2019-06-13 17:42                 ` Song Liu
2019-06-12 22:03 ` [PATCH v3 4/5] uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT Song Liu
2019-06-12 22:03 ` [PATCH v3 5/5] uprobe: collapse THP pmd after removing all uprobes Song Liu
2019-06-12 22:03 ` [PATCH v3 6/6] mm, thp: introduce FOLL_SPLIT_PMD Song Liu
2019-06-12 22:16   ` Song Liu
