* [PATCH v2 0/4] mm: Break COW for pinned pages during fork()
@ 2020-09-25 22:25 Peter Xu
  2020-09-25 22:25 ` [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Peter Xu @ 2020-09-25 22:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Jason Gunthorpe, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

Due to the rebase onto the latest rc6, the major pte copy patch changed a lot,
so it's probably not that useful to write a changelog any more.  However, all
the comments discussed in the previous thread should have been addressed.
Please shoot if I missed anything important.

This series is largely inspired by the previous discussion on the list [1],
starting from Jason's report of the rdma test failure.  Linus proposed the
solution, which seems to be a very nice approach to avoid breaking userspace
apps that didn't use MADV_DONTFORK properly before.  More information can be
found in that thread.
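
In short: at fork() time, if a page in a COW mapping may have been pinned by
the parent, copy it for the child right away instead of sharing it
write-protected.  Below is a condensed sketch of the per-pte policy
(simplified from patch 3; not the exact code or locking):

    if (is_cow_mapping(vma->vm_flags) && pte_write(pte))
            ptep_set_wrprotect(src_mm, addr, src_pte);      /* fence off fast-gup */

    if (atomic_read(&src_mm->has_pinned) && page_maybe_dma_pinned(page)) {
            /* Maybe pinned: copy now, so the pin keeps pointing at the parent's page */
            copy_user_highpage(new_page, page, addr, vma);
            set_pte_at(dst_mm, addr, dst_pte, mk_pte(new_page, new->vm_page_prot));
    } else {
            /* Normal COW: share the page write-protected on both sides */
            set_pte_at(dst_mm, addr, dst_pte, pte_wrprotect(pte));
    }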

I tested it myself with fork() after vfio pinned a bunch of device pages, and
I verified that the new copy pte logic worked as expected, at least in the
most general path.  However I didn't test the thp case yet because, afaict,
vfio does not support thp-backed dma pages.  Luckily, the pmd/pud thp patch is
much more straightforward than the pte one, so hopefully it can be verified
directly by code review plus some heavier-weight rdma tests.

Patch 1:      Introduce mm.has_pinned
Patch 2:      Preparation patch
Patch 3:      Early cow solution for pte copy for pinned pages
Patch 4:      Same as above, but for thp (pmd/pud).

The hugetlbfs fix is still missing, but as planned that is not urgent, so we
can work on it on top.  Comments greatly welcomed.

[1] https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Thanks.

Peter Xu (4):
  mm: Introduce mm_struct.has_pinned
  mm/fork: Pass new vma pointer into copy_page_range()
  mm: Do early cow for pinned pages during fork() for ptes
  mm/thp: Split huge pmds/puds if they're pinned when fork()

 include/linux/mm.h       |   2 +-
 include/linux/mm_types.h |  10 +++
 kernel/fork.c            |   3 +-
 mm/gup.c                 |   6 ++
 mm/huge_memory.c         |  28 ++++++
 mm/memory.c              | 186 ++++++++++++++++++++++++++++++++++-----
 6 files changed, 212 insertions(+), 23 deletions(-)

-- 
2.26.2




^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned
  2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
@ 2020-09-25 22:25 ` Peter Xu
  2020-09-25 22:25 ` [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2020-09-25 22:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Jason Gunthorpe, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

(Commit message mostly collected from Jason Gunthorpe)

Reduce the chance of false positives from page_maybe_dma_pinned() by keeping
track of whether the mm_struct has ever been used with pin_user_pages().  This
allows cases that might drive up the page ref_count to avoid any penalty from
handling dma_pinned pages.

Future work is planned to provide a more sophisticated solution, likely
turning it into a real counter.  For now, make it an atomic_t but use it as a
boolean for simplicity.
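
For context, page_maybe_dma_pinned() is only a refcount-based heuristic
(roughly the following, as of this series), which is why unrelated refcount
users can produce false positives and why gating on has_pinned helps:

    static inline bool page_maybe_dma_pinned(struct page *page)
    {
            if (hpage_pincount_available(page))
                    return compound_pincount(page) > 0;
            /* GUP_PIN_COUNTING_BIAS is 1024; plain gets only add 1 each */
            return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
    }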

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm_types.h | 10 ++++++++++
 kernel/fork.c            |  1 +
 mm/gup.c                 |  6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..ed028af3cb19 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -436,6 +436,16 @@ struct mm_struct {
 		 */
 		atomic_t mm_count;
 
+		/**
+		 * @has_pinned: Whether this mm has pinned any pages.  This can
+		 * be either replaced in the future by @pinned_vm when it
+		 * becomes stable, or grow into a counter on its own. We're
+		 * aggressive on this bit now - even if the pinned pages were
+		 * unpinned later on, we'll still keep this bit set for the
+		 * lifecycle of this mm just for simplicity.
+		 */
+		atomic_t has_pinned;
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 49677d668de4..e65d8192d080 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1011,6 +1011,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
+	atomic_set(&mm->has_pinned, 0);
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..238667445337 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1255,6 +1255,9 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
 		BUG_ON(*locked != 1);
 	}
 
+	if (flags & FOLL_PIN)
+		atomic_set(&current->mm->has_pinned, 1);
+
 	/*
 	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
 	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
@@ -2660,6 +2663,9 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 				       FOLL_FAST_ONLY)))
 		return -EINVAL;
 
+	if (gup_flags & FOLL_PIN)
+		atomic_set(&current->mm->has_pinned, 1);
+
 	if (!(gup_flags & FOLL_FAST_ONLY))
 		might_lock_read(&current->mm->mmap_lock);
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range()
  2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
  2020-09-25 22:25 ` [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned Peter Xu
@ 2020-09-25 22:25 ` Peter Xu
  2020-09-30 13:30   ` Kirill A. Shutemov
  2020-09-25 22:25 ` [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2020-09-25 22:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Jason Gunthorpe, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

This prepares for the future work to trigger early cow on pinned pages during
fork().  No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  2 +-
 kernel/fork.c      |  2 +-
 mm/memory.c        | 14 +++++++++-----
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b2f370f0b420..ae914be004b3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1646,7 +1646,7 @@ struct mmu_notifier_range;
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		    struct vm_area_struct *vma, struct vm_area_struct *new);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   struct mmu_notifier_range *range,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
diff --git a/kernel/fork.c b/kernel/fork.c
index e65d8192d080..da8d360fb032 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -589,7 +589,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(mm, oldmm, mpnt);
+			retval = copy_page_range(mm, oldmm, mpnt, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/memory.c b/mm/memory.c
index e315b1f1ef08..4c56d7b92b0e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -819,6 +819,7 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   struct vm_area_struct *new,
 		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
@@ -889,6 +890,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		struct vm_area_struct *new,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -915,7 +917,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   vma, new, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
@@ -923,6 +925,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
+		struct vm_area_struct *new,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -949,7 +952,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   vma, new, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
@@ -957,6 +960,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 
 static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		struct vm_area_struct *new,
 		unsigned long addr, unsigned long end)
 {
 	p4d_t *src_p4d, *dst_p4d;
@@ -971,14 +975,14 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
-						vma, addr, next))
+				   vma, new, addr, next))
 			return -ENOMEM;
 	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		    struct vm_area_struct *vma, struct vm_area_struct *new)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
@@ -1033,7 +1037,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    vma, new, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
  2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
  2020-09-25 22:25 ` [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned Peter Xu
  2020-09-25 22:25 ` [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
@ 2020-09-25 22:25 ` Peter Xu
  2020-09-26 23:23   ` Jason Gunthorpe
  2020-09-25 22:26 ` [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
  2020-09-27 19:35 ` [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Linus Torvalds
  4 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2020-09-25 22:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Jason Gunthorpe, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

This allows copy_pte_range() to do early cow if the pages were pinned on the
source mm.  Currently we don't have an accurate way to know whether a page is
pinned or not; the only thing we have is page_maybe_dma_pinned().  However
that's good enough for now, especially with the newly added mm->has_pinned
flag to make sure we won't affect processes that never pinned any pages.

It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte().  Unfortunately, we can't, because we hold the page table locks
for both the parent and the child processes.  So the page allocation needs to
be done outside copy_one_pte().

There is some trickery in copy_present_pte(), mainly the wrprotect trick to
block concurrent fast-gup.  The comments in the function explain it in place.
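
To illustrate the ordering this relies on (a comment-only sketch; the
authoritative description is the block comment added to copy_present_pte() in
the diff below):

    /*
     *   fork(): copy_present_pte()        fast-gup: gup_pte_range()
     *   --------------------------        -------------------------
     *   ptep_set_wrprotect(src_pte)       pte = READ_ONCE(*ptep)   (writable)
     *   page_maybe_dma_pinned(page)?      try_grab_compound_head() (refcount up)
     *                                     re-read *ptep, back off if it changed
     *
     * Either fork() observes the elevated refcount and copies the page, or
     * fast-gup observes the write-protected pte on the re-check and fails.
     */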

Oleg Nesterov reported a (probably harmless) bug during review: we didn't
reset entry.val properly in copy_pte_range(), so there was potentially a
chance of calling add_swap_count_continuation() multiple times on the same swp
entry.  That should be harmless, since even if it happens, the same function
(add_swap_count_continuation()) will return directly after noticing that there
is enough space for the swp counter.  So instead of a standalone stable patch,
it is touched up in this patch directly.

Reference discussions:

  https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 156 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return 0;
 }
 
-static inline void
+/*
+ * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		struct vm_area_struct *new,
+		unsigned long addr, int *rss, struct page **prealloc)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 
+	page = vm_normal_page(vma, addr, pte);
+	if (page) {
+		if (is_cow_mapping(vm_flags)) {
+			bool is_write = pte_write(pte);
+
+			/*
+			 * The trick starts.
+			 *
+			 * What we want to do is to check whether this page may
+			 * have been pinned by the parent process.  If so,
+			 * instead of wrprotect the pte on both sides, we copy
+			 * the page immediately so that we'll always guarantee
+			 * the pinned page won't be randomly replaced in the
+			 * future.
+			 *
+			 * To achieve this, we do the following:
+			 *
+			 * 1. Write-protect the pte if it's writable.  This is
+			 *    to protect concurrent write fast-gup with
+			 *    FOLL_PIN, so that we'll fail the fast-gup with
+			 *    the write bit removed.
+			 *
+			 * 2. Check page_maybe_dma_pinned() to see whether this
+			 *    page may have been pinned.
+			 *
+			 * The order of these steps is important to serialize
+			 * against the fast-gup code (gup_pte_range()) on the
+			 * pte check and try_grab_compound_head(), so that
+			 * we'll make sure either we'll capture that fast-gup
+			 * so we'll copy the pinned page here, or we'll fail
+			 * that fast-gup.
+			 */
+			if (is_write) {
+				ptep_set_wrprotect(src_mm, addr, src_pte);
+				/*
+				 * This is not needed for serializing fast-gup,
+				 * however always make it consistent with
+				 * src_pte, since we'll need it when current
+				 * page is not pinned.
+				 */
+				pte = pte_wrprotect(pte);
+			}
+
+			if (atomic_read(&src_mm->has_pinned) &&
+			    page_maybe_dma_pinned(page)) {
+				struct page *new_page = *prealloc;
+
+				/*
+				 * This is possibly a pinned page, we need to copy.
+				 * Safe to release the write bit if necessary.
+				 */
+				if (is_write)
+					set_pte_at(src_mm, addr, src_pte,
+						   pte_mkwrite(pte));
+
+				/* If we don't have a pre-allocated page, ask */
+				if (!new_page)
+					return -EAGAIN;
+
+				/*
+				 * We have a prealloc page, all good!  Take it
+				 * over and copy the page & arm it.
+				 */
+				*prealloc = NULL;
+				copy_user_highpage(new_page, page, addr, vma);
+				__SetPageUptodate(new_page);
+				pte = mk_pte(new_page, new->vm_page_prot);
+				pte = pte_sw_mkyoung(pte);
+				pte = maybe_mkwrite(pte_mkdirty(pte), new);
+				page_add_new_anon_rmap(new_page, new, addr, false);
+				rss[mm_counter(new_page)]++;
+				set_pte_at(dst_mm, addr, dst_pte, pte);
+				return 0;
+			}
+
+			/*
+			 * Logically we should recover the wrprotect() for
+			 * fast-gup; however, reaching here also means we
+			 * actually need to wrprotect() it again for cow.
+			 * Simply keep everything.  Note that there's another
+			 * chunk of cow logic below, but we still need that
+			 * for the !page case.
+			 */
+		}
+		get_page(page);
+		page_dup_rmap(page, false);
+		rss[mm_counter(page)]++;
+	}
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -807,14 +901,27 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
-	page = vm_normal_page(vma, addr, pte);
-	if (page) {
-		get_page(page);
-		page_dup_rmap(page, false);
-		rss[mm_counter(page)]++;
+	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
+static inline struct page *
+page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
+		   unsigned long addr)
+{
+	struct page *new_page;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+	if (!new_page)
+		return NULL;
+
+	if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		return NULL;
 	}
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
 
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return new_page;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -825,16 +932,20 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress = 0;
+	int progress, ret = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	struct page *prealloc = NULL;
 
 again:
+	progress = 0;
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
-		return -ENOMEM;
+	if (!dst_pte) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	src_pte = pte_offset_map(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -866,8 +977,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			progress += 8;
 			continue;
 		}
-		copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
-				 vma, addr, rss);
+		/* copy_present_pte() will clear `*prealloc' if consumed */
+		ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+				       vma, new, addr, rss, &prealloc);
+		/*
+		 * If we need a pre-allocated page for this pte, drop the
+		 * locks, allocate, and try again.
+		 */
+		if (unlikely(ret == -EAGAIN))
+			break;
+		if (unlikely(prealloc)) {
+			/*
+			 * pre-alloc page cannot be reused next time, so as
+			 * to strictly follow mempolicy (e.g., alloc_page_vma()
+			 * will allocate page according to address).  This
+			 * could only happen if one pinned pte changed.
+			 */
+			put_page(prealloc);
+			prealloc = NULL;
+		}
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -879,13 +1007,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	cond_resched();
 
 	if (entry.val) {
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		entry.val = 0;
+	} else if (ret) {
+		WARN_ON_ONCE(ret != -EAGAIN);
+		prealloc = page_copy_prealloc(src_mm, vma, addr);
+		if (!prealloc)
 			return -ENOMEM;
-		progress = 0;
+		/* We've captured and resolved the error. Reset, try again. */
+		ret = 0;
 	}
 	if (addr != end)
 		goto again;
-	return 0;
+out:
+	if (unlikely(prealloc))
+		put_page(prealloc);
+	return ret;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork()
  2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
                   ` (2 preceding siblings ...)
  2020-09-25 22:25 ` [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
@ 2020-09-25 22:26 ` Peter Xu
  2020-09-27 19:35 ` [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2020-09-25 22:26 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Jason Gunthorpe, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

Pinned pages shouldn't be write-protected when fork() happens, because a
follow-up copy-on-write on these pages could cause the pinned pages to be
replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected, so that future
handling will be done at the PTE level (with our latest changes, each of the
small pages will be copied).  We achieve this by letting copy_huge_pmd()
return -EAGAIN for pinned pages, so that we'll fall through in
copy_pmd_range() and finally land in the next copy_pte_range() call.

Huge PUDs are even more special - so far they do not support anonymous pages.
But they can be handled the same way as huge PMDs, even if splitting a huge
PUD means erasing the PUD entries.  That guarantees the follow-up fault-ins
will remap the same pages in either the parent or the child later.

This might not be the most efficient way, but it should be easy and clean
enough.  It should be fine, since we're tackling a very rare case just to make
sure userspace that pinned some thps will still work even without
MADV_DONTFORK after they fork().
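
For reference, the -EAGAIN fallthrough relies on the existing caller logic in
copy_pmd_range(), which looks roughly like this (simplified from the current
tree; not part of this patch):

    if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
            int err;

            VM_BUG_ON_VMA(next - addr != HPAGE_PMD_SIZE, vma);
            err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr, vma);
            if (err == -ENOMEM)
                    return -ENOMEM;
            if (!err)
                    continue;
            /* -EAGAIN: the huge pmd was split; fall through to copy_pte_range() */
    }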

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faadc449cca5..da397779a6d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1074,6 +1074,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+
+	/*
+	 * If this page is a potentially pinned page, split and retry the fault
+	 * with smaller page size.  Normally this should not happen because the
+	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
+	 * best effort to make sure the pinned pages won't be replaced by another
+	 * random page during the coming copy-on-write.
+	 */
+	if (unlikely(is_cow_mapping(vma->vm_flags) &&
+		     atomic_read(&src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(src_page))) {
+		pte_free(dst_mm, pgtable);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
+		return -EAGAIN;
+	}
+
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1177,6 +1195,16 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
+	/* Please refer to comments in copy_huge_pmd() */
+	if (unlikely(is_cow_mapping(vma->vm_flags) &&
+		     atomic_read(&src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(pud_page(pud)))) {
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pud(vma, src_pud, addr);
+		return -EAGAIN;
+	}
+
 	pudp_set_wrprotect(src_mm, addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
 	set_pud_at(dst_mm, addr, dst_pud, pud);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
  2020-09-25 22:25 ` [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
@ 2020-09-26 23:23   ` Jason Gunthorpe
  2020-09-27  0:04     ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2020-09-26 23:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, John Hubbard, Andrew Morton,
	Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai,
	Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

On Fri, Sep 25, 2020 at 06:25:59PM -0400, Peter Xu wrote:
> -static inline void
> +/*
> + * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> + * is required to copy this pte.
> + */
> +static inline int
>  copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> -		unsigned long addr, int *rss)
> +		struct vm_area_struct *new,
> +		unsigned long addr, int *rss, struct page **prealloc)
>  {
>  	unsigned long vm_flags = vma->vm_flags;
>  	pte_t pte = *src_pte;
>  	struct page *page;
>  
> +	page = vm_normal_page(vma, addr, pte);
> +	if (page) {
> +		if (is_cow_mapping(vm_flags)) {
> +			bool is_write = pte_write(pte);

Very minor, but for readability I would have liked this chunk put in a
function 'copy_normal_page' with the src/dst naming

> +
> +				/*
> +				 * We have a prealloc page, all good!  Take it
> +				 * over and copy the page & arm it.
> +				 */
> +				*prealloc = NULL;
> +				copy_user_highpage(new_page, page, addr, vma);
> +				__SetPageUptodate(new_page);
> +				pte = mk_pte(new_page, new->vm_page_prot);
> +				pte = pte_sw_mkyoung(pte);

Linus's version doesn't do pte_sw_mkyoung(), but it looks OK to have it

> +				pte = maybe_mkwrite(pte_mkdirty(pte), new);

maybe_mkwrite() was not in Linus's version, but is in
wp_page_copy(). It seemed like mk_pte() should set the proper write
bit already from the vm_page_prot? Perhaps this is harmless but
redundant?

> +				page_add_new_anon_rmap(new_page, new, addr, false);
> +				rss[mm_counter(new_page)]++;
> +				set_pte_at(dst_mm, addr, dst_pte, pte);

Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
wp_page_copy()

Didn't think of anything profound to say, looks good thanks!

I'll forward this for testing as well; there are some holidays next week, so I
may have been optimistic in hoping for results by Monday.

Jason


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
  2020-09-26 23:23   ` Jason Gunthorpe
@ 2020-09-27  0:04     ` Linus Torvalds
  2020-09-27  4:09       ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-09-27  0:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Linux Kernel Mailing List, Linux-MM, John Hubbard,
	Andrew Morton, Christoph Hellwig, Yang Shi, Oleg Nesterov,
	Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Jann Horn,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

On Sat, Sep 26, 2020 at 4:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> Linus's version doesn't do pte_sw_mkyoung(), but looks OK to have it

I don't think it matters. But I don't think it should make it young,
since there's no access, but it's not like it's a big deal.

> > +                             pte = maybe_mkwrite(pte_mkdirty(pte), new);
>
> maybe_mkwrite() was not in Linus's version but it is in wp_page_copy().

Actually, it is in my version too, just in a different form.

I did it using

        if (vma->vm_flags & VM_WRITE)
                *src_pte = pte_mkwrite(*src_pte);

instead, i.e. avoiding the write to src_pte if it wasn't a writable vma
(and I had checked that it was dirty and not done this at all if not,
so no need for the mkdirty).

> It seemed like mk_pte() should set the proper write
> bit already from the vm_page_prot?

No, vm_page_prot won't have the writable bit for a COW mapping.

The write bit gets set when the write happens (which may be on the
first access, of course), by the code that makes sure it's a private
copy.

> Perhaps this is harmless but redundant?

No, the pte_mkwrite() is required in some form, whether it's that
"maybe_mkwrite()" that looks at the vm flags, or in that explicit
form.
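
For reference, maybe_mkwrite() is roughly this helper from include/linux/mm.h,
so the two forms end up equivalent for a writable vma:

    static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
    {
            if (likely(vma->vm_flags & VM_WRITE))
                    pte = pte_mkwrite(pte);
            return pte;
    }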

> > +                             page_add_new_anon_rmap(new_page, new, addr, false);
> > +                             rss[mm_counter(new_page)]++;
> > +                             set_pte_at(dst_mm, addr, dst_pte, pte);
>
> Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
> wp_page_copy()

Yeah, I do think that is needed so that we have the new page on the
LRU and it gets properly evicted under memory pressure.

                   Linus


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
  2020-09-27  0:04     ` Linus Torvalds
@ 2020-09-27  4:09       ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2020-09-27  4:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, Linux Kernel Mailing List, Linux-MM,
	John Hubbard, Andrew Morton, Christoph Hellwig, Yang Shi,
	Oleg Nesterov, Kirill Tkhai, Kirill Shutemov, Hugh Dickins,
	Jann Horn, Michal Hocko, Jan Kara, Andrea Arcangeli,
	Leon Romanovsky

Linus Torvalds <torvalds@linux-foundation.org> wrote on Sat, Sep 26, 2020 at 20:05:

> > > +                             page_add_new_anon_rmap(new_page, new, addr, false);
> > > +                             rss[mm_counter(new_page)]++;
> > > +                             set_pte_at(dst_mm, addr, dst_pte, pte);
> >
> > Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
> > wp_page_copy()
>
> Yeah, I do think that is needed so that we have the new page on the
> LRU and it gets properly evicted under memory pressure.
>

Oops, yes we definitely need this one.

Thanks,


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/4] mm: Break COW for pinned pages during fork()
  2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
                   ` (3 preceding siblings ...)
  2020-09-25 22:26 ` [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
@ 2020-09-27 19:35 ` Linus Torvalds
  2020-09-29 11:02   ` Leon Romanovsky
  4 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-09-27 19:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Linux-MM, Jason Gunthorpe,
	John Hubbard, Andrew Morton, Christoph Hellwig, Yang Shi,
	Oleg Nesterov, Kirill Tkhai, Kirill Shutemov, Hugh Dickins,
	Jann Horn, Michal Hocko, Jan Kara, Andrea Arcangeli,
	Leon Romanovsky

On Fri, Sep 25, 2020 at 3:26 PM Peter Xu <peterx@redhat.com> wrote:
>
> This series is majorly inspired by the previous discussion on the list [1],
> starting from the report from Jason on the rdma test failure.

Ok, this is now in my git tree with the changes I outlined in the other email.

> I tested it myself with fork() after vfio pinning a bunch of device pages,

.. but _my_ only testing was to just add a nasty hack that said that
all pages are pinned, and made fork() much slower, but hey, it at
least tests the preallocation paths etc. And I'm not seeing any
obvious failures due to taking that slow-path that is supposed to be a
special case.

Let's hope this closes the rdma issues.

                Linus


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/4] mm: Break COW for pinned pages during fork()
  2020-09-27 19:35 ` [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Linus Torvalds
@ 2020-09-29 11:02   ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2020-09-29 11:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Xu, Linux Kernel Mailing List, Linux-MM, Jason Gunthorpe,
	John Hubbard, Andrew Morton, Christoph Hellwig, Yang Shi,
	Oleg Nesterov, Kirill Tkhai, Kirill Shutemov, Hugh Dickins,
	Jann Horn, Michal Hocko, Jan Kara, Andrea Arcangeli

On Sun, Sep 27, 2020 at 12:35:59PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 3:26 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > This series is majorly inspired by the previous discussion on the list [1],
> > starting from the report from Jason on the rdma test failure.
>
> Ok, this is now in my git tree with the changes I outlined in the other email.
>
> > I tested it myself with fork() after vfio pinning a bunch of device pages,
>
> .. but _my_ only testing was to just add a nasty hack that said that
> all pages are pinned, and made fork() much slower, but hey, it at
> least tests the preallocation paths etc. And I'm not seeing any
> obvious failures due to taking that slow-path that is supposed to be a
> special case.
>
> Let's hope this closes the rdma issues.

Hi Linus,

We tested your tree up to commit "fb0155a09b02 Merge tag 'nfs-for-5.9-3' of
git://git.linux-nfs.org/projects/trondmy/linux-nfs" and our RDMA tests passed.

Thanks

>
>                 Linus


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range()
  2020-09-25 22:25 ` [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
@ 2020-09-30 13:30   ` Kirill A. Shutemov
  2020-09-30 17:05     ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2020-09-30 13:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Jason Gunthorpe, John Hubbard,
	Andrew Morton, Christoph Hellwig, Yang Shi, Oleg Nesterov,
	Kirill Tkhai, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

On Fri, Sep 25, 2020 at 06:25:58PM -0400, Peter Xu wrote:
> This prepares for the future work to trigger early cow on pinned pages during
> fork().  No functional change intended.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h |  2 +-
>  kernel/fork.c      |  2 +-
>  mm/memory.c        | 14 +++++++++-----
>  3 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b2f370f0b420..ae914be004b3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1646,7 +1646,7 @@ struct mmu_notifier_range;
>  void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>  		unsigned long end, unsigned long floor, unsigned long ceiling);
>  int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
> -			struct vm_area_struct *vma);
> +		    struct vm_area_struct *vma, struct vm_area_struct *new);

It makes dst/src mm_struct arguments redundant. There's always vma->vm_mm.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range()
  2020-09-30 13:30   ` Kirill A. Shutemov
@ 2020-09-30 17:05     ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2020-09-30 17:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, Jason Gunthorpe, John Hubbard,
	Andrew Morton, Christoph Hellwig, Yang Shi, Oleg Nesterov,
	Kirill Tkhai, Hugh Dickins, Jann Horn, Linus Torvalds,
	Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky

On Wed, Sep 30, 2020 at 04:30:55PM +0300, Kirill A. Shutemov wrote:
> >  int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
> > -			struct vm_area_struct *vma);
> > +		    struct vm_area_struct *vma, struct vm_area_struct *new);
> 
> It makes dst/src mm_struct arguments redundant. There's always vma->vm_mm.

Indeed.  I'll post a patch later for it.  Thanks.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-09-30 17:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
2020-09-25 22:25 ` [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned Peter Xu
2020-09-25 22:25 ` [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
2020-09-30 13:30   ` Kirill A. Shutemov
2020-09-30 17:05     ` Peter Xu
2020-09-25 22:25 ` [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
2020-09-26 23:23   ` Jason Gunthorpe
2020-09-27  0:04     ` Linus Torvalds
2020-09-27  4:09       ` Peter Xu
2020-09-25 22:26 ` [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
2020-09-27 19:35 ` [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Linus Torvalds
2020-09-29 11:02   ` Leon Romanovsky
