* [PATCH v3 0/5] mm/hugetlb: Early cow on fork, and a few cleanups
@ 2021-02-05 16:54 Peter Xu
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton

v3:
- rebase to linux-next/akpm, switch to the new HPAGE helpers [MikeK]
- correct the error check for alloc_huge_page(); tested it this time to make
  sure fork() fails gracefully on overcommit [MikeK]
- move the page copy out of the pgtable lock: this changed quite a bit of the
  logic in the last patch; prealloc is dropped since I found it easier to
  understand without looping at all [MikeK]

v2:
- pass 1 as the last param of alloc_huge_page() [Mike]
- reduce the comments, unify the comment in one place [Linus]
- add r-bs from Mike and Miaohe

---- original cover letter ----

As reported by Gal [1], we still miss the code to handle early cow for the
hugetlb case.  Again, to me it still feels odd to fork() after using a few
huge pages, especially if they're privately mapped.  However I do agree with
Gal and Jason that we should have it, since it completes the early-cow-on-fork
effort, and it still fixes setups where the buffers are not well under control
and MADV_DONTFORK is not easy to apply.
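
As context, the userspace workaround this series makes unnecessary is marking
the pinned buffer with MADV_DONTFORK so fork() skips it entirely.  A minimal
sketch of that workaround (illustrative only, not part of the series):

	#include <stddef.h>
	#include <stdio.h>
	#include <sys/mman.h>

	/* Ask the kernel not to duplicate this range into children.
	 * This works only when the application controls the buffer,
	 * which is exactly what cannot always be guaranteed. */
	static int dont_fork_buffer(void *buf, size_t len)
	{
		if (madvise(buf, len, MADV_DONTFORK) != 0) {
			perror("madvise(MADV_DONTFORK)");
			return -1;
		}
		return 0;
	}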

The first two patches (1-2) are cleanups I noticed when reading the hugetlb
reserve map code.  I think they're good to have, but they're not necessary
for fixing the fork issue.

The last three patches (3-5) are the real fix.

I tested this with a fork() after some vfio-pci assignment, so I'm pretty
sure the page copy path triggers (the pages are accounted right after the
fork()), but I didn't verify the data since the card I assigned is some
random NIC.  Gal, please feel free to try this if you have a better way to
verify the series.

  https://github.com/xzpeter/linux/tree/fork-cow-pin-huge

Please review, thanks!

[1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/

Peter Xu (5):
  hugetlb: Dedup the code to add a new file_region
  hugetlb: Break earlier in add_reservation_in_range() when we can
  mm: Introduce page_needs_cow_for_dma() for deciding whether cow
  mm: Use is_cow_mapping() across tree where proper
  hugetlb: Do early cow when page pinned on src mm

 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c |   4 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |   2 +-
 fs/proc/task_mmu.c                         |   2 -
 include/linux/mm.h                         |  21 ++++
 mm/huge_memory.c                           |   8 +-
 mm/hugetlb.c                               | 123 +++++++++++++++------
 mm/internal.h                              |   5 -
 mm/memory.c                                |   7 +-
 8 files changed, 117 insertions(+), 55 deletions(-)

-- 
2.26.2




* [PATCH v3 1/5] hugetlb: Dedup the code to add a new file_region
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton

Introduce the hugetlb_resv_map_add() helper to add a new file_region, rather
than duplicating similar code twice in add_reservation_in_range().

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 51 +++++++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ef278ecf7ff..ec8e29c805fe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -331,6 +331,24 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 	}
 }
 
+static inline long
+hugetlb_resv_map_add(struct resv_map *map, struct file_region *rg, long from,
+		     long to, struct hstate *h, struct hugetlb_cgroup *cg,
+		     long *regions_needed)
+{
+	struct file_region *nrg;
+
+	if (!regions_needed) {
+		nrg = get_file_region_entry_from_cache(map, from, to);
+		record_hugetlb_cgroup_uncharge_info(cg, h, map, nrg);
+		list_add(&nrg->link, rg->link.prev);
+		coalesce_file_region(map, nrg);
+	} else
+		*regions_needed += 1;
+
+	return to - from;
+}
+
 /*
  * Must be called with resv->lock held.
  *
@@ -346,7 +364,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	long add = 0;
 	struct list_head *head = &resv->regions;
 	long last_accounted_offset = f;
-	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+	struct file_region *rg = NULL, *trg = NULL;
 
 	if (regions_needed)
 		*regions_needed = 0;
@@ -375,18 +393,11 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 		/* Add an entry for last_accounted_offset -> rg->from, and
 		 * update last_accounted_offset.
 		 */
-		if (rg->from > last_accounted_offset) {
-			add += rg->from - last_accounted_offset;
-			if (!regions_needed) {
-				nrg = get_file_region_entry_from_cache(
-					resv, last_accounted_offset, rg->from);
-				record_hugetlb_cgroup_uncharge_info(h_cg, h,
-								    resv, nrg);
-				list_add(&nrg->link, rg->link.prev);
-				coalesce_file_region(resv, nrg);
-			} else
-				*regions_needed += 1;
-		}
+		if (rg->from > last_accounted_offset)
+			add += hugetlb_resv_map_add(resv, rg,
+						    last_accounted_offset,
+						    rg->from, h, h_cg,
+						    regions_needed);
 
 		last_accounted_offset = rg->to;
 	}
@@ -394,17 +405,9 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	/* Handle the case where our range extends beyond
 	 * last_accounted_offset.
 	 */
-	if (last_accounted_offset < t) {
-		add += t - last_accounted_offset;
-		if (!regions_needed) {
-			nrg = get_file_region_entry_from_cache(
-				resv, last_accounted_offset, t);
-			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
-			list_add(&nrg->link, rg->link.prev);
-			coalesce_file_region(resv, nrg);
-		} else
-			*regions_needed += 1;
-	}
+	if (last_accounted_offset < t)
+		add += hugetlb_resv_map_add(resv, rg, last_accounted_offset,
+					    t, h, h_cg, regions_needed);
 
 	VM_BUG_ON(add < 0);
 	return add;
-- 
2.26.2



* [PATCH v3 2/5] hugetlb: Break earlier in add_reservation_in_range() when we can
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton

All the regions maintained in the hugetlb reserve map are inclusive on "from"
but exclusive on "to".  We can break earlier even when rg->from==t, because
that already means there is no possible intersection.

This does not need a Fixes tag because when it happens (rg->from==t) we fail
to break out of the loop when we could, but the next thing we do is still add
the last file_region we need and quit the loop on the next iteration.  So
this change is not a bugfix (the old code still runs okay, iiuc); we'd just
better touch it up to make the logic sane.
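
To make the half-open semantics concrete (an illustration, not code from the
patch): two ranges [f, t) and [rg->from, rg->to) can only intersect when
rg->from < t, so rg->from == t already guarantees no overlap.

	/* Illustrative only: half-open ranges [f, t) and [from, to)
	 * overlap iff from < t && f < to.  A region with from == t
	 * therefore starts at or past our range, so we can break. */
	static inline bool ranges_overlap(long f, long t, long from, long to)
	{
		return from < t && f < to;
	}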

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec8e29c805fe..71ccec5c3817 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -387,7 +387,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 		/* When we find a region that starts beyond our range, we've
 		 * finished.
 		 */
-		if (rg->from > t)
+		if (rg->from >= t)
 			break;
 
 		/* Add an entry for last_accounted_offset -> rg->from, and
-- 
2.26.2



* [PATCH v3 3/5] mm: Introduce page_needs_cow_for_dma() for deciding whether cow
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton

We've got quite a few places (pte, pmd, pud) that explicitly check whether we
should break the cow right now during fork().  It's easier to provide a
helper, especially before we do the same thing for hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.  It
actually suits mm.h better, since internal.h is mm/-only while mm.h is
exported to the whole kernel.  With that, expect another patch to use
is_cow_mapping() wherever we can across the kernel, since we use the check
quite a lot but it's always open-coded against the VM_* flags.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 21 +++++++++++++++++++++
 mm/huge_memory.c   |  8 ++------
 mm/internal.h      |  5 -----
 mm/memory.c        |  7 +------
 4 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 904e2517cd45..2e555d57631f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1314,6 +1314,27 @@ static inline bool page_maybe_dma_pinned(struct page *page)
 		GUP_PIN_COUNTING_BIAS;
 }
 
+static inline bool is_cow_mapping(vm_flags_t flags)
+{
+	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}
+
+/*
+ * This should most likely only be called during fork() to see whether we
+ * should break the cow immediately for a page on the src mm.
+ */
+static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
+					  struct page *page)
+{
+	if (!is_cow_mapping(vma->vm_flags))
+		return false;
+
+	if (!atomic_read(&vma->vm_mm->has_pinned))
+		return false;
+
+	return page_maybe_dma_pinned(page);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 987cf5e4cf90..57f5c7d3a328 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1097,9 +1097,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * best effort that the pinned pages won't be replaced by another
 	 * random page during the coming copy-on-write.
 	 */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(src_page))) {
+	if (unlikely(page_needs_cow_for_dma(vma, src_page))) {
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -1211,9 +1209,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/* Please refer to comments in copy_huge_pmd() */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(pud_page(pud)))) {
+	if (unlikely(page_needs_cow_for_dma(vma, pud_page(pud)))) {
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pud(vma, src_pud, addr);
diff --git a/mm/internal.h b/mm/internal.h
index 8e9c660f33ca..a24847e48081 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
  */
 #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
 
-static inline bool is_cow_mapping(vm_flags_t flags)
-{
-	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-}
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
diff --git a/mm/memory.c b/mm/memory.c
index 9d68a2340589..cd28871be559 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -812,9 +812,6 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct page *new_page;
 
-	if (!is_cow_mapping(src_vma->vm_flags))
-		return 1;
-
 	/*
 	 * What we want to do is to check whether this page may
 	 * have been pinned by the parent process.  If so,
@@ -828,9 +825,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	 * the page count. That might give false positives for
 	 * for pinning, but it will work correctly.
 	 */
-	if (likely(!atomic_read(&src_mm->has_pinned)))
-		return 1;
-	if (likely(!page_maybe_dma_pinned(page)))
+	if (likely(!page_needs_cow_for_dma(src_vma, page)))
 		return 1;
 
 	new_page = *prealloc;
-- 
2.26.2



* [PATCH v3 4/5] mm: Use is_cow_mapping() across tree where proper
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton,
	VMware Graphics, Roland Scheidegger, David Airlie, Daniel Vetter,
	Alexey Dobriyan

Now that is_cow_mapping() is exported in mm.h, replace the open-coded checks
elsewhere in the tree with the new helper.

Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c | 4 +---
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   | 2 +-
 fs/proc/task_mmu.c                         | 2 --
 mm/hugetlb.c                               | 4 +---
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
index 0a900afc66ff..45c9c6a7f1d6 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_page_dirty.c
@@ -500,8 +500,6 @@ vm_fault_t vmw_bo_vm_huge_fault(struct vm_fault *vmf,
 	vm_fault_t ret;
 	pgoff_t fault_page_size;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
-	bool is_cow_mapping =
-		(vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	switch (pe_size) {
 	case PE_SIZE_PMD:
@@ -518,7 +516,7 @@ vm_fault_t vmw_bo_vm_huge_fault(struct vm_fault *vmf,
 	}
 
 	/* Always do write dirty-tracking and COW on PTE level. */
-	if (write && (READ_ONCE(vbo->dirty) || is_cow_mapping))
+	if (write && (READ_ONCE(vbo->dirty) || is_cow_mapping(vma->vm_flags)))
 		return VM_FAULT_FALLBACK;
 
 	ret = ttm_bo_vm_reserve(bo, vmf);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
index 3c03b1746661..cb9975889e2f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c
@@ -49,7 +49,7 @@ int vmw_mmap(struct file *filp, struct vm_area_struct *vma)
 	vma->vm_ops = &vmw_vm_ops;
 
 	/* Use VM_PFNMAP rather than VM_MIXEDMAP if not a COW mapping */
-	if ((vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) != VM_MAYWRITE)
+	if (!is_cow_mapping(vma->vm_flags))
 		vma->vm_flags = (vma->vm_flags & ~VM_MIXEDMAP) | VM_PFNMAP;
 
 	return 0;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 602e3a52884d..96c1682025f9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1036,8 +1036,6 @@ struct clear_refs_private {
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 
-#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
-
 static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	struct page *page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 71ccec5c3817..620700f05ff4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3733,15 +3733,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	int cow;
+	int cow = is_cow_mapping(vma->vm_flags);
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
-	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-
 	if (cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
 					vma->vm_start,
-- 
2.26.2



* [PATCH v3 5/5] hugetlb: Do early cow when page pinned on src mm
From: Peter Xu @ 2021-02-05 16:54 UTC
  To: linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Mike Kravetz, peterx, Kirill Tkhai, Wei Zhang, Andrew Morton

This is the last missing piece of the cow-during-fork effort for when pinned
pages are found.  One can reference 70e806e4e645 ("mm: Do early cow for
pinned pages during fork() for ptes", 2020-09-27) for more information; we do
a similar thing here, but for hugetlb rather than ptes.
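
The shape of the new path, condensed from the diff below into a pseudocode
outline (a sketch, not the exact code):

	/*
	 * In the copy loop of copy_hugetlb_page_range():
	 *
	 *   if (page_needs_cow_for_dma(vma, ptepage)) {
	 *           drop src_ptl and dst_ptl;      // alloc/copy may sleep
	 *           new = alloc_huge_page(vma, addr, 1);  // skip reserves
	 *           copy_user_huge_page(new, ptepage, addr, vma, npages);
	 *           retake the locks and re-read the src pte;
	 *           if (the pte changed under us)   // lost a race
	 *                   put_page(new), retry this entry;
	 *           else
	 *                   hugetlb_install_page(vma, dst_pte, addr, new);
	 *   }
	 */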

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 620700f05ff4..7c1a0ecc130e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3727,6 +3727,18 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 		return false;
 }
 
+static void
+hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
+		     struct page *new_page)
+{
+	__SetPageUptodate(new_page);
+	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
+	hugepage_add_new_anon_rmap(new_page, vma, addr);
+	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
+	ClearHPageRestoreReserve(new_page);
+	SetHPageMigratable(new_page);
+}
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
@@ -3736,6 +3748,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow = is_cow_mapping(vma->vm_flags);
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	unsigned long npages = pages_per_huge_page(h);
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
@@ -3784,6 +3797,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 		dst_entry = huge_ptep_get(dst_pte);
+again:
 		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
 			/*
 			 * Skip if src entry none.  Also, skip in the
@@ -3807,6 +3821,52 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
+			entry = huge_ptep_get(src_pte);
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+
+			/*
+			 * This is a rare case where we see pinned hugetlb
+			 * pages while they're prone to COW.  We need to do the
+			 * COW earlier during fork.
+			 *
+			 * When pre-allocating the page or copying data, we
+			 * need to be without the pgtable locks since we could
+			 * sleep during the process.
+			 */
+			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+				pte_t src_pte_old = entry;
+				struct page *new;
+
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				/* Do not use reserve as it's private owned */
+				new = alloc_huge_page(vma, addr, 1);
+				if (IS_ERR(new)) {
+					put_page(ptepage);
+					ret = PTR_ERR(new);
+					break;
+				}
+				copy_user_huge_page(new, ptepage, addr, vma,
+						    npages);
+				put_page(ptepage);
+
+				/* Install the new huge page if src pte stable */
+				dst_ptl = huge_pte_lock(h, dst, dst_pte);
+				src_ptl = huge_pte_lockptr(h, src, src_pte);
+				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+				entry = huge_ptep_get(src_pte);
+				if (!pte_same(src_pte_old, entry)) {
+					put_page(new);
+					/* dst_entry won't change as in child */
+					goto again;
+				}
+				hugetlb_install_page(vma, dst_pte, addr, new);
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				continue;
+			}
+
 			if (cow) {
 				/*
 				 * No need to notify as we are downgrading page
@@ -3817,12 +3877,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
 			}
-			entry = huge_ptep_get(src_pte);
-			ptepage = pte_page(entry);
-			get_page(ptepage);
+
 			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(pages_per_huge_page(h), dst);
+			hugetlb_count_add(npages, dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-- 
2.26.2



* Re: [PATCH v3 5/5] hugetlb: Do early cow when page pinned on src mm
From: Mike Kravetz @ 2021-02-09  0:09 UTC
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Gal Pressman, Andrea Arcangeli, Christoph Hellwig, Miaohe Lin,
	Kirill Shutemov, Jann Horn, Matthew Wilcox, Jan Kara,
	Jason Gunthorpe, Linus Torvalds, Mike Rapoport, David Gibson,
	Kirill Tkhai, Wei Zhang, Andrew Morton

On 2/5/21 8:54 AM, Peter Xu wrote:
> This is the last missing piece of the cow-during-fork effort for when pinned
> pages are found.  One can reference 70e806e4e645 ("mm: Do early cow for
> pinned pages during fork() for ptes", 2020-09-27) for more information; we do
> a similar thing here, but for hugetlb rather than ptes.

Thanks for all the changes, the patch looks much better.

I did not look at 70e806e4e645 in detail until now.  That commit had the
'write protect trick', which was removed in subsequent commits.  It took me a
bit of git history tracking to figure out the state of that code today and
the reasons for the subsequent changes.  I guess that was a good way to
educate me. :)

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 62 insertions(+), 4 deletions(-)

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


* Re: [PATCH v3 5/5] hugetlb: Do early cow when page pinned on src mm
From: Peter Xu @ 2021-02-09  2:55 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Gal Pressman, Andrea Arcangeli,
	Christoph Hellwig, Miaohe Lin, Kirill Shutemov, Jann Horn,
	Matthew Wilcox, Jan Kara, Jason Gunthorpe, Linus Torvalds,
	Mike Rapoport, David Gibson, Kirill Tkhai, Wei Zhang,
	Andrew Morton

On Mon, Feb 08, 2021 at 04:09:26PM -0800, Mike Kravetz wrote:
> On 2/5/21 8:54 AM, Peter Xu wrote:
> > This is the last missing piece of the cow-during-fork effort for when pinned
> > pages are found.  One can reference 70e806e4e645 ("mm: Do early cow for
> > pinned pages during fork() for ptes", 2020-09-27) for more information; we do
> > a similar thing here, but for hugetlb rather than ptes.
> 
> Thanks for all the changes, the patch looks much better.
> 
> I did not look at 70e806e4e645 in detail until now.  That commit had the
> 'write protect trick' which was removed in subsequent commits.  It took me
> a bit of git history tracking to figure out the state of that code today and
> the reasons for the subsequent changes.  I guess that was a good way to
> educate me. :) 

Thanks for looking into those details.  I didn't expect that to happen,
since after Jason's rework in 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork", 2020-12-15) we can ignore the wr-protect idea
as a whole.  I referenced 70e806e4e645 more for the idea of why we do this,
and copy_present_page() for how it is generally implemented.  That gup-fast
race is definitely tricky on its own.
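
(For reference, the mechanism in 57efa1fe5957 is roughly the following; a
simplified sketch rather than the verbatim kernel code:)

	/* fork() side: keep mm->write_protect_seq odd while the page
	 * tables are being copied and wr-protected. */
	mmap_write_lock(src_mm);
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* ... copy page tables, wr-protecting cow ptes ... */
	raw_write_seqcount_end(&src_mm->write_protect_seq);

	/* gup-fast side: sample the count first, recheck after pinning. */
	seq = raw_read_seqcount(&current->mm->write_protect_seq);
	if (seq & 1)
		return 0;	/* fork() in flight, fall back to slow gup */
	/* ... lockless walk, pin the pages ... */
	if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
		/* raced with fork(): unpin and fall back to slow gup */
	}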

> 
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/hugetlb.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 62 insertions(+), 4 deletions(-)
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks!

I'll post a new version very soon with your r-b, and also with a compile
warning fixed in the other patch, as reported by Gal.

-- 
Peter Xu


