* [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups
@ 2021-02-03 21:08 Peter Xu
  2021-02-03 21:08 ` [PATCH 1/4] hugetlb: Dedup the code to add a new file_region Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 21:08 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, peterx, Christoph Hellwig, Andrea Arcangeli,
	Jan Kara, Kirill Shutemov, David Gibson, Mike Rapoport,
	Mike Kravetz, Kirill Tkhai, Jann Horn, Andrew Morton

As reported by Gal [1], we are still missing the code to handle early COW
during fork() for the hugetlb case.  Again, it still feels odd to me to fork()
after using a few huge pages, especially if they're privately mapped..  However
I do agree with Gal and Jason that we should still have this, since it at least
completes the early-COW-on-fork effort, and it still fixes cases where the
buffers are not well under control and MADV_DONTFORK is not easy to apply.
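
For context, here is a minimal userspace sketch of the MADV_DONTFORK
workaround mentioned above (illustrative only, assuming a 2MB hugetlb buffer
that the application fully controls):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* assume one 2MB huge page */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... pin buf for DMA here (e.g. via vfio or an RDMA MR) ... */

	/*
	 * Exclude the pinned buffer from the child entirely, so no COW
	 * can ever move the DMA target out from under the parent.  This
	 * only works when the application controls the buffer.
	 */
	if (madvise(buf, len, MADV_DONTFORK)) {
		perror("madvise");
		return 1;
	}

	if (fork() == 0)
		_exit(0);	/* the child has no mapping of buf */
	return 0;
}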

The first two patches (1-2) are some cleanups I noticed when reading the
hugetlb reserve map code.  I think they're good to have, but they're not
necessary for fixing the fork issue.

The last two patches (3-4) are the real fix.

I tested this with a fork() after some vfio-pci assignment, so I'm pretty sure
the page copy path triggers (the page is accounted right after the fork()),
but I didn't verify the data since the card I assigned is some random NIC.
Gal, please feel free to try this if you have a better way to verify the
series.
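
A rough sketch of such a data check (dma_snapshot() is a hypothetical
stand-in for reading back what the device currently sees, and buf is assumed
to be a hugetlb buffer already pinned for DMA):

#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: returns the byte the device currently sees at offset. */
extern unsigned char dma_snapshot(unsigned char *buf, size_t offset);

static int verify_early_cow(unsigned char *buf, size_t len)
{
	memset(buf, 0xaa, len);

	/* With this series, the child gets its own copy right here. */
	pid_t pid = fork();
	if (pid == 0)
		_exit(0);
	waitpid(pid, NULL, 0);

	/*
	 * Without early COW, this write would move the parent onto a
	 * fresh page while the device keeps DMA-ing into the original
	 * pinned page, which only the child mapped.  With the fix, the
	 * parent keeps the pinned page, so the device must see 0xbb.
	 */
	buf[0] = 0xbb;
	return dma_snapshot(buf, 0) == 0xbb ? 0 : -1;
}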

  https://github.com/xzpeter/linux/tree/fork-cow-pin-huge

Please review, thanks!

[1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/

Peter Xu (4):
  hugetlb: Dedup the code to add a new file_region
  hugetlb: Break earlier in add_reservation_in_range() when we can
  mm: Introduce page_needs_cow_for_dma() for deciding whether cow
  hugetlb: Do early cow when page pinned on src mm

 include/linux/mm.h |  21 ++++++++
 mm/huge_memory.c   |   8 +--
 mm/hugetlb.c       | 129 ++++++++++++++++++++++++++++++++++-----------
 mm/internal.h      |   5 --
 mm/memory.c        |   7 +--
 5 files changed, 123 insertions(+), 47 deletions(-)

-- 
2.26.2




* [PATCH 1/4] hugetlb: Dedup the code to add a new file_region
  2021-02-03 21:08 [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Peter Xu
@ 2021-02-03 21:08 ` Peter Xu
  2021-02-03 23:01   ` Mike Kravetz
  2021-02-04  1:59   ` Miaohe Lin
  2021-02-03 21:08 ` [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 21:08 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, peterx, Christoph Hellwig, Andrea Arcangeli,
	Jan Kara, Kirill Shutemov, David Gibson, Mike Rapoport,
	Mike Kravetz, Kirill Tkhai, Jann Horn, Andrew Morton

Introduce a hugetlb_resv_map_add() helper to add a new file_region rather than
duplicating similar code twice in add_reservation_in_range().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 51 +++++++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 18f6ee317900..d2859c2aecc9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -321,6 +321,24 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 	}
 }
 
+static inline long
+hugetlb_resv_map_add(struct resv_map *map, struct file_region *rg, long from,
+		     long to, struct hstate *h, struct hugetlb_cgroup *cg,
+		     long *regions_needed)
+{
+	struct file_region *nrg;
+
+	if (!regions_needed) {
+		nrg = get_file_region_entry_from_cache(map, from, to);
+		record_hugetlb_cgroup_uncharge_info(cg, h, map, nrg);
+		list_add(&nrg->link, rg->link.prev);
+		coalesce_file_region(map, nrg);
+	} else
+		*regions_needed += 1;
+
+	return to - from;
+}
+
 /*
  * Must be called with resv->lock held.
  *
@@ -336,7 +354,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	long add = 0;
 	struct list_head *head = &resv->regions;
 	long last_accounted_offset = f;
-	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+	struct file_region *rg = NULL, *trg = NULL;
 
 	if (regions_needed)
 		*regions_needed = 0;
@@ -365,18 +383,11 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 		/* Add an entry for last_accounted_offset -> rg->from, and
 		 * update last_accounted_offset.
 		 */
-		if (rg->from > last_accounted_offset) {
-			add += rg->from - last_accounted_offset;
-			if (!regions_needed) {
-				nrg = get_file_region_entry_from_cache(
-					resv, last_accounted_offset, rg->from);
-				record_hugetlb_cgroup_uncharge_info(h_cg, h,
-								    resv, nrg);
-				list_add(&nrg->link, rg->link.prev);
-				coalesce_file_region(resv, nrg);
-			} else
-				*regions_needed += 1;
-		}
+		if (rg->from > last_accounted_offset)
+			add += hugetlb_resv_map_add(resv, rg,
+						    last_accounted_offset,
+						    rg->from, h, h_cg,
+						    regions_needed);
 
 		last_accounted_offset = rg->to;
 	}
@@ -384,17 +395,9 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 	/* Handle the case where our range extends beyond
 	 * last_accounted_offset.
 	 */
-	if (last_accounted_offset < t) {
-		add += t - last_accounted_offset;
-		if (!regions_needed) {
-			nrg = get_file_region_entry_from_cache(
-				resv, last_accounted_offset, t);
-			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
-			list_add(&nrg->link, rg->link.prev);
-			coalesce_file_region(resv, nrg);
-		} else
-			*regions_needed += 1;
-	}
+	if (last_accounted_offset < t)
+		add += hugetlb_resv_map_add(resv, rg, last_accounted_offset,
+					    t, h, h_cg, regions_needed);
 
 	VM_BUG_ON(add < 0);
 	return add;
-- 
2.26.2



* [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can
  2021-02-03 21:08 [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Peter Xu
  2021-02-03 21:08 ` [PATCH 1/4] hugetlb: Dedup the code to add a new file_region Peter Xu
@ 2021-02-03 21:08 ` Peter Xu
  2021-02-04  0:45   ` Mike Kravetz
  2021-02-04  2:20   ` Miaohe Lin
  2021-02-03 21:08 ` [PATCH 3/4] mm: Introduce page_needs_cow_for_dma() for deciding whether cow Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 21:08 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, peterx, Christoph Hellwig, Andrea Arcangeli,
	Jan Kara, Kirill Shutemov, David Gibson, Mike Rapoport,
	Mike Kravetz, Kirill Tkhai, Jann Horn, Andrew Morton

All the regions maintained in the hugetlb reserve map are inclusive on "from"
but exclusive on "to".  We can break out earlier even when rg->from == t,
because a region starting exactly at t can no longer intersect [f, t).
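
For intuition, an illustrative half-open interval overlap test (not code from
the patch):

/*
 * Regions are half-open: [from, to).  Two half-open ranges intersect
 * iff each one starts before the other one ends, so a region with
 * rg->from == t starts exactly where [f, t) ends and cannot overlap.
 * E.g. f=2,t=5 covers offsets 2..4; from=5,to=8 covers 5..7: disjoint.
 */
static bool ranges_intersect(long f, long t, long from, long to)
{
	return f < to && from < t;
}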

This does not need a Fixes tag, because when rg->from == t the old code fails
to break out of the loop when it could, but the next thing it does is still
add the last file_region we need and quit the loop on the next iteration.  So
this change is not a bugfix (the old code still runs correctly, iiuc), but
it's worth touching up to make the loop logically sane.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d2859c2aecc9..9e6ea96bf33b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -377,7 +377,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 		/* When we find a region that starts beyond our range, we've
 		 * finished.
 		 */
-		if (rg->from > t)
+		if (rg->from >= t)
 			break;
 
 		/* Add an entry for last_accounted_offset -> rg->from, and
-- 
2.26.2



* [PATCH 3/4] mm: Introduce page_needs_cow_for_dma() for deciding whether cow
  2021-02-03 21:08 [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Peter Xu
  2021-02-03 21:08 ` [PATCH 1/4] hugetlb: Dedup the code to add a new file_region Peter Xu
  2021-02-03 21:08 ` [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can Peter Xu
@ 2021-02-03 21:08 ` Peter Xu
  2021-02-03 21:08 ` [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm Peter Xu
  2021-02-04 14:32 ` [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Gal Pressman
  4 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 21:08 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, peterx, Christoph Hellwig, Andrea Arcangeli,
	Jan Kara, Kirill Shutemov, David Gibson, Mike Rapoport,
	Mike Kravetz, Kirill Tkhai, Jann Horn, Andrew Morton

We've got quite a few places (pte, pmd, pud) that explicitly check whether we
should break the COW right now during fork().  It's easier to provide a
helper, especially before we do the same thing for hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.  It
actually suits mm.h more anyway, since internal.h is mm/-only while mm.h is
exported to the whole kernel.  With that, we should expect a follow-up patch
to use is_cow_mapping() wherever we can across the kernel, since the check is
used quite a lot but always open-coded against the VM_* flags.
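
For illustration, the open-coded pattern this would eventually replace, next
to the same check via the new helper (the wrapper names below are made up;
both forms are equivalent given the is_cow_mapping() definition added to mm.h
in this patch):

#include <linux/mm.h>

static bool vma_is_cow_open_coded(struct vm_area_struct *vma)
{
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

static bool vma_is_cow_with_helper(struct vm_area_struct *vma)
{
	return is_cow_mapping(vma->vm_flags);
}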

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 21 +++++++++++++++++++++
 mm/huge_memory.c   |  8 ++------
 mm/internal.h      |  5 -----
 mm/memory.c        |  7 +------
 4 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..6ea20721d349 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1291,6 +1291,27 @@ static inline bool page_maybe_dma_pinned(struct page *page)
 		GUP_PIN_COUNTING_BIAS;
 }
 
+static inline bool is_cow_mapping(vm_flags_t flags)
+{
+	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+}
+
+/*
+ * This should most likely only be called during fork() to see whether we
+ * should break the cow immediately for a page on the src mm.
+ */
+static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
+					  struct page *page)
+{
+	if (!is_cow_mapping(vma->vm_flags))
+		return false;
+
+	if (!atomic_read(&vma->vm_mm->has_pinned))
+		return false;
+
+	return page_maybe_dma_pinned(page);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..dbff6c7eda67 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1095,9 +1095,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * best effort that the pinned pages won't be replaced by another
 	 * random page during the coming copy-on-write.
 	 */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(src_page))) {
+	if (unlikely(page_needs_cow_for_dma(vma, src_page))) {
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
@@ -1209,9 +1207,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/* Please refer to comments in copy_huge_pmd() */
-	if (unlikely(is_cow_mapping(vma->vm_flags) &&
-		     atomic_read(&src_mm->has_pinned) &&
-		     page_maybe_dma_pinned(pud_page(pud)))) {
+	if (unlikely(page_needs_cow_for_dma(vma, pud_page(pud)))) {
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pud(vma, src_pud, addr);
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439f19..24eec93d0dac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -296,11 +296,6 @@ static inline unsigned int buddy_order(struct page *page)
  */
 #define buddy_order_unsafe(page)	READ_ONCE(page_private(page))
 
-static inline bool is_cow_mapping(vm_flags_t flags)
-{
-	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-}
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..b2849e1d4aab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -800,9 +800,6 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct page *new_page;
 
-	if (!is_cow_mapping(src_vma->vm_flags))
-		return 1;
-
 	/*
 	 * What we want to do is to check whether this page may
 	 * have been pinned by the parent process.  If so,
@@ -816,9 +813,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	 * the page count. That might give false positives for
 	 * for pinning, but it will work correctly.
 	 */
-	if (likely(!atomic_read(&src_mm->has_pinned)))
-		return 1;
-	if (likely(!page_maybe_dma_pinned(page)))
+	if (likely(!page_needs_cow_for_dma(src_vma, page)))
 		return 1;
 
 	new_page = *prealloc;
-- 
2.26.2



* [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm
  2021-02-03 21:08 [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Peter Xu
                   ` (2 preceding siblings ...)
  2021-02-03 21:08 ` [PATCH 3/4] mm: Introduce page_needs_cow_for_dma() for deciding whether cow Peter Xu
@ 2021-02-03 21:08 ` Peter Xu
  2021-02-03 21:15     ` Linus Torvalds
  2021-02-03 22:04   ` Mike Kravetz
  2021-02-04 14:32 ` [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Gal Pressman
  4 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 21:08 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, peterx, Christoph Hellwig, Andrea Arcangeli,
	Jan Kara, Kirill Shutemov, David Gibson, Mike Rapoport,
	Mike Kravetz, Kirill Tkhai, Jann Horn, Andrew Morton

This is the last missing piece of the COW-during-fork effort for when pinned
pages are found.  See 70e806e4e645 ("mm: Do early cow for pinned pages during
fork() for ptes", 2020-09-27) for more information, since we do a similar
thing here, just for hugetlb rather than ptes.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9e6ea96bf33b..931bf1a81c16 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3734,11 +3734,27 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 		return false;
 }
 
+static void
+hugetlb_copy_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
+		  struct page *old_page, struct page *new_page)
+{
+	struct hstate *h = hstate_vma(vma);
+	unsigned int psize = pages_per_huge_page(h);
+
+	copy_user_huge_page(new_page, old_page, addr, vma, psize);
+	__SetPageUptodate(new_page);
+	ClearPagePrivate(new_page);
+	set_page_huge_active(new_page);
+	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
+	hugepage_add_new_anon_rmap(new_page, vma, addr);
+	hugetlb_count_add(psize, vma->vm_mm);
+}
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
-	struct page *ptepage;
+	struct page *ptepage, *prealloc = NULL;
 	unsigned long addr;
 	int cow;
 	struct hstate *h = hstate_vma(vma);
@@ -3787,7 +3803,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		dst_entry = huge_ptep_get(dst_pte);
 		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
 			continue;
-
+again:
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
+			entry = huge_ptep_get(src_pte);
+			ptepage = pte_page(entry);
+			get_page(ptepage);
+
+			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+				/* This is very possibly a pinned huge page */
+				if (!prealloc) {
+					/*
+					 * Preallocate the huge page without
+					 * tons of locks since we could sleep.
+					 * Note: we can't use any reservation
+					 * because the page will be exclusively
+					 * owned by the child later.
+					 */
+					put_page(ptepage);
+					spin_unlock(src_ptl);
+					spin_unlock(dst_ptl);
+					prealloc = alloc_huge_page(vma, addr, 0);
+					if (!prealloc) {
+						/*
+						 * hugetlb_cow() seems to be
+						 * more careful here than us.
+						 * However for fork() we could
+						 * be strict not only because
+						 * no one should be referencing
+						 * the child mm yet, but also
+						 * if resources are rare we'd
+						 * better simply fail the
+						 * fork() even earlier.
+						 */
+						ret = -ENOMEM;
+						break;
+					}
+					goto again;
+				}
+				/*
+				 * We have page preallocated so that we can do
+				 * the copy right now.
+				 */
+				hugetlb_copy_page(vma, dst_pte, addr, ptepage,
+						  prealloc);
+				put_page(ptepage);
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				prealloc = NULL;
+				continue;
+			}
+
 			if (cow) {
 				/*
 				 * No need to notify as we are downgrading page
@@ -3826,9 +3890,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
 			}
-			entry = huge_ptep_get(src_pte);
-			ptepage = pte_page(entry);
-			get_page(ptepage);
+
 			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 			hugetlb_count_add(pages_per_huge_page(h), dst);
@@ -3842,6 +3904,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	else
 		i_mmap_unlock_read(mapping);
 
+	/* Free the preallocated page if not used at last */
+	if (prealloc)
+		put_page(prealloc);
+
 	return ret;
 }
 
-- 
2.26.2



* Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm
  2021-02-03 21:08 ` [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm Peter Xu
@ 2021-02-03 21:15     ` Linus Torvalds
  2021-02-03 22:04   ` Mike Kravetz
  1 sibling, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2021-02-03 21:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Linux-MM, Wei Zhang, Matthew Wilcox,
	Jason Gunthorpe, Gal Pressman, Christoph Hellwig,
	Andrea Arcangeli, Jan Kara, Kirill Shutemov, David Gibson,
	Mike Rapoport, Mike Kravetz, Kirill Tkhai, Jann Horn,
	Andrew Morton

On Wed, Feb 3, 2021 at 1:08 PM Peter Xu <peterx@redhat.com> wrote:
>
> This is the last missing piece of the COW-during-fork effort for when pinned
> pages are found.  See 70e806e4e645 ("mm: Do early cow for pinned pages
> during fork() for ptes", 2020-09-27) for more information, since we do a
> similar thing here, just for hugetlb rather than ptes.

No issues with the code itself, but..

Comments are good, but the comments inside this block of code actually
make the code *much* harder to read, because now the actual logic is
much more spread out and you can't see what it does so well.

> +                       if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> +                               /* This is very possibly a pinned huge page */
> +                               if (!prealloc) {
> +                                       /*
> +                                        * Preallocate the huge page without
> +                                        * tons of locks since we could sleep.
> +                                        * Note: we can't use any reservation
> +                                        * because the page will be exclusively
> +                                        * owned by the child later.
> +                                        */
> +                                       put_page(ptepage);
> +                                       spin_unlock(src_ptl);
> +                                       spin_unlock(dst_ptl);
> +                                       prealloc = alloc_huge_page(vma, addr, 0);
> +                                       if (!prealloc) {
> +                                               /*
> +                                                * hugetlb_cow() seems to be
> +                                                * more careful here than us.
> +                                                * However for fork() we could
> +                                                * be strict not only because
> +                                                * no one should be referencing
> +                                                * the child mm yet, but also
> +                                                * if resources are rare we'd
> +                                                * better simply fail the
> +                                                * fork() even earlier.
> +                                                */
> +                                               ret = -ENOMEM;
> +                                               break;
> +                                       }
> +                                       goto again;
> +                               }
> +                               /*
> +                                * We have page preallocated so that we can do
> +                                * the copy right now.
> +                                */
> +                               hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> +                                                 prealloc);
> +                               put_page(ptepage);
> +                               spin_unlock(src_ptl);
> +                               spin_unlock(dst_ptl);
> +                               prealloc = NULL;
> +                               continue;
> +                       }

Can you move the comment above the code? And I _think_ the prealloc
conditional could be split out into a helper function (which would help
more), but maybe there are too many variables for that to be
practical.

           Linus

* Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm
  2021-02-03 21:08 ` [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm Peter Xu
  2021-02-03 21:15     ` Linus Torvalds
@ 2021-02-03 22:04   ` Mike Kravetz
  2021-02-03 22:30     ` Peter Xu
  1 sibling, 1 reply; 15+ messages in thread
From: Mike Kravetz @ 2021-02-03 22:04 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, Christoph Hellwig, Andrea Arcangeli, Jan Kara,
	Kirill Shutemov, David Gibson, Mike Rapoport, Kirill Tkhai,
	Jann Horn, Andrew Morton

On 2/3/21 1:08 PM, Peter Xu wrote:
> This is the last missing piece of the COW-during-fork effort for when pinned
> pages are found.  See 70e806e4e645 ("mm: Do early cow for pinned pages
> during fork() for ptes", 2020-09-27) for more information, since we do a
> similar thing here, just for hugetlb rather than ptes.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 71 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9e6ea96bf33b..931bf1a81c16 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3734,11 +3734,27 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
>  		return false;
>  }
>  
> +static void
> +hugetlb_copy_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
> +		  struct page *old_page, struct page *new_page)
> +{
> +	struct hstate *h = hstate_vma(vma);
> +	unsigned int psize = pages_per_huge_page(h);
> +
> +	copy_user_huge_page(new_page, old_page, addr, vma, psize);
> +	__SetPageUptodate(new_page);
> +	ClearPagePrivate(new_page);
> +	set_page_huge_active(new_page);
> +	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
> +	hugepage_add_new_anon_rmap(new_page, vma, addr);
> +	hugetlb_count_add(psize, vma->vm_mm);
> +}
> +
>  int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			    struct vm_area_struct *vma)
>  {
>  	pte_t *src_pte, *dst_pte, entry, dst_entry;
> -	struct page *ptepage;
> +	struct page *ptepage, *prealloc = NULL;
>  	unsigned long addr;
>  	int cow;
>  	struct hstate *h = hstate_vma(vma);
> @@ -3787,7 +3803,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		dst_entry = huge_ptep_get(dst_pte);
>  		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
>  			continue;
> -
> +again:
>  		dst_ptl = huge_pte_lock(h, dst, dst_pte);
>  		src_ptl = huge_pte_lockptr(h, src, src_pte);
>  		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> @@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			}
>  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
>  		} else {
> +			entry = huge_ptep_get(src_pte);
> +			ptepage = pte_page(entry);
> +			get_page(ptepage);
> +
> +			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> +				/* This is very possibly a pinned huge page */
> +				if (!prealloc) {
> +					/*
> +					 * Preallocate the huge page without
> +					 * tons of locks since we could sleep.
> +					 * Note: we can't use any reservation
> +					 * because the page will be exclusively
> +					 * owned by the child later.
> +					 */
> +					put_page(ptepage);
> +					spin_unlock(src_ptl);
> +					spin_unlock(dst_ptl);
> +					prealloc = alloc_huge_page(vma, addr, 0);

One quick question:

The comment says we can't use any reservation, and I agree.  However, the
alloc_huge_page call has 0 as the avoid_reserve argument.  Shouldn't that
be !0 to avoid reserves?

-- 
Mike Kravetz

> +					if (!prealloc) {
> +						/*
> +						 * hugetlb_cow() seems to be
> +						 * more careful here than us.
> +						 * However for fork() we could
> +						 * be strict not only because
> +						 * no one should be referencing
> +						 * the child mm yet, but also
> +						 * if resources are rare we'd
> +						 * better simply fail the
> +						 * fork() even earlier.
> +						 */
> +						ret = -ENOMEM;
> +						break;
> +					}
> +					goto again;
> +				}
> +				/*
> +				 * We have page preallocated so that we can do
> +				 * the copy right now.
> +				 */
> +				hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> +						  prealloc);
> +				put_page(ptepage);
> +				spin_unlock(src_ptl);
> +				spin_unlock(dst_ptl);
> +				prealloc = NULL;
> +				continue;
> +			}
> +
>  			if (cow) {
>  				/*
>  				 * No need to notify as we are downgrading page
> @@ -3826,9 +3890,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  				 */
>  				huge_ptep_set_wrprotect(src, addr, src_pte);
>  			}
> -			entry = huge_ptep_get(src_pte);
> -			ptepage = pte_page(entry);
> -			get_page(ptepage);
> +
>  			page_dup_rmap(ptepage, true);
>  			set_huge_pte_at(dst, addr, dst_pte, entry);
>  			hugetlb_count_add(pages_per_huge_page(h), dst);
> @@ -3842,6 +3904,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	else
>  		i_mmap_unlock_read(mapping);
>  
> +	/* Free the preallocated page if not used at last */
> +	if (prealloc)
> +		put_page(prealloc);
> +
>  	return ret;
>  }
>  


* Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm
  2021-02-03 21:15     ` Linus Torvalds
  (?)
@ 2021-02-03 22:08     ` Peter Xu
  -1 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 22:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Linux-MM, Wei Zhang, Matthew Wilcox,
	Jason Gunthorpe, Gal Pressman, Christoph Hellwig,
	Andrea Arcangeli, Jan Kara, Kirill Shutemov, David Gibson,
	Mike Rapoport, Mike Kravetz, Kirill Tkhai, Jann Horn,
	Andrew Morton

On Wed, Feb 03, 2021 at 01:15:03PM -0800, Linus Torvalds wrote:
> On Wed, Feb 3, 2021 at 1:08 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > This is the last missing piece of the COW-during-fork effort for when pinned
> > pages are found.  See 70e806e4e645 ("mm: Do early cow for pinned pages
> > during fork() for ptes", 2020-09-27) for more information, since we do a
> > similar thing here, just for hugetlb rather than ptes.
> 
> No issues with the code itself, but..
> 
> Comments are good, but the comments inside this block of code actually
> make the code *much* harder to read, because now the actual logic is
> much more spread out and you can't see what it does so well.
> 
> > +                       if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> > +                               /* This is very possibly a pinned huge page */
> > +                               if (!prealloc) {
> > +                                       /*
> > +                                        * Preallocate the huge page without
> > +                                        * tons of locks since we could sleep.
> > +                                        * Note: we can't use any reservation
> > +                                        * because the page will be exclusively
> > +                                        * owned by the child later.
> > +                                        */
> > +                                       put_page(ptepage);
> > +                                       spin_unlock(src_ptl);
> > +                                       spin_unlock(dst_ptl);
> > +                                       prealloc = alloc_huge_page(vma, addr, 0);
> > +                                       if (!prealloc) {
> > +                                               /*
> > +                                                * hugetlb_cow() seems to be
> > +                                                * more careful here than us.
> > +                                                * However for fork() we could
> > +                                                * be strict not only because
> > +                                                * no one should be referencing
> > +                                                * the child mm yet, but also
> > +                                                * if resources are rare we'd
> > +                                                * better simply fail the
> > +                                                * fork() even earlier.
> > +                                                */
> > +                                               ret = -ENOMEM;
> > +                                               break;
> > +                                       }
> > +                                       goto again;
> > +                               }
> > +                               /*
> > +                                * We have page preallocated so that we can do
> > +                                * the copy right now.
> > +                                */
> > +                               hugetlb_copy_page(vma, dst_pte, addr, ptepage,
> > +                                                 prealloc);
> > +                               put_page(ptepage);
> > +                               spin_unlock(src_ptl);
> > +                               spin_unlock(dst_ptl);
> > +                               prealloc = NULL;
> > +                               continue;
> > +                       }
> 
> Can you move the comment above the code?

Sure.

> And I _think_ the prealloc conditional could be split out into a helper
> function (which would help more), but maybe there are too many variables for
> that to be practical.

It's just that, compared to the pte case where we introduced
page_copy_prealloc(), here we've already got a very nice helper in
alloc_huge_page() which handles e.g. cgroup charging and so on, so it seems
clean enough to use it directly.

The only difference compared to the pte case is that I moved the reset of
"prealloc" out of the copy function, since we never fail there after all, to
avoid passing a struct page ** double pointer.

Would below look better (only comment change)?

---------------8<------------------
@@ -3816,6 +3832,39 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                        }
                        set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
                } else {
+                       entry = huge_ptep_get(src_pte);
+                       ptepage = pte_page(entry);
+                       get_page(ptepage);
+
+                       /*
+                        * This is a rare case where we see pinned hugetlb
+                        * pages while they're prone to COW.  We need to do the
+                        * COW earlier during fork.
+                        *
+                        * When pre-allocating the page we need to be without
+                        * all the locks since we could sleep when allocate.
+                        */
+                       if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+                               if (!prealloc) {
+                                       put_page(ptepage);
+                                       spin_unlock(src_ptl);
+                                       spin_unlock(dst_ptl);
+                                       prealloc = alloc_huge_page(vma, addr, 0);
+                                       if (!prealloc) {
+                                               ret = -ENOMEM;
+                                               break;
+                                       }
+                                       goto again;
+                               }
+                               hugetlb_copy_page(vma, dst_pte, addr, ptepage,
+                                                 prealloc);
+                               put_page(ptepage);
+                               spin_unlock(src_ptl);
+                               spin_unlock(dst_ptl);
+                               prealloc = NULL;
+                               continue;
+                       }
+
---------------8<------------------

Thanks,

-- 
Peter Xu



* Re: [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm
  2021-02-03 22:04   ` Mike Kravetz
@ 2021-02-03 22:30     ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-02-03 22:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Wei Zhang, Matthew Wilcox,
	Linus Torvalds, Jason Gunthorpe, Gal Pressman, Christoph Hellwig,
	Andrea Arcangeli, Jan Kara, Kirill Shutemov, David Gibson,
	Mike Rapoport, Kirill Tkhai, Jann Horn, Andrew Morton

On Wed, Feb 03, 2021 at 02:04:30PM -0800, Mike Kravetz wrote:
> > @@ -3816,6 +3832,54 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  			}
> >  			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >  		} else {
> > +			entry = huge_ptep_get(src_pte);
> > +			ptepage = pte_page(entry);
> > +			get_page(ptepage);
> > +
> > +			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
> > +				/* This is very possibly a pinned huge page */
> > +				if (!prealloc) {
> > +					/*
> > +					 * Preallocate the huge page without
> > +					 * tons of locks since we could sleep.
> > +					 * Note: we can't use any reservation
> > +					 * because the page will be exclusively
> > +					 * owned by the child later.
> > +					 */
> > +					put_page(ptepage);
> > +					spin_unlock(src_ptl);
> > +					spin_unlock(dst_ptl);
> > +					prealloc = alloc_huge_page(vma, addr, 0);
> 
> One quick question:
> 
> The comment says we can't use any reservation, and I agree.  However, the
> alloc_huge_page call has 0 as the avoid_reserve argument.  Shouldn't that
> be !0 to avoid reserves?

Good point..  I obviously wanted to skip the reservation check, but got
cheated by the inverted name. :)

Though I did check the reservation path, and it seems not extremely important -
when we fork and copy the vma, we have already dropped the vma resv map:

		if (is_vm_hugetlb_page(tmp))
			reset_vma_resv_huge_pages(tmp);

Then in alloc_huge_page() we check vma_resv_map() almost everywhere we'd check
avoid_reserve too (either in vma_needs_reservation(), or when calculating
deferred_reserve), so avoid_reserve seems to matter mostly when vma_resv_map()
exists.

But I completely agree I should pass in "1" here in v2.

Thanks,

-- 
Peter Xu



* Re: [PATCH 1/4] hugetlb: Dedup the code to add a new file_region
  2021-02-03 21:08 ` [PATCH 1/4] hugetlb: Dedup the code to add a new file_region Peter Xu
@ 2021-02-03 23:01   ` Mike Kravetz
  2021-02-04  1:59   ` Miaohe Lin
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-02-03 23:01 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, Christoph Hellwig, Andrea Arcangeli, Jan Kara,
	Kirill Shutemov, David Gibson, Mike Rapoport, Kirill Tkhai,
	Jann Horn, Andrew Morton

On 2/3/21 1:08 PM, Peter Xu wrote:
> Introduce a hugetlb_resv_map_add() helper to add a new file_region rather
> than duplicating similar code twice in add_reservation_in_range().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 51 +++++++++++++++++++++++++++------------------------
>  1 file changed, 27 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18f6ee317900..d2859c2aecc9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c

Thanks, that is a pretty straightforward change.  A cleanup with no
functional change.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


* Re: [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can
  2021-02-03 21:08 ` [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can Peter Xu
@ 2021-02-04  0:45   ` Mike Kravetz
  2021-02-04  2:20   ` Miaohe Lin
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Kravetz @ 2021-02-04  0:45 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, Christoph Hellwig, Andrea Arcangeli, Jan Kara,
	Kirill Shutemov, David Gibson, Mike Rapoport, Kirill Tkhai,
	Jann Horn, Andrew Morton

On 2/3/21 1:08 PM, Peter Xu wrote:
> All the regions maintained in the hugetlb reserve map are inclusive on "from"
> but exclusive on "to".  We can break out earlier even when rg->from == t,
> because a region starting exactly at t can no longer intersect [f, t).
> 
> This does not need a Fixes tag, because when rg->from == t the old code fails
> to break out of the loop when it could, but the next thing it does is still
> add the last file_region we need and quit the loop on the next iteration.  So
> this change is not a bugfix (the old code still runs correctly, iiuc), but
> it's worth touching up to make the loop logically sane.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d2859c2aecc9..9e6ea96bf33b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -377,7 +377,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>  		/* When we find a region that starts beyond our range, we've
>  		 * finished.
>  		 */
> -		if (rg->from > t)
> +		if (rg->from >= t)
>  			break;
>  
>  		/* Add an entry for last_accounted_offset -> rg->from, and
> 

Changing any of this code makes me nervous.  However, I agree with your
analysis.  The change makes the code match the comment WRT the [from, to)
nature of regions.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


* Re: [PATCH 1/4] hugetlb: Dedup the code to add a new file_region
  2021-02-03 21:08 ` [PATCH 1/4] hugetlb: Dedup the code to add a new file_region Peter Xu
  2021-02-03 23:01   ` Mike Kravetz
@ 2021-02-04  1:59   ` Miaohe Lin
  1 sibling, 0 replies; 15+ messages in thread
From: Miaohe Lin @ 2021-02-04  1:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, Christoph Hellwig, Andrea Arcangeli, Jan Kara,
	Kirill Shutemov, David Gibson, Mike Rapoport, Mike Kravetz,
	Kirill Tkhai, Jann Horn, Andrew Morton, linux-kernel, linux-mm

Hi:
On 2021/2/4 5:08, Peter Xu wrote:
> Introduce a hugetlb_resv_map_add() helper to add a new file_region rather
> than duplicating similar code twice in add_reservation_in_range().
> 

This cleanup was also in my plan, but I was too slow to get to it.  Many thanks for doing this.
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 51 +++++++++++++++++++++++++++------------------------
>  1 file changed, 27 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18f6ee317900..d2859c2aecc9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -321,6 +321,24 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
>  	}
>  }
>  
> +static inline long
> +hugetlb_resv_map_add(struct resv_map *map, struct file_region *rg, long from,
> +		     long to, struct hstate *h, struct hugetlb_cgroup *cg,
> +		     long *regions_needed)
> +{
> +	struct file_region *nrg;
> +
> +	if (!regions_needed) {
> +		nrg = get_file_region_entry_from_cache(map, from, to);
> +		record_hugetlb_cgroup_uncharge_info(cg, h, map, nrg);
> +		list_add(&nrg->link, rg->link.prev);
> +		coalesce_file_region(map, nrg);
> +	} else
> +		*regions_needed += 1;
> +
> +	return to - from;
> +}
> +
>  /*
>   * Must be called with resv->lock held.
>   *
> @@ -336,7 +354,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>  	long add = 0;
>  	struct list_head *head = &resv->regions;
>  	long last_accounted_offset = f;
> -	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> +	struct file_region *rg = NULL, *trg = NULL;
>  
>  	if (regions_needed)
>  		*regions_needed = 0;
> @@ -365,18 +383,11 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>  		/* Add an entry for last_accounted_offset -> rg->from, and
>  		 * update last_accounted_offset.
>  		 */
> -		if (rg->from > last_accounted_offset) {
> -			add += rg->from - last_accounted_offset;
> -			if (!regions_needed) {
> -				nrg = get_file_region_entry_from_cache(
> -					resv, last_accounted_offset, rg->from);
> -				record_hugetlb_cgroup_uncharge_info(h_cg, h,
> -								    resv, nrg);
> -				list_add(&nrg->link, rg->link.prev);
> -				coalesce_file_region(resv, nrg);
> -			} else
> -				*regions_needed += 1;
> -		}
> +		if (rg->from > last_accounted_offset)
> +			add += hugetlb_resv_map_add(resv, rg,
> +						    last_accounted_offset,
> +						    rg->from, h, h_cg,
> +						    regions_needed);
>  
>  		last_accounted_offset = rg->to;
>  	}
> @@ -384,17 +395,9 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>  	/* Handle the case where our range extends beyond
>  	 * last_accounted_offset.
>  	 */
> -	if (last_accounted_offset < t) {
> -		add += t - last_accounted_offset;
> -		if (!regions_needed) {
> -			nrg = get_file_region_entry_from_cache(
> -				resv, last_accounted_offset, t);
> -			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
> -			list_add(&nrg->link, rg->link.prev);
> -			coalesce_file_region(resv, nrg);
> -		} else
> -			*regions_needed += 1;
> -	}
> +	if (last_accounted_offset < t)
> +		add += hugetlb_resv_map_add(resv, rg, last_accounted_offset,
> +					    t, h, h_cg, regions_needed);
>  
>  	VM_BUG_ON(add < 0);
>  	return add;
> 



* Re: [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can
  2021-02-03 21:08 ` [PATCH 2/4] hugetlb: Break earlier in add_reservation_in_range() when we can Peter Xu
  2021-02-04  0:45   ` Mike Kravetz
@ 2021-02-04  2:20   ` Miaohe Lin
  1 sibling, 0 replies; 15+ messages in thread
From: Miaohe Lin @ 2021-02-04  2:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Gal Pressman, Christoph Hellwig, Andrea Arcangeli, Jan Kara,
	Kirill Shutemov, David Gibson, Mike Rapoport, Mike Kravetz,
	Kirill Tkhai, Jann Horn, Andrew Morton, linux-kernel, Linux-MM

Hi:
On 2021/2/4 5:08, Peter Xu wrote:
> All the regions maintained in the hugetlb reserve map are inclusive on "from"
> but exclusive on "to".  We can break out earlier even when rg->from == t,
> because a region starting exactly at t can no longer intersect [f, t).
> 
> This does not need a Fixes tag, because when rg->from == t the old code fails
> to break out of the loop when it could, but the next thing it does is still
> add the last file_region we need and quit the loop on the next iteration.  So
> this change is not a bugfix (the old code still runs correctly, iiuc), but
> it's worth touching up to make the loop logically sane.
> 

I think the difference is where we handle the rg->from == t case: previously
inside the loop, now below it.  But the result should be the same.
Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d2859c2aecc9..9e6ea96bf33b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -377,7 +377,7 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>  		/* When we find a region that starts beyond our range, we've
>  		 * finished.
>  		 */
> -		if (rg->from > t)
> +		if (rg->from >= t)
>  			break;
>  
>  		/* Add an entry for last_accounted_offset -> rg->from, and
> 



* Re: [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups
  2021-02-03 21:08 [PATCH 0/4] mm/hugetlb: Early cow on fork, and a few cleanups Peter Xu
                   ` (3 preceding siblings ...)
  2021-02-03 21:08 ` [PATCH 4/4] hugetlb: Do early cow when page pinned on src mm Peter Xu
@ 2021-02-04 14:32 ` Gal Pressman
  4 siblings, 0 replies; 15+ messages in thread
From: Gal Pressman @ 2021-02-04 14:32 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Wei Zhang, Matthew Wilcox, Linus Torvalds, Jason Gunthorpe,
	Christoph Hellwig, Andrea Arcangeli, Jan Kara, Kirill Shutemov,
	David Gibson, Mike Rapoport, Mike Kravetz, Kirill Tkhai,
	Jann Horn, Andrew Morton

On 03/02/2021 23:08, Peter Xu wrote:
> As reported by Gal [1], we are still missing the code to handle early COW
> during fork() for the hugetlb case.  Again, it still feels odd to me to
> fork() after using a few huge pages, especially if they're privately
> mapped..  However I do agree with Gal and Jason that we should still have
> this, since it at least completes the early-COW-on-fork effort, and it
> still fixes cases where the buffers are not well under control and
> MADV_DONTFORK is not easy to apply.
> 
> The first two patches (1-2) are some cleanups I noticed when reading the
> hugetlb reserve map code.  I think they're good to have, but they're not
> necessary for fixing the fork issue.
> 
> The last two patches (3-4) are the real fix.
> 
> I tested this with a fork() after some vfio-pci assignment, so I'm pretty
> sure the page copy path triggers (the page is accounted right after the
> fork()), but I didn't verify the data since the card I assigned is some
> random NIC.  Gal, please feel free to try this if you have a better way
> to verify the series.

Thanks Peter, once v2 is submitted I'll pull the patches and we'll run the tests
that discovered the issue to verify it works.

