* [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages
@ 2022-03-29 16:04 David Hildenbrand
  2022-03-29 16:04 ` [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
                   ` (16 more replies)
  0 siblings, 17 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand, Khalid Aziz

This series is the result of the discussion on the previous approach [2].
More information on the general COW issues can be found there. It is based
on latest linus/master (post v5.17, with relevant core-MM changes for
v5.18-rc1).

v3 is located at:
	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v3


This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
on an anonymous page and COW logic failed to detect exclusivity of the
page, replacing the anonymous page by a copy in the page table: the GUP
pin lost synchronicity with the page mapped into the page tables.

This issue, including other related COW issues, has been summarized in [3]
under 3):
"
  3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)

  page_maybe_dma_pinned() is used to check if a page may be pinned for
  DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
  tolerable, false negatives are problematic: pages that are pinned for
  DMA must not be added to the swapcache. If it happens, the (now pinned)
  page could be faulted back from the swapcache into page tables
  read-only. Future write-access would detect the pinning and COW the
  page, losing synchronicity. For the interested reader, this is nicely
  documented in feb889fb40fa ("mm: don't put pinned pages into the swap
  cache").

  Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
  cases and can result in a violation of the documented semantics:
  giving false negatives because of the race.

  There are cases where we call it without properly taking a per-process
  sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
  one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
  handle, there is especially one rmap case (shrink_page_list) that's hard
  to fix: in the rmap world, we're not limited to a single process.

  The shrink_page_list() issue is really subtle. If we race with
  someone pinning a page, we can trigger the same issue as in the FOLL_GET
  case. See the detail section at the end of this mail for a discussion of
  how badly this can bite us with VFIO or other FOLL_PIN users.

  It's harder to reproduce, but I managed to modify the O_DIRECT
  reproducer to use io_uring fixed buffers [15] instead, which ends up
  using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
  similarly trigger a loss of synchronicity and consequently a memory
  corruption.

  Again, the root issue is that a write-fault on a page that has
  additional references results in a COW and thereby a loss of
  synchronicity and consequently a memory corruption if two parties
  believe they are referencing the same page.
"

This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
in particular also taking care of concurrent pinning via GUP-fast; for
example, it fully fixes a recently reported issue regarding NUMA
balancing [4]. While doing that, it further reduces "unnecessary COWs",
especially when we don't fork()/KSM and don't swap out, and fixes COW
security for hugetlb with FOLL_PIN.

In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
anonymous page is exclusive. Exclusive anonymous pages that are mapped
R/O can directly be mapped R/W by the COW logic in the write fault handler.
Exclusive anonymous pages that want to be shared (fork(), KSM) first have
to be marked shared -- which will fail if there are GUP pins on the page.
GUP is only allowed to take a pin on anonymous pages that are exclusive.
The PT lock is the primary mechanism to synchronize modifications of
PG_anon_exclusive. We synchronize against GUP-fast either via the
src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of
the relevant page table entry.
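
As a rough, purely illustrative model of these rules, here is a toy
userspace sketch; it is not the code added by this series, and the names
and types are made up:

#include <stdbool.h>
#include <errno.h>

/* Toy model of the exclusivity rules described above; not kernel code. */
struct toy_anon_page {
        bool anon_exclusive;    /* models PG_anon_exclusive */
        bool maybe_dma_pinned;  /* models page_maybe_dma_pinned() */
};

/* GUP: a FOLL_PIN may only be taken on an exclusive anonymous page. */
static bool may_pin(const struct toy_anon_page *p)
{
        return p->anon_exclusive;
}

/*
 * fork()/KSM: sharing first marks the page non-exclusive, which must fail
 * if the page may be pinned -- the caller has to copy the page instead.
 */
static int try_share(struct toy_anon_page *p)
{
        if (p->maybe_dma_pinned)
                return -EBUSY;
        p->anon_exclusive = false;
        return 0;
}

/*
 * Write fault: an exclusive anonymous page mapped R/O can simply be
 * mapped R/W again without copying.
 */
static bool can_reuse(const struct toy_anon_page *p)
{
        return p->anon_exclusive;
}

int main(void)
{
        struct toy_anon_page page = { .anon_exclusive = true };

        /* Pin first, then fork(): sharing fails, so fork() must copy. */
        page.maybe_dma_pinned = may_pin(&page);
        return (try_share(&page) == -EBUSY && can_reuse(&page)) ? 0 : 1;
}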

Special care has to be taken about swap, migration, and THPs (whereby a
PMD mapping can be converted to a PTE mapping and we have to track
information for subpages). Besides that, we let the rmap code handle most
of the magic. For reliable R/O pins of anonymous pages, we need the
FAULT_FLAG_UNSHARE logic from our previous approach [2]; however, it's now
100% mapcount free and I further simplified it a bit.
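
And, as a toy continuation of the same idea, a sketch of what a reliable
R/O pin has to do when the page is not (or no longer) exclusive; unshare()
here stands in for a FAULT_FLAG_UNSHARE fault that breaks COW without
mapping the page writable (again, illustrative only, not this series'
code):

#include <stdbool.h>
#include <stdlib.h>

struct toy_anon_page {
        bool anon_exclusive;
        int  pins;
};

/*
 * Models FAULT_FLAG_UNSHARE: replace a shared anonymous page by an
 * exclusive copy for the faulting process, without granting write access.
 */
static struct toy_anon_page *unshare(struct toy_anon_page *shared)
{
        struct toy_anon_page *copy = calloc(1, sizeof(*copy));

        if (copy)
                copy->anon_exclusive = true;
        return copy;
}

/*
 * R/O pin: only pin exclusive pages; otherwise unshare first, so the
 * pinned page can never be replaced behind our back by a later COW.
 */
static struct toy_anon_page *ro_pin(struct toy_anon_page *page)
{
        if (!page->anon_exclusive) {
                page = unshare(page);
                if (!page)
                        return NULL;
        }
        page->pins++;
        return page;
}

int main(void)
{
        struct toy_anon_page shared = { .anon_exclusive = false };
        struct toy_anon_page *pinned = ro_pin(&shared);

        return (pinned && pinned->anon_exclusive) ? 0 : 1;
}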

  #1 is a fix
  #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
  #11 introduces PG_anon_exclusive
  #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
   reliable
  #13 is a preparation for reliable R/O pins
  #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins to
   make R/O pins of anonymous pages reliable
  #16 adds sanity checks when (un)pinning anonymous pages


[1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
[4] https://bugzilla.kernel.org/show_bug.cgi?id=215616


v2 -> v3:
* Note 1: Left the terminology "unshare" in place for now instead of
  switching to "make anon exclusive".
* Note 2: We might have to tackle undoing effects of arch_unmap_one() on
  sparc, to free some tag memory immediately instead of when tearing down
  the vma/mm; looks like this needs more care either way, so I'll ignore it
  for now.
* Rebased on top of core MM changes for v5.18-rc1 (most conflicts were due
  to folio and ZONE_DEVICE migration rework). No severe changes were
  necessary -- mostly folio conversion and code movement.
* Retested on aarch64, ppc64, s390x and x86_64
* "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
  -> Missed to convert one instance in restore_exclusive_pte()
* "mm/rmap: pass rmap flags to hugepage_add_anon_rmap()"
  -> Use "!!(flags & RMAP_EXCLUSIVE)" to avoid sparse warnings
* "mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()"
  -> Added, as we can trigger that now more frequently
* "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
  -> Use subpage in VM_BUG_ON_PAGE() in try_to_migrate_one()
  -> Move comment from folio_migrate_mapping() to folio_migrate_flags()
     regarding PG_anon_exclusive/PG_mappedtodisk
  -> s/int rmap_flags/rmap_t rmap_flags/ in remove_migration_pmd()
* "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
   exclusive when (un)pinning"
  -> Use IS_ENABLED(CONFIG_DEBUG_VM) instead of ifdef

v1 -> v2:
* Tested on aarch64, ppc64, s390x and x86_64
* "mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon()
   pages"
  -> Use PG_mappedtodisk instead of PG_slab (thanks Willy!); this simplifies
     the patch and the necessary handling a lot. Added safety BUG_ONs
  -> Move most documentation to the patch description, to be placed in a
     proper documentation doc in the future, once everything's in place
* ""mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  -> Skip check+clearing in page_try_dup_anon_rmap(), otherwise we might
     trigger a wrong VM_BUG_ON() for KSM pages in ClearPageAnonExclusive()
  -> In __split_huge_pmd_locked(), call page_try_share_anon_rmap() only
     for "anon_exclusive", otherwise we might trigger a wrong VM_BUG_ON()
  -> In __split_huge_page_tail(), drop any remaining PG_anon_exclusive on
     tail pages, and document why that is fine

RFC -> v1:
* Rephrased/extended some patch descriptions+comments
* Tested on aarch64, ppc64 and x86_64
* "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
 -> Added
* "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()"
 -> Added
* "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
 -> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily
    dropping the PT lock.
 -> Use "reuse" label in __do_huge_pmd_anonymous_page()
 -> Slightly simplify logic in hugetlb_cow()
 -> In remove_migration_pte(), remove unrelated changes around
    page_remove_rmap()
* "mm: support GUP-triggered unsharing of anonymous pages"
 -> In handle_pte_fault(), trigger pte_mkdirty() only with
    FAULT_FLAG_WRITE
 -> In __handle_mm_fault(), extend comment regarding anonymous PUDs
* "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
   anonymous page"
   -> Added unsharing logic to gup_hugepte() and gup_huge_pud()
   -> Changed return logic in __follow_hugetlb_must_fault(), making sure
      that "unshare" is always set
* "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
   exclusive when (un)pinning"
  -> Slightly simplified sanity_check_pinned_pages()

David Hildenbrand (16):
  mm/rmap: fix missing swap_free() in try_to_unmap() after
    arch_unmap_one() failed
  mm/hugetlb: take src_mm->write_protect_seq in
    copy_hugetlb_page_range()
  mm/memory: slightly simplify copy_present_pte()
  mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and
    page_try_dup_anon_rmap()
  mm/rmap: convert RMAP flags to a proper distinct rmap_t type
  mm/rmap: remove do_page_add_anon_rmap()
  mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
  mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon()
    page exclusively
  mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()
  mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for
    PageAnon() pages
  mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  mm/gup: disallow follow_page(FOLL_PIN)
  mm: support GUP-triggered unsharing of anonymous pages
  mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
    anonymous page
  mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
    exclusive when (un)pinning

 include/linux/mm.h         |  46 +++++++-
 include/linux/mm_types.h   |   8 ++
 include/linux/page-flags.h |  39 ++++++-
 include/linux/rmap.h       | 118 +++++++++++++++++--
 include/linux/swap.h       |  15 ++-
 include/linux/swapops.h    |  25 ++++
 kernel/events/uprobes.c    |   2 +-
 mm/gup.c                   | 106 ++++++++++++++++-
 mm/huge_memory.c           | 127 +++++++++++++++-----
 mm/hugetlb.c               | 135 ++++++++++++++-------
 mm/khugepaged.c            |   2 +-
 mm/ksm.c                   |  15 ++-
 mm/memory.c                | 234 +++++++++++++++++++++++--------------
 mm/memremap.c              |   9 ++
 mm/migrate.c               |  18 ++-
 mm/migrate_device.c        |  23 +++-
 mm/mprotect.c              |   8 +-
 mm/rmap.c                  |  97 +++++++++++----
 mm/swapfile.c              |   8 +-
 mm/userfaultfd.c           |   2 +-
 tools/vm/page-types.c      |   8 +-
 21 files changed, 825 insertions(+), 220 deletions(-)


base-commit: 1930a6e739c4b4a654a69164dbe39e554d228915
-- 
2.35.1



* [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-11 16:04   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() David Hildenbrand
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand, Khalid Aziz

In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
undo that properly via swap_free().

Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5cb970d51f0a..07f59bc6ffc1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1637,6 +1637,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+				swap_free(entry);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
-- 
2.35.1



* [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
  2022-03-29 16:04 ` [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-11 16:15   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte() David Hildenbrand
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Let's do it just like copy_page_range(), taking the seqlock and making
sure the mmap_lock is held in write mode.

This allows us to add a VM_BUG_ON to page_needs_cow_for_dma() and
properly synchronizes concurrent fork() with GUP-fast of hugetlb pages,
which will be relevant for further changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 4 ++++
 mm/hugetlb.c       | 8 ++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..9e53458a5ed3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1576,6 +1576,8 @@ static inline bool page_maybe_dma_pinned(struct page *page)
 /*
  * This should most likely only be called during fork() to see whether we
  * should break the cow immediately for a page on the src mm.
+ *
+ * The caller has to hold the PT lock and the vma->vm_mm->write_protect_seq.
  */
 static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 					  struct page *page)
@@ -1583,6 +1585,8 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 	if (!is_cow_mapping(vma->vm_flags))
 		return false;
 
+	VM_BUG_ON(!(raw_read_seqcount(&vma->vm_mm->write_protect_seq) & 1));
+
 	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
 		return false;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b34f50156f7e..ad5ffdc615c6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4714,6 +4714,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					vma->vm_start,
 					vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
+		mmap_assert_write_locked(src);
+		raw_write_seqcount_begin(&src->write_protect_seq);
 	} else {
 		/*
 		 * For shared mappings i_mmap_rwsem must be held to call
@@ -4846,10 +4848,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		spin_unlock(dst_ptl);
 	}
 
-	if (cow)
+	if (cow) {
+		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
-	else
+	} else {
 		i_mmap_unlock_read(mapping);
+	}
 
 	return ret;
 }
-- 
2.35.1



* [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
  2022-03-29 16:04 ` [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
  2022-03-29 16:04 ` [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-11 16:38   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() David Hildenbrand
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Let's move the pinning check into the caller, to simplify return code
logic and prepare for further changes: relocating
page_needs_cow_for_dma() into the rmap handling code.

While at it, remove the unused pte parameter and simplify the comments a
bit.

No functional change intended.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 53 ++++++++++++++++-------------------------------------
 1 file changed, 16 insertions(+), 37 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index be44d0b36b18..e3038cb6f212 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -862,19 +862,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 }
 
 /*
- * Copy a present and normal page if necessary.
+ * Copy a present and normal page.
  *
- * NOTE! The usual case is that this doesn't need to do
- * anything, and can just return a positive value. That
- * will let the caller know that it can just increase
- * the page refcount and re-use the pte the traditional
- * way.
- *
- * But _if_ we need to copy it because it needs to be
- * pinned in the parent (and the child should get its own
- * copy rather than just a reference to the same page),
- * we'll do that here and return zero to let the caller
- * know we're done.
+ * NOTE! The usual case is that this isn't required;
+ * instead, the caller can just increase the page refcount
+ * and re-use the pte the traditional way.
  *
  * And if we need a pre-allocated page but don't yet have
  * one, return a negative error to let the preallocation
@@ -884,25 +876,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 static inline int
 copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		  pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		  struct page **prealloc, pte_t pte, struct page *page)
+		  struct page **prealloc, struct page *page)
 {
 	struct page *new_page;
-
-	/*
-	 * What we want to do is to check whether this page may
-	 * have been pinned by the parent process.  If so,
-	 * instead of wrprotect the pte on both sides, we copy
-	 * the page immediately so that we'll always guarantee
-	 * the pinned page won't be randomly replaced in the
-	 * future.
-	 *
-	 * The page pinning checks are just "has this mm ever
-	 * seen pinning", along with the (inexact) check of
-	 * the page count. That might give false positives for
-	 * for pinning, but it will work correctly.
-	 */
-	if (likely(!page_needs_cow_for_dma(src_vma, page)))
-		return 1;
+	pte_t pte;
 
 	new_page = *prealloc;
 	if (!new_page)
@@ -944,14 +921,16 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	struct page *page;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page) {
-		int retval;
-
-		retval = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-					   addr, rss, prealloc, pte, page);
-		if (retval <= 0)
-			return retval;
-
+	if (page && unlikely(page_needs_cow_for_dma(src_vma, page))) {
+		/*
+		 * If this page may have been pinned by the parent process,
+		 * copy the page immediately for the child so that we'll always
+		 * guarantee the pinned page won't be randomly replaced in the
+		 * future.
+		 */
+		return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
+					 addr, rss, prealloc, page);
+	} else if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
-- 
2.35.1



* [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (2 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-11 18:18   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type David Hildenbrand
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

... and move the special check for pinned pages into
page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous
pages via a new pageflag, clearing it only after making sure that there
are no GUP pins on the anonymous page.

We really only care about pins on anonymous pages, because they are
prone to getting replaced in the COW handler once mapped R/O. For !anon
pages in COW mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really
have to care; at least I could not come up with an example where it matters.

Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
know we're dealing with anonymous pages. Also, drop the handling of
pinned pages from copy_huge_pud() and add a comment if ever supporting
anonymous pages on the PUD level.

This is a preparation for tracking exclusivity of anonymous pages in
the rmap code, and disallowing marking a page shared (-> failing to
duplicate) if there are GUP pins on a page.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h   |  5 +----
 include/linux/rmap.h | 49 +++++++++++++++++++++++++++++++++++++++++++-
 mm/huge_memory.c     | 27 ++++++++----------------
 mm/hugetlb.c         | 16 ++++++++-------
 mm/memory.c          | 17 ++++++++++-----
 mm/migrate.c         |  2 +-
 6 files changed, 79 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e53458a5ed3..dfc4ec83f76e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1575,16 +1575,13 @@ static inline bool page_maybe_dma_pinned(struct page *page)
 
 /*
  * This should most likely only be called during fork() to see whether we
- * should break the cow immediately for a page on the src mm.
+ * should break the cow immediately for an anon page on the src mm.
  *
  * The caller has to hold the PT lock and the vma->vm_mm->write_protect_seq.
  */
 static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 					  struct page *page)
 {
-	if (!is_cow_mapping(vma->vm_flags))
-		return false;
-
 	VM_BUG_ON(!(raw_read_seqcount(&vma->vm_mm->write_protect_seq) & 1));
 
 	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 17230c458341..9d602fc34063 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -12,6 +12,7 @@
 #include <linux/memcontrol.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -182,11 +183,57 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-static inline void page_dup_rmap(struct page *page, bool compound)
+static inline void __page_dup_rmap(struct page *page, bool compound)
 {
 	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
+static inline void page_dup_file_rmap(struct page *page, bool compound)
+{
+	__page_dup_rmap(page, compound);
+}
+
+/**
+ * page_try_dup_anon_rmap - try duplicating a mapping of an already mapped
+ *			    anonymous page
+ * @page: the page to duplicate the mapping for
+ * @compound: the page is mapped as compound or as a small page
+ * @vma: the source vma
+ *
+ * The caller needs to hold the PT lock and the vma->vm_mm->write_protect_seq.
+ *
+ * Duplicating the mapping can only fail if the page may be pinned; device
+ * private pages cannot get pinned and consequently this function cannot fail.
+ *
+ * If duplicating the mapping succeeds, the page has to be mapped R/O into
+ * the parent and the child. It must *not* get mapped writable after this call.
+ *
+ * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise.
+ */
+static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
+					 struct vm_area_struct *vma)
+{
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+	/*
+	 * If this page may have been pinned by the parent process,
+	 * don't allow to duplicate the mapping but instead require to e.g.,
+	 * copy the page immediately for the child so that we'll always
+	 * guarantee the pinned page won't be randomly replaced in the
+	 * future on write faults.
+	 */
+	if (likely(!is_device_private_page(page) &&
+	    unlikely(page_needs_cow_for_dma(vma, page))))
+		return -EBUSY;
+
+	/*
+	 * It's okay to share the anon page between both processes, mapping
+	 * the page R/O into both processes.
+	 */
+	__page_dup_rmap(page, compound);
+	return 0;
+}
+
 /*
  * Called from mm/vmscan.c to handle paging out
  */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2fe38212e07c..710e56087139 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1097,23 +1097,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 
-	/*
-	 * If this page is a potentially pinned page, split and retry the fault
-	 * with smaller page size.  Normally this should not happen because the
-	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
-	 * best effort that the pinned pages won't be replaced by another
-	 * random page during the coming copy-on-write.
-	 */
-	if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) {
+	get_page(src_page);
+	if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) {
+		/* Page maybe pinned: split and retry the fault on PTEs. */
+		put_page(src_page);
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
 		return -EAGAIN;
 	}
-
-	get_page(src_page);
-	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
@@ -1217,14 +1210,10 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
-	/* Please refer to comments in copy_huge_pmd() */
-	if (unlikely(page_needs_cow_for_dma(vma, pud_page(pud)))) {
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		__split_huge_pud(vma, src_pud, addr);
-		return -EAGAIN;
-	}
-
+	/*
+	 * TODO: once we support anonymous pages, use page_try_dup_anon_rmap()
+	 * and split if duplicating fails.
+	 */
 	pudp_set_wrprotect(src_mm, addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
 	set_pud_at(dst_mm, addr, dst_pud, pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ad5ffdc615c6..c11b431991f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4785,15 +4785,18 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			get_page(ptepage);
 
 			/*
-			 * This is a rare case where we see pinned hugetlb
-			 * pages while they're prone to COW.  We need to do the
-			 * COW earlier during fork.
+			 * Failing to duplicate the anon rmap is a rare case
+			 * where we see pinned hugetlb pages while they're
+			 * prone to COW. We need to do the COW earlier during
+			 * fork.
 			 *
 			 * When pre-allocating the page or copying data, we
 			 * need to be without the pgtable locks since we could
 			 * sleep during the process.
 			 */
-			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+			if (!PageAnon(ptepage)) {
+				page_dup_file_rmap(ptepage, true);
+			} else if (page_try_dup_anon_rmap(ptepage, true, vma)) {
 				pte_t src_pte_old = entry;
 				struct page *new;
 
@@ -4840,7 +4843,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				entry = huge_pte_wrprotect(entry);
 			}
 
-			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 			hugetlb_count_add(npages, dst);
 		}
@@ -5520,7 +5522,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		ClearHPageRestoreReserve(page);
 		hugepage_add_new_anon_rmap(page, vma, haddr);
 	} else
-		page_dup_rmap(page, true);
+		page_dup_file_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, haddr, ptep, new_pte);
@@ -5881,7 +5883,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 
 	if (vm_shared) {
-		page_dup_rmap(page, true);
+		page_dup_file_rmap(page, true);
 	} else {
 		ClearHPageRestoreReserve(page);
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
diff --git a/mm/memory.c b/mm/memory.c
index e3038cb6f212..2baa2fed1319 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -825,7 +825,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 */
 		get_page(page);
 		rss[mm_counter(page)]++;
-		page_dup_rmap(page, false);
+		/* Cannot fail as these pages cannot get pinned. */
+		BUG_ON(page_try_dup_anon_rmap(page, false, src_vma));
 
 		/*
 		 * We do not preserve soft-dirty information, because so
@@ -921,18 +922,24 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	struct page *page;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page && unlikely(page_needs_cow_for_dma(src_vma, page))) {
+	if (page && PageAnon(page)) {
 		/*
 		 * If this page may have been pinned by the parent process,
 		 * copy the page immediately for the child so that we'll always
 		 * guarantee the pinned page won't be randomly replaced in the
 		 * future.
 		 */
-		return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-					 addr, rss, prealloc, page);
+		get_page(page);
+		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
+			/* Page maybe pinned, we have to copy. */
+			put_page(page);
+			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
+						 addr, rss, prealloc, page);
+		}
+		rss[mm_counter(page)]++;
 	} else if (page) {
 		get_page(page);
-		page_dup_rmap(page, false);
+		page_dup_file_rmap(page, false);
 		rss[mm_counter(page)]++;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 3d60823afd2d..97de2fc17f34 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -234,7 +234,7 @@ static bool remove_migration_pte(struct folio *folio,
 			if (folio_test_anon(folio))
 				hugepage_add_anon_rmap(new, vma, pvmw.address);
 			else
-				page_dup_rmap(new, true);
+				page_dup_file_rmap(new, true);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		} else
 #endif
-- 
2.35.1



* [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (3 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  8:11   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap() David Hildenbrand
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

We want to pass the flags to more than one anon rmap function, getting
rid of special "do_page_add_anon_rmap()". So let's pass around a distinct
__bitwise type and refine documentation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h | 22 ++++++++++++++++++----
 mm/memory.c          |  6 +++---
 mm/rmap.c            |  7 ++++---
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9d602fc34063..2d0f12119a13 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -160,9 +160,23 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
-/* bitflags for do_page_add_anon_rmap() */
-#define RMAP_EXCLUSIVE 0x01
-#define RMAP_COMPOUND 0x02
+/* RMAP flags, currently only relevant for some anon rmap operations. */
+typedef int __bitwise rmap_t;
+
+/*
+ * No special request: if the page is a subpage of a compound page, it is
+ * mapped via a PTE. The mapped (sub)page is possibly shared between processes.
+ */
+#define RMAP_NONE		((__force rmap_t)0)
+
+/* The (sub)page is exclusive to a single process. */
+#define RMAP_EXCLUSIVE		((__force rmap_t)BIT(0))
+
+/*
+ * The compound page is not mapped via PTEs, but instead via a single PMD and
+ * should be accounted accordingly.
+ */
+#define RMAP_COMPOUND		((__force rmap_t)BIT(1))
 
 /*
  * rmap interfaces called when adding or removing pte of page
@@ -171,7 +185,7 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, bool compound);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long address, int flags);
+		unsigned long address, rmap_t flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, bool compound);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/memory.c b/mm/memory.c
index 2baa2fed1319..f0fd63a7f652 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3507,10 +3507,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL, *swapcache;
 	struct swap_info_struct *si = NULL;
+	rmap_t rmap_flags = RMAP_NONE;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	int exclusive = 0;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
@@ -3685,7 +3685,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = RMAP_EXCLUSIVE;
+		rmap_flags |= RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
@@ -3701,7 +3701,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		do_page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
 	}
 
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 07f59bc6ffc1..3623d16ae11b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1137,7 +1137,8 @@ static void __page_check_anon_rmap(struct page *page,
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
+	do_page_add_anon_rmap(page, vma, address,
+			      compound ? RMAP_COMPOUND : RMAP_NONE);
 }
 
 /*
@@ -1146,7 +1147,7 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int flags)
+	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1185,7 +1186,7 @@ void do_page_add_anon_rmap(struct page *page,
 	/* address might be in next vma when migration races vma_adjust */
 	else if (first)
 		__page_set_anon_rmap(page, vma, address,
-				flags & RMAP_EXCLUSIVE);
+				     !!(flags & RMAP_EXCLUSIVE));
 	else
 		__page_check_anon_rmap(page, vma, address);
 
-- 
2.35.1



* [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (4 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  8:13   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap() David Hildenbrand
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

... and instead convert page_add_anon_rmap() to accept flags.

Passing flags instead of bools is usually nicer either way, and we want
to pass RMAP_EXCLUSIVE more often in follow-up patches when
detecting that an anonymous page is exclusive: for example, when
restoring an anonymous page from a writable migration entry.

This is a preparation for marking an anonymous page inside
page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h |  2 --
 mm/huge_memory.c     |  2 +-
 mm/ksm.c             |  2 +-
 mm/memory.c          |  4 ++--
 mm/migrate.c         |  3 ++-
 mm/rmap.c            | 14 +-------------
 mm/swapfile.c        |  2 +-
 7 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2d0f12119a13..aa734d2e2b01 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -183,8 +183,6 @@ typedef int __bitwise rmap_t;
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long address, bool compound);
-void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, bool compound);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 710e56087139..d933143ff886 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3073,7 +3073,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
 
 	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, true);
+		page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND);
 	else
 		page_add_file_rmap(new, vma, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
diff --git a/mm/ksm.c b/mm/ksm.c
index 063a48eeb5ee..e0fb748e37b3 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1150,7 +1150,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		get_page(kpage);
-		page_add_anon_rmap(kpage, vma, addr, false);
+		page_add_anon_rmap(kpage, vma, addr, RMAP_NONE);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
diff --git a/mm/memory.c b/mm/memory.c
index f0fd63a7f652..60df8a365a72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -725,7 +725,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 	 * created when the swap entry was made.
 	 */
 	if (PageAnon(page))
-		page_add_anon_rmap(page, vma, address, false);
+		page_add_anon_rmap(page, vma, address, RMAP_NONE);
 	else
 		/*
 		 * Currently device exclusive access only supports anonymous
@@ -3701,7 +3701,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		do_page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
+		page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
 	}
 
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index 97de2fc17f34..436f0ec2da03 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -240,7 +240,8 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 		{
 			if (folio_test_anon(folio))
-				page_add_anon_rmap(new, vma, pvmw.address, false);
+				page_add_anon_rmap(new, vma, pvmw.address,
+						   RMAP_NONE);
 			else
 				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 3623d16ae11b..71bf881da2a6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1127,7 +1127,7 @@ static void __page_check_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- * @compound:	charge the page as compound or small page
+ * @flags:	the rmap flags
  *
  * The caller needs to hold the pte lock, and the page must be locked in
  * the anon_vma case: to serialize mapping,index checking after setting,
@@ -1135,18 +1135,6 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
-{
-	do_page_add_anon_rmap(page, vma, address,
-			      compound ? RMAP_COMPOUND : RMAP_NONE);
-}
-
-/*
- * Special version of the above for do_swap_page, which often runs
- * into pages that are exclusively owned by the current process.
- * Everybody else should continue to use page_add_anon_rmap above.
- */
-void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
 	bool compound = flags & RMAP_COMPOUND;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63c61f8b2611..1ba525a2179d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1800,7 +1800,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, false);
+		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-- 
2.35.1



* [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (5 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  8:37   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() David Hildenbrand
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Let's prepare for passing RMAP_EXCLUSIVE, just as we now do for
page_add_anon_rmap(). RMAP_COMPOUND is implicit for hugetlb
pages and ignored.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h | 2 +-
 mm/migrate.c         | 3 ++-
 mm/rmap.c            | 9 ++++++---
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index aa734d2e2b01..f47bc937c383 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -191,7 +191,7 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long address);
+		unsigned long address, rmap_t flags);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 436f0ec2da03..48db9500d20e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -232,7 +232,8 @@ static bool remove_migration_pte(struct folio *folio,
 			pte = pte_mkhuge(pte);
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
-				hugepage_add_anon_rmap(new, vma, pvmw.address);
+				hugepage_add_anon_rmap(new, vma, pvmw.address,
+						       RMAP_NONE);
 			else
 				page_dup_file_rmap(new, true);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 71bf881da2a6..b972eb8f351b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2347,9 +2347,11 @@ void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc)
  * The following two functions are for anonymous (private mapped) hugepages.
  * Unlike common anonymous pages, anonymous hugepages have no accounting code
  * and no lru code, because we handle hugepages differently from common pages.
+ *
+ * RMAP_COMPOUND is ignored.
  */
-void hugepage_add_anon_rmap(struct page *page,
-			    struct vm_area_struct *vma, unsigned long address)
+void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
+			    unsigned long address, rmap_t flags)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	int first;
@@ -2359,7 +2361,8 @@ void hugepage_add_anon_rmap(struct page *page,
 	/* address might be in next vma when migration races vma_adjust */
 	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
-		__page_set_anon_rmap(page, vma, address, 0);
+		__page_set_anon_rmap(page, vma, address,
+				     !!(flags & RMAP_EXCLUSIVE));
 }
 
 void hugepage_add_new_anon_rmap(struct page *page,
-- 
2.35.1



* [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (6 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  8:47   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively David Hildenbrand
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

New anonymous pages are always mapped natively: only the THP/khugepaged
code maps a new compound anonymous page and passes "true". Otherwise, we're
just dealing with simple, non-compound pages.

Let's give the interface clearer semantics and document these.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h    |  3 ++-
 kernel/events/uprobes.c |  2 +-
 mm/huge_memory.c        |  2 +-
 mm/khugepaged.c         |  2 +-
 mm/memory.c             | 10 +++++-----
 mm/migrate_device.c     |  2 +-
 mm/rmap.c               |  9 ++++++---
 mm/swapfile.c           |  2 +-
 mm/userfaultfd.c        |  2 +-
 9 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f47bc937c383..9c120e1b1bc7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -185,11 +185,12 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long address, bool compound);
+		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6418083901d4..4ef5385815d3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -180,7 +180,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (new_page) {
 		get_page(new_page);
-		page_add_new_anon_rmap(new_page, vma, addr, false);
+		page_add_new_anon_rmap(new_page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d933143ff886..c4526343565a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -647,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..9bb32fb7ec74 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1183,7 +1183,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address, true);
+	page_add_new_anon_rmap(new_page, vma, address);
 	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 60df8a365a72..03e29c9614e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -893,7 +893,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	*prealloc = NULL;
 	copy_user_highpage(new_page, page, addr, src_vma);
 	__SetPageUptodate(new_page);
-	page_add_new_anon_rmap(new_page, dst_vma, addr, false);
+	page_add_new_anon_rmap(new_page, dst_vma, addr);
 	lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
 	rss[mm_counter(new_page)]++;
 
@@ -3058,7 +3058,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * some TLBs while the old PTE remains in others.
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		page_add_new_anon_rmap(new_page, vma, vmf->address);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -3698,7 +3698,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
@@ -3848,7 +3848,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	page_add_new_anon_rmap(page, vma, vmf->address);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -4031,7 +4031,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 70c7dc05bbfc..fb6d7d5499f5 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -610,7 +610,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, addr, false);
+	page_add_new_anon_rmap(page, vma, addr);
 	if (!is_zone_device_page(page))
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index b972eb8f351b..517f56edf6ce 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1182,19 +1182,22 @@ void page_add_anon_rmap(struct page *page,
 }
 
 /**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * page_add_new_anon_rmap - add mapping to a new anonymous page
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
- * @compound:	charge the page as compound or small page
+ *
+ * If it's a compound page, it is accounted as a compound page. As the page
+ * is new, it's assumed to be mapped exclusively by a single process.
  *
  * Same as page_add_anon_rmap but must only be called on *new* pages.
  * This means the inc-and-test can be bypassed.
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address)
 {
+	const bool compound = PageCompound(page);
 	int nr = compound ? thp_nr_pages(page) : 1;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1ba525a2179d..0ad7ed7ded21 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1802,7 +1802,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	set_pte_at(vma->vm_mm, addr, pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0cb8e5ef1713..f9eb132c3260 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -101,7 +101,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 			lru_cache_add(page);
 		page_add_file_rmap(page, dst_vma, false);
 	} else {
-		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+		page_add_new_anon_rmap(page, dst_vma, dst_addr);
 		lru_cache_add_inactive_or_unevictable(page, dst_vma);
 	}
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (7 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  9:26   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() David Hildenbrand
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

We want to mark anonymous pages exclusive, and when using
page_move_anon_rmap() we know that we are the exclusive user, as
properly documented. This is a preparation for marking anonymous pages
exclusive in page_move_anon_rmap().

In both instances, we're holding page lock and are sure that we're the
exclusive owner (page_count() == 1). hugetlb already properly uses
page_move_anon_rmap() in the write fault handler.

Note that in case of a PTE-mapped THP, we'll only end up calling this
function if the whole THP is only referenced by the single PTE mapping
a single subpage (page_count() == 1); consequently, it's fine to modify
the compound page mapping inside page_move_anon_rmap().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 2 ++
 mm/memory.c      | 1 +
 2 files changed, 3 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4526343565a..dd16819c5edc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1317,6 +1317,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		try_to_free_swap(page);
 	if (page_count(page) == 1) {
 		pmd_t entry;
+
+		page_move_anon_rmap(page, vma);
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
diff --git a/mm/memory.c b/mm/memory.c
index 03e29c9614e0..4303c0fdcf17 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3303,6 +3303,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 * and the page is locked, it's dark out, and we're wearing
 		 * sunglasses. Hit it.
 		 */
+		page_move_anon_rmap(page, vma);
 		unlock_page(page);
 		wp_page_reuse(vmf);
 		return VM_FAULT_WRITE;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (8 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-12  9:37   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages David Hildenbrand
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

We can already theoretically fail to unmap (still having page_mapped()) in
case arch_unmap_one() fails, which can happen on sparc. Failures to
unmap are handled gracefully, just as if there are other references on
the target page: freezing the refcount in split_huge_page_to_list()
will fail if the page is still mapped and we'll simply remap it.

In commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when
unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE,
let's get rid of it completely now.

This is a preparation for making try_to_migrate() fail on anonymous pages
with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more
frequently.

Reported-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dd16819c5edc..70298431e128 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2241,8 +2241,6 @@ static void unmap_page(struct page *page)
 		try_to_migrate(folio, ttu_flags);
 	else
 		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
-
-	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
 
 static void remap_page(struct folio *folio, unsigned long nr)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (9 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-13  8:25   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared? We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages; PG_mappedtodisk,
however, should never apply to pages in the swapcache, so let's
reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second tail page of a
compound page, convert it to PG_error instead, which is marked as
PF_NO_TAIL and therefore never used for tail pages.

Use custom page flag modification functions such that we can do
additional sanity checks. The semantics we'll put into some kernel doc
in the future are:

"
  PG_anon_exclusive is *usually* only expressive in combination with a
  page table entry. Depending on the page table entry type it might
  store the following information:

       Is what's mapped via this page table entry exclusive to the
       single process and can be mapped writable without further
       checks? If not, it might be shared and we might have to COW.

  For now, we only expect PTE-mapped THPs to make use of
  PG_anon_exclusive in subpages. For other anonymous compound
  folios (i.e., hugetlb), only the head page is logically mapped and
  holds this information.

  For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
  set on the head page. When replacing the PMD by a page table full
  of PTEs, PG_anon_exclusive, if set on the head page, will be set on
  all tail pages accordingly. Note that converting from a PTE-mapping
  to a PMD mapping using the same compound page is currently not
  possible and consequently doesn't require care.

  If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
  it should only pin if the relevant PG_anon_exclusive bit is set. In that case,
  the pin will be fully reliable and stay consistent with the pages
  mapped into the page table, as the bit cannot get cleared (e.g., by
  fork(), KSM) while the page is pinned. For anonymous pages that
  are mapped R/W, PG_anon_exclusive can be assumed to always be set
  because such pages cannot possibly be shared.

  The page table lock protecting the page table entry is the primary
  synchronization mechanism for PG_anon_exclusive; GUP-fast that does
  not take the PT lock needs special care when trying to clear the
  flag.

  Page table entry types and PG_anon_exclusive:
  * Present: PG_anon_exclusive applies.
  * Swap: the information is lost. PG_anon_exclusive was cleared.
  * Migration: the entry holds this information instead.
               PG_anon_exclusive was cleared.
  * Device private: PG_anon_exclusive applies.
  * Device exclusive: PG_anon_exclusive applies.
  * HW Poison: PG_anon_exclusive is stale and not changed.

  If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
  not allowed and the flag will stick around until the page is freed
  and folio->mapping is cleared.
"

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; the page freeing code will handle that
when it also invalidates page->mapping so the page no longer indicates
PageAnon(). Letting the information about exclusivity stick around will
be an important property when adding sanity checks to the unpinning code.

Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.
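
For illustration only (not part of this patch): a minimal sketch of how later
COW/GUP changes might consult the new bit, assuming the helpers added to
include/linux/page-flags.h below; the helper name is hypothetical and locking
is elided:

	/* Hypothetical helper: may we treat this anon page as exclusive? */
	static inline bool anon_page_certainly_exclusive(struct page *page)
	{
		/* PageAnonExclusive() must only be called on PageAnon() pages. */
		if (!PageAnon(page))
			return false;
		/* Set: exclusive to a single process, safe to reuse or pin. */
		return PageAnonExclusive(page);
	}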

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 39 +++++++++++++++++++++++++++++++++++++-
 mm/hugetlb.c               |  2 ++
 mm/memory.c                | 11 +++++++++++
 mm/memremap.c              |  9 +++++++++
 mm/swapfile.c              |  4 ++++
 tools/vm/page-types.c      |  8 +++++++-
 6 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d8eeaa67d05..9f488668a1d7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -142,6 +142,15 @@ enum pageflags {
 
 	PG_readahead = PG_reclaim,
 
+	/*
+	 * Depending on the way an anonymous folio can be mapped into a page
+	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+	 * THP), PG_anon_exclusive may be set only for the head page or for
+	 * tail pages of an anonymous folio. For now, we only expect it to be
+	 * set on tail pages for PTE-mapped THP.
+	 */
+	PG_anon_exclusive = PG_mappedtodisk,
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -176,7 +185,7 @@ enum pageflags {
 	 * Indicates that at least one subpage is hwpoisoned in the
 	 * THP.
 	 */
-	PG_has_hwpoisoned = PG_mappedtodisk,
+	PG_has_hwpoisoned = PG_error,
 #endif
 
 	/* non-lru isolated movable page */
@@ -1002,6 +1011,34 @@ extern bool is_free_buddy_page(struct page *page);
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1UL << PG_mlocked)
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c11b431991f3..b8e7667116fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1672,6 +1672,8 @@ void free_huge_page(struct page *page)
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 
 	hugetlb_set_page_subpool(page, NULL);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
 	page->mapping = NULL;
 	restore_reserve = HPageRestoreReserve(page);
 	ClearHPageRestoreReserve(page);
diff --git a/mm/memory.c b/mm/memory.c
index 4303c0fdcf17..351623292adf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3663,6 +3663,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/*
+	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
+	 * must never point at an anonymous page in the swapcache that is
+	 * PG_anon_exclusive. Sanity check that this holds and especially, that
+	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
+	 * check after taking the PT lock and making sure that nobody
+	 * concurrently faulted in this page and set PG_anon_exclusive.
+	 */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
diff --git a/mm/memremap.c b/mm/memremap.c
index af0223605e69..4264f78299a8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,6 +458,15 @@ void free_zone_device_page(struct page *page)
 
 	mem_cgroup_uncharge(page_folio(page));
 
+	/*
+	 * Note: we don't expect anonymous compound pages yet. Once supported
+	 * and we could PTE-map them similar to THP, we'd have to clear
+	 * PG_anon_exclusive on all tail pages.
+	 */
+	VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
+
 	/*
 	 * When a device managed page is freed, the page->mapping field
 	 * may still contain a (stale) mapping value. For example, the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0ad7ed7ded21..a7847324d476 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1796,6 +1796,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		goto out;
 	}
 
+	/* See do_swap_page() */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index b1ed76d9a979..381dcc00cb62 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -80,9 +80,10 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41
 
-/* [48-] take some arbitrary free slots for expanding overloaded flags
+/* [47-] take some arbitrary free slots for expanding overloaded flags
  * not part of kernel API
  */
+#define KPF_ANON_EXCLUSIVE	47
 #define KPF_READAHEAD		48
 #define KPF_SLOB_FREE		49
 #define KPF_SLUB_FROZEN		50
@@ -138,6 +139,7 @@ static const char * const page_flag_names[] = {
 	[KPF_SOFTDIRTY]		= "f:softdirty",
 	[KPF_ARCH_2]		= "H:arch_2",
 
+	[KPF_ANON_EXCLUSIVE]	= "d:anon_exclusive",
 	[KPF_READAHEAD]		= "I:readahead",
 	[KPF_SLOB_FREE]		= "P:slob_free",
 	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
@@ -472,6 +474,10 @@ static int bit_mask_ok(uint64_t flags)
 
 static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 {
+	/* Anonymous pages overload PG_mappedtodisk */
+	if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
+		flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);
+
 	/* SLOB/SLUB overload several page flags */
 	if (flags & BIT(SLAB)) {
 		if (flags & BIT(PRIVATE))
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (10 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-13 16:28   ` Vlastimil Babka
  2022-04-13 18:29   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN) David Hildenbrand
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the
page table entry gets write-protected.

With that information at hand, we can extend our COW logic to always
reuse anonymous pages that are exclusive. For anonymous pages that
might be shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought; some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page tables
exclusively in a single process -- multiple page table locks apply.
Further, due to MADV_WIPEONFORK we might not necessarily write-protect
all PTEs, and only some subpages might be pinned. Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and avoid having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while
for THP it's actually "the part of an anonymous folio referenced via
a page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let
the anon rmap code handle all the PG_anon_exclusive logic it can easily
handle.

If a writable, present page table entry points at an anonymous (sub)page,
that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
pin (FOLL_PIN) on an anonymous page referenced via a present
page table entry, it must only pin if PG_anon_exclusive is set for the
mapped (sub)page.

This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE; follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages
fully reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:
* For fork(), this means having to copy the page and not being able to
  share it. fork() protects against concurrent GUP using the PT lock and
  the src_mm->write_protect_seq.
* For KSM, this means sharing will fail. For swap, this means unmapping
  will fail. For migration, this means migration will fail early. All
  three cases protect against concurrent GUP using the PT lock and a
  proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the successive write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN,
if fork() is *not* involved; however, swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these
GUP users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism fresh anonymous pages come into life (besides migration
code where we copy the page->mapping), all fresh anonymous pages will
start out as exclusive.

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as
huge pages are a scarce resource.
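
A condensed sketch of that reuse decision, based on the mm/memory.c
(do_wp_page()) and mm/huge_memory.c (do_huge_pmd_wp_page()) hunks below; the
function name is hypothetical, and page-lock retry and swapcache details are
omitted:

	static bool try_reuse_anon_page(struct page *page, struct vm_area_struct *vma)
	{
		/* Early check while only holding the PT lock. */
		if (PageAnonExclusive(page))
			return true;

		/* Otherwise we must verify exclusivity under the page lock. */
		if (!trylock_page(page))
			return false;	/* the real code drops the PT lock and retries */

		if (page_count(page) == 1) {
			/* Sole owner: remember it so the next fault takes the fast path. */
			page_move_anon_rmap(page, vma);	/* sets PG_anon_exclusive */
			unlock_page(page);
			return true;
		}
		unlock_page(page);
		return false;	/* possibly shared -> copy (COW) */
	}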

I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared
via page_try_share_anon_rmap(). If it fails because there are GUP pins
on the page, unmap fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at shared anonymous pages.
For readable migration entries that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information
about exclusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
information.
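
As a rough sketch (condensed from the mm/rmap.c hunks below; TLB flushing,
hugetlb and error-path details omitted), the per-PTE migration step looks
like:

	/* PTE already cleared/flushed, so GUP-fast cannot take new pins. */
	if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
		/* Page may be pinned: restore the PTE, migration will fail. */
		set_pte_at(mm, address, pvmw.pte, pteval);
		ret = false;
	} else {
		/* Encode exclusivity in the migration entry type. */
		if (pte_write(pteval))
			entry = make_writable_migration_entry(page_to_pfn(subpage));
		else if (anon_exclusive)
			entry = make_readable_exclusive_migration_entry(page_to_pfn(subpage));
		else
			entry = make_readable_migration_entry(page_to_pfn(subpage));
	}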

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. For now, information about exclusivity is lost. In the
future, we might want to remember that information in the swap entry in
some cases; however, that requires more thought, care, and a way to store
that information in swap entries.

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
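
A short sketch of that detection, mirroring the mm/memory.c hunk below
(simplified; the write-fault and uffd handling around it is omitted):

	rmap_t rmap_flags = RMAP_NONE;

	/* Freshly allocated or solely referenced: certainly not shared. */
	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1))
		rmap_flags |= RMAP_EXCLUSIVE;
	page_add_anon_rmap(page, vma, vmf->address, rmap_flags);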

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.

a) In case we have a readable-exclusive PMD migration entry, simply insert
readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze
("convert to migration entries"), simply forward PG_anon_exclusive to
all sub-pages, no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it
similar to try_to_migrate(): try marking the page shared first. In case
we fail, we ignore the "freeze" instruction and simply split ordinarily.
try_to_migrate() will properly fail because the THP is still mapped via
PTEs.

When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.

I.7. fork() handling

fork() handling is relatively easy, because PG_anon_exclusive is only
expressive for some page table entry types.

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which
will fail if the page is pinned. If it fails, we have to copy (or
PTE-map a PMD to handle it on the PTE level).

Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table entry and handle it just like present anonymous pages.

b) Device private entry

Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned.

page_try_dup_anon_rmap() will mark the given subpage shared, which
cannot fail because such pages cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable
entries: possibly shared.
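
A condensed sketch of case a) above (present anonymous pages), using
page_try_dup_anon_rmap() as introduced in the include/linux/rmap.h hunk below;
the copy fallback is a hypothetical stand-in for what copy_present_pte() /
copy_present_page() actually do:

	if (page && PageAnon(page)) {
		get_page(page);
		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
			/* Exclusive and maybe pinned: must copy instead of sharing. */
			put_page(page);
			return copy_page_into_child();	/* hypothetical helper */
		}
		/* Now marked possibly shared, mapped R/O into parent and child. */
		rss[mm_counter(page)]++;
	}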

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive
migration entry:

When write-protecting a migration entry that points at an anonymous
page, remember the information about exclusivity via the
"readable-exclusive" migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush to make the following scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins
   and marks the page possibly shared.
2. GUP-fast pins the page due to lack of synchronization
3. fork() converts the "writable/readable-exclusive" migration entry into a
   readable migration entry
4. Migration fails due to the GUP pin (failing to freeze the refcount)
5. Migration entries are restored. PG_anon_exclusive is lost

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the
migration entry as it would otherwise highly overcomplicate fork() and
PTE-mapping a THP.

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush to make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and
   clears exclusivity information on the page.
2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II would
apply. This is future work.
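
To make both scenarios impossible, the ordering matters: the page table entry
is cleared and flushed before attempting to mark the page shared. A minimal
sketch, mirroring the mm/migrate_device.c hunk below (locking and error
handling elided; the -EBUSY return is a hypothetical simplification):

	/* 1) Clear + flush the PTE: GUP-fast can no longer take new pins. */
	flush_cache_page(vma, addr, pte_pfn(*ptep));
	pte = ptep_clear_flush(vma, addr, ptep);

	/* 2) Only now check for pins and try to mark the page possibly shared. */
	if (page_try_share_anon_rmap(page)) {
		/* Pinned: restore the PTE and give up on unmapping/migrating. */
		set_pte_at(mm, addr, ptep, pte);
		return -EBUSY;
	}

	/* 3) Safe to install the swap/migration entry now. */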

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/rmap.h    | 40 +++++++++++++++++++++
 include/linux/swap.h    | 15 +++++---
 include/linux/swapops.h | 25 +++++++++++++
 mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++++++++++----
 mm/hugetlb.c            | 15 +++++---
 mm/ksm.c                | 13 ++++++-
 mm/memory.c             | 33 ++++++++++++-----
 mm/migrate.c            | 14 ++++++--
 mm/migrate_device.c     | 21 ++++++++++-
 mm/mprotect.c           |  8 +++--
 mm/rmap.c               | 61 +++++++++++++++++++++++++++++---
 11 files changed, 289 insertions(+), 34 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9c120e1b1bc7..b4a2c0044647 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -228,6 +228,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 {
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 
+	/*
+	 * No need to check+clear for already shared pages, including KSM
+	 * pages.
+	 */
+	if (!PageAnonExclusive(page))
+		goto dup;
+
 	/*
 	 * If this page may have been pinned by the parent process,
 	 * don't allow to duplicate the mapping but instead require to e.g.,
@@ -239,14 +246,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 	    unlikely(page_needs_cow_for_dma(vma, page))))
 		return -EBUSY;
 
+	ClearPageAnonExclusive(page);
 	/*
 	 * It's okay to share the anon page between both processes, mapping
 	 * the page R/O into both processes.
 	 */
+dup:
 	__page_dup_rmap(page, compound);
 	return 0;
 }
 
+/**
+ * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly
+ *			      shared to prepare for KSM or temporary unmapping
+ * @page: the exclusive anonymous page to try marking possibly shared
+ *
+ * The caller needs to hold the PT lock and has to have the page table entry
+ * cleared/invalidated+flushed, to properly sync against GUP-fast.
+ *
+ * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
+ * to duplicate a mapping, but instead to prepare for KSM or temporarily
+ * unmapping a page (swap, migration) via page_remove_rmap().
+ *
+ * Marking the page shared can only fail if the page may be pinned; device
+ * private pages cannot get pinned and consequently this function cannot fail.
+ *
+ * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY
+ * otherwise.
+ */
+static inline int page_try_share_anon_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
+
+	/* See page_try_dup_anon_rmap(). */
+	if (likely(!is_device_private_page(page) &&
+	    unlikely(page_maybe_dma_pinned(page))))
+		return -EBUSY;
+
+	ClearPageAnonExclusive(page);
+	return 0;
+}
+
 /*
  * Called from mm/vmscan.c to handle paging out
  */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 27093b477c5f..e6d70a4156e8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -78,12 +78,19 @@ static inline int current_is_kswapd(void)
 #endif
 
 /*
- * NUMA node memory migration support
+ * Page migration support.
+ *
+ * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and
+ * indicates that the referenced (part of) an anonymous page is exclusive to
+ * a single process. For SWP_MIGRATION_WRITE, that information is implicit:
+ * (part of) an anonymous page that are mapped writable are exclusive to a
+ * single process.
  */
 #ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_NUM 3
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
 #else
 #define SWP_MIGRATION_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..06280fc1c99b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
 static inline int is_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
+			swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
 			swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
@@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
+}
+
+static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
+}
+
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(SWP_MIGRATION_READ, offset);
 }
 
+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset);
+}
+
 static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
@@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 	return swp_entry(0, 0);
 }
 
+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
 static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
@@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
 }
+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+	return 0;
+}
 
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70298431e128..a74a3c5ae3a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (is_writable_migration_entry(entry)) {
+		if (!is_readable_migration_entry(entry)) {
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
@@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/* Early check when only holding the PT lock. */
+	if (PageAnonExclusive(page))
+		goto reuse;
+
 	if (!trylock_page(page)) {
 		get_page(page);
 		spin_unlock(vmf->ptl);
@@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		put_page(page);
 	}
 
+	/* Recheck after temporarily dropping the PT lock. */
+	if (PageAnonExclusive(page)) {
+		unlock_page(page);
+		goto reuse;
+	}
+
 	/*
 	 * See do_wp_page(): we can only map the page writable if there are
 	 * no additional references. Note that we always drain the LRU
@@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		pmd_t entry;
 
 		page_move_anon_rmap(page, vma);
+		unlock_page(page);
+reuse:
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-		unlock_page(page);
 		spin_unlock(vmf->ptl);
 		return VM_FAULT_WRITE;
 	}
@@ -1708,6 +1719,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (is_swap_pmd(*pmd)) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
+		struct page *page = pfn_swap_entry_to_page(entry);
 
 		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
 		if (is_writable_migration_entry(entry)) {
@@ -1716,8 +1728,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 * A protection check is difficult so
 			 * just be safe and disable write
 			 */
-			entry = make_readable_migration_entry(
-							swp_offset(entry));
+			if (PageAnon(page))
+				entry = make_readable_exclusive_migration_entry(swp_offset(entry));
+			else
+				entry = make_readable_migration_entry(swp_offset(entry));
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -1937,6 +1951,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
+	bool anon_exclusive = false;
 	unsigned long addr;
 	int i;
 
@@ -2018,6 +2033,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
+		if (PageAnon(page))
+			anon_exclusive = is_readable_exclusive_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2029,8 +2046,26 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
+
 		VM_BUG_ON_PAGE(!page_count(page), page);
 		page_ref_add(page, HPAGE_PMD_NR - 1);
+
+		/*
+		 * Without "freeze", we'll simply split the PMD, propagating the
+		 * PageAnonExclusive() flag for each PTE by setting it for
+		 * each subpage -- no need to (temporarily) clear.
+		 *
+		 * With "freeze" we want to replace mapped pages by
+		 * migration entries right away. This is only possible if we
+		 * managed to clear PageAnonExclusive() -- see
+		 * set_pmd_migration_entry().
+		 *
+		 * In case we cannot clear PageAnonExclusive(), split the PMD
+		 * only and let try_to_migrate_one() fail later.
+		 */
+		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+		if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
+			freeze = false;
 	}
 
 	/*
@@ -2052,6 +2087,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (write)
 				swp_entry = make_writable_migration_entry(
 							page_to_pfn(page + i));
+			else if (anon_exclusive)
+				swp_entry = make_readable_exclusive_migration_entry(
+							page_to_pfn(page + i));
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
@@ -2063,6 +2101,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			entry = maybe_mkwrite(entry, vma);
+			if (anon_exclusive)
+				SetPageAnonExclusive(page + i);
 			if (!write)
 				entry = pte_wrprotect(entry);
 			if (!young)
@@ -2295,6 +2335,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 *
 	 * After successful get_page_unless_zero() might follow flags change,
 	 * for example lock_page() which set PG_waiters.
+	 *
+	 * Note that for mapped sub-pages of an anonymous THP,
+	 * PG_anon_exclusive has been cleared in unmap_page() and is stored in
+	 * the migration entry instead from where remap_page() will restore it.
+	 * We can still have PG_anon_exclusive set on effectively unmapped and
+	 * unreferenced sub-pages of an anonymous THP: we can simply drop
+	 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 	 */
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
@@ -3026,6 +3073,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	struct vm_area_struct *vma = pvmw->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address = pvmw->address;
+	bool anon_exclusive;
 	pmd_t pmdval;
 	swp_entry_t entry;
 	pmd_t pmdswp;
@@ -3035,10 +3083,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
 	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+
+	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+	if (anon_exclusive && page_try_share_anon_rmap(page)) {
+		set_pmd_at(mm, address, pvmw->pmd, pmdval);
+		return;
+	}
+
 	if (pmd_dirty(pmdval))
 		set_page_dirty(page);
 	if (pmd_write(pmdval))
 		entry = make_writable_migration_entry(page_to_pfn(page));
+	else if (anon_exclusive)
+		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
 	pmdswp = swp_entry_to_pmd(entry);
@@ -3072,10 +3129,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
 
-	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND);
-	else
+	if (PageAnon(new)) {
+		rmap_t rmap_flags = RMAP_COMPOUND;
+
+		if (!is_readable_migration_entry(entry))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
+		page_add_anon_rmap(new, vma, mmun_start, rmap_flags);
+	} else {
 		page_add_file_rmap(new, vma, true);
+	}
+	VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new));
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
 
 	/* No need to invalidate - it was non-present before */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8e7667116fd..6910545f028e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4769,7 +4769,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
 
-			if (is_writable_migration_entry(swp_entry) && cow) {
+			if (!is_readable_migration_entry(swp_entry) && cow) {
 				/*
 				 * COW mappings require pages in both
 				 * parent and child to be set to read.
@@ -5169,6 +5169,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
+	VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
+		       old_page);
 
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
@@ -6166,12 +6168,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		}
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
+			struct page *page = pfn_swap_entry_to_page(entry);
 
-			if (is_writable_migration_entry(entry)) {
+			if (!is_readable_migration_entry(entry)) {
 				pte_t newpte;
 
-				entry = make_readable_migration_entry(
-							swp_offset(entry));
+				if (PageAnon(page))
+					entry = make_readable_exclusive_migration_entry(
+								swp_offset(entry));
+				else
+					entry = make_readable_migration_entry(
+								swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
diff --git a/mm/ksm.c b/mm/ksm.c
index e0fb748e37b3..8d5369425c62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
+	VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page);
 	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
@@ -1038,6 +1039,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	int swapped;
 	int err = -EFAULT;
 	struct mmu_notifier_range range;
+	bool anon_exclusive;
 
 	pvmw.address = page_address_in_vma(page, vma);
 	if (pvmw.address == -EFAULT)
@@ -1055,9 +1057,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
 		goto out_unlock;
 
+	anon_exclusive = PageAnonExclusive(page);
 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
-						mm_tlb_flush_pending(mm)) {
+	    anon_exclusive || mm_tlb_flush_pending(mm)) {
 		pte_t entry;
 
 		swapped = PageSwapCache(page);
@@ -1085,6 +1088,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
 			goto out_unlock;
 		}
+
+		if (anon_exclusive && page_try_share_anon_rmap(page)) {
+			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+			goto out_unlock;
+		}
+
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 
@@ -1143,6 +1152,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 		pte_unmap_unlock(ptep, ptl);
 		goto out_mn;
 	}
+	VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
+	VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage);
 
 	/*
 	 * No need to check ksm_use_zero_pages here: we can only have a
diff --git a/mm/memory.c b/mm/memory.c
index 351623292adf..d3596e2eaee6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -720,6 +720,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 	else if (is_writable_device_exclusive_entry(entry))
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 
+	VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page)));
+
 	/*
 	 * No need to take a page reference as one was already
 	 * created when the swap entry was made.
@@ -796,11 +798,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 		rss[mm_counter(page)]++;
 
-		if (is_writable_migration_entry(entry) &&
+		if (!is_readable_migration_entry(entry) &&
 				is_cow_mapping(vm_flags)) {
 			/*
-			 * COW mappings require pages in both
-			 * parent and child to be set to read.
+			 * COW mappings require pages in both parent and child
+			 * to be set to read. A previously exclusive entry is
+			 * now shared.
 			 */
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
@@ -951,6 +954,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
+	VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
 
 	/*
 	 * If it's a shared mapping, mark it clean in
@@ -2949,6 +2953,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = vmf->page;
 	pte_t entry;
+
+	VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page));
+
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -3273,6 +3280,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	if (PageAnon(vmf->page)) {
 		struct page *page = vmf->page;
 
+		/*
+		 * If the page is exclusive to this process we must reuse the
+		 * page without further checks.
+		 */
+		if (PageAnonExclusive(page))
+			goto reuse;
+
 		/*
 		 * We have to verify under page lock: these early checks are
 		 * just an optimization to avoid locking the page and freeing
@@ -3305,6 +3319,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 */
 		page_move_anon_rmap(page, vma);
 		unlock_page(page);
+reuse:
 		wp_page_reuse(vmf);
 		return VM_FAULT_WRITE;
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
@@ -3692,11 +3707,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * that are certainly not shared because we just allocated them without
 	 * exposing them to the swapcache.
 	 */
-	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
-	    (page != swapcache || page_count(page) == 1)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
-		vmf->flags &= ~FAULT_FLAG_WRITE;
-		ret |= VM_FAULT_WRITE;
+	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+			vmf->flags &= ~FAULT_FLAG_WRITE;
+			ret |= VM_FAULT_WRITE;
+		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
@@ -3716,6 +3732,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
 	}
 
+	VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page)));
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 48db9500d20e..231907e89b93 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -177,6 +177,7 @@ static bool remove_migration_pte(struct folio *folio,
 	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		rmap_t rmap_flags = RMAP_NONE;
 		pte_t pte;
 		swp_entry_t entry;
 		struct page *new;
@@ -211,6 +212,9 @@ static bool remove_migration_pte(struct folio *folio,
 		else if (pte_swp_uffd_wp(*pvmw.pte))
 			pte = pte_mkuffd_wp(pte);
 
+		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
 		if (unlikely(is_device_private_page(new))) {
 			if (pte_write(pte))
 				entry = make_writable_device_private_entry(
@@ -233,7 +237,7 @@ static bool remove_migration_pte(struct folio *folio,
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			if (folio_test_anon(folio))
 				hugepage_add_anon_rmap(new, vma, pvmw.address,
-						       RMAP_NONE);
+						       rmap_flags);
 			else
 				page_dup_file_rmap(new, true);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
@@ -242,7 +246,7 @@ static bool remove_migration_pte(struct folio *folio,
 		{
 			if (folio_test_anon(folio))
 				page_add_anon_rmap(new, vma, pvmw.address,
-						   RMAP_NONE);
+						   rmap_flags);
 			else
 				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
@@ -519,6 +523,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
 		folio_set_workingset(newfolio);
 	if (folio_test_checked(folio))
 		folio_set_checked(newfolio);
+	/*
+	 * PG_anon_exclusive (-> PG_mappedtodisk) is always migrated via
+	 * migration entries. We can still have PG_anon_exclusive set on an
+	 * effectively unmapped and unreferenced first sub-pages of an
+	 * anonymous THP: we can simply copy it here via PG_mappedtodisk.
+	 */
 	if (folio_test_mappedtodisk(folio))
 		folio_set_mappedtodisk(newfolio);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index fb6d7d5499f5..5052093d0262 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -184,15 +184,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 * set up a special migration page table entry now.
 		 */
 		if (trylock_page(page)) {
+			bool anon_exclusive;
 			pte_t swp_pte;
 
+			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+			if (anon_exclusive) {
+				flush_cache_page(vma, addr, pte_pfn(*ptep));
+				ptep_clear_flush(vma, addr, ptep);
+
+				if (page_try_share_anon_rmap(page)) {
+					set_pte_at(mm, addr, ptep, pte);
+					unlock_page(page);
+					put_page(page);
+					mpfn = 0;
+					goto next;
+				}
+			} else {
+				ptep_get_and_clear(mm, addr, ptep);
+			}
+
 			migrate->cpages++;
-			ptep_get_and_clear(mm, addr, ptep);
 
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(
 							page_to_pfn(page));
+			else if (anon_exclusive)
+				entry = make_readable_exclusive_migration_entry(
+							page_to_pfn(page));
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..56060acdabd3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,6 +152,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
+			struct page *page = pfn_swap_entry_to_page(entry);
 			pte_t newpte;
 
 			if (is_writable_migration_entry(entry)) {
@@ -159,8 +160,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * A protection check is difficult so
 				 * just be safe and disable write
 				 */
-				entry = make_readable_migration_entry(
-							swp_offset(entry));
+				if (PageAnon(page))
+					entry = make_readable_exclusive_migration_entry(
+							     swp_offset(entry));
+				else
+					entry = make_readable_migration_entry(swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 517f56edf6ce..4de07234cbcf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1044,6 +1044,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean);
 void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
+	struct page *subpage = page;
 
 	page = compound_head(page);
 
@@ -1057,6 +1058,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
 	 * folio_test_anon()) will not see one without the other.
 	 */
 	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
+	SetPageAnonExclusive(subpage);
 }
 
 /**
@@ -1074,7 +1076,7 @@ static void __page_set_anon_rmap(struct page *page,
 	BUG_ON(!anon_vma);
 
 	if (PageAnon(page))
-		return;
+		goto out;
 
 	/*
 	 * If the page isn't exclusively mapped into this vma,
@@ -1093,6 +1095,9 @@ static void __page_set_anon_rmap(struct page *page,
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
+out:
+	if (exclusive)
+		SetPageAnonExclusive(page);
 }
 
 /**
@@ -1154,6 +1159,8 @@ void page_add_anon_rmap(struct page *page,
 	} else {
 		first = atomic_inc_and_test(&page->_mapcount);
 	}
+	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
+	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 
 	if (first) {
 		int nr = compound ? thp_nr_pages(page) : 1;
@@ -1417,7 +1424,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
 	struct page *subpage;
-	bool ret = true;
+	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
@@ -1473,6 +1480,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		address = pvmw.address;
+		anon_exclusive = folio_test_anon(folio) &&
+				 PageAnonExclusive(subpage);
 
 		if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
 			/*
@@ -1508,9 +1517,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 		}
 
-		/* Nuke the page table entry. */
+		/*
+		 * Nuke the page table entry. When having to clear
+		 * PageAnonExclusive(), we always have to flush.
+		 */
 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-		if (should_defer_flush(mm, flags)) {
+		if (should_defer_flush(mm, flags) && !anon_exclusive) {
 			/*
 			 * We clear the PTE but do not flush so potentially
 			 * a remote CPU could still be writing to the folio.
@@ -1635,6 +1647,24 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
+			if (anon_exclusive &&
+			    page_try_share_anon_rmap(subpage)) {
+				swap_free(entry);
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+			/*
+			 * Note: We *don't* remember yet if the page was mapped
+			 * exclusively in the swap entry, so swapin code has
+			 * to re-determine that manually and might detect the
+			 * page as possibly shared, for example, if there are
+			 * other references on the page or if the page is under
+			 * writeback. We made sure that there are no GUP pins
+			 * on the page that would rely on it, so for GUP pins
+			 * this is fine.
+			 */
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
 				if (list_empty(&mm->mmlist))
@@ -1734,7 +1764,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
 	struct page *subpage;
-	bool ret = true;
+	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
@@ -1795,6 +1825,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		subpage = folio_page(folio,
 				pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		address = pvmw.address;
+		anon_exclusive = folio_test_anon(folio) &&
+				 PageAnonExclusive(subpage);
 
 		if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
 			/*
@@ -1846,6 +1878,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t swp_pte;
 
+			if (anon_exclusive)
+				BUG_ON(page_try_share_anon_rmap(subpage));
+
 			/*
 			 * Store the pfn of the page in a special migration
 			 * pte. do_swap_page() will wait until the migration
@@ -1854,6 +1889,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			entry = pte_to_swp_entry(pteval);
 			if (is_writable_device_private_entry(entry))
 				entry = make_writable_migration_entry(pfn);
+			else if (anon_exclusive)
+				entry = make_readable_exclusive_migration_entry(pfn);
 			else
 				entry = make_readable_migration_entry(pfn);
 			swp_pte = swp_entry_to_pte(entry);
@@ -1918,6 +1955,15 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
+			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
+				       !anon_exclusive, subpage);
+			if (anon_exclusive &&
+			    page_try_share_anon_rmap(subpage)) {
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
 
 			/*
 			 * Store the pfn of the page in a special migration
@@ -1927,6 +1973,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			if (pte_write(pteval))
 				entry = make_writable_migration_entry(
 							page_to_pfn(subpage));
+			else if (anon_exclusive)
+				entry = make_readable_exclusive_migration_entry(
+							page_to_pfn(subpage));
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
@@ -2363,6 +2412,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
 	first = atomic_inc_and_test(compound_mapcount_ptr(page));
+	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
+	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
 				     !!(flags & RMAP_EXCLUSIVE));
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN)
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (11 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-14 15:18   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages David Hildenbrand
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

We want to change the way we handle R/O pins on anonymous pages that
might be shared: if we detect a possibly shared anonymous page --
mapped R/O and !PageAnonExclusive() -- we want to trigger unsharing
via a page fault, resulting in an exclusive anonymous page that can be
pinned reliably without getting replaced via COW on the next write
fault.

However, the required page fault will be problematic for follow_page():
in contrast to ordinary GUP, follow_page() doesn't trigger faults
internally. So we would have no choice but to fail a R/O pin via
follow_page(), even though something is mapped R/O into the page
table, which might be rather surprising.

We don't seem to have any follow_page(FOLL_PIN) users, and it's a purely
internal MM function. Let's make our life easier and the semantics of
follow_page() clearer by disallowing FOLL_PIN for follow_page()
completely.
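
As a rough illustration only (hypothetical helper, not part of this patch):
in-kernel code that needs a FOLL_PIN reference keeps using the regular GUP
interfaces such as pin_user_pages_fast(), which can fault as needed, while
follow_page(FOLL_PIN) now simply returns NULL:

#include <linux/mm.h>

/*
 * Sketch: pin one user page for DMA-like use and release it again. A
 * FOLL_PIN reference has to come from the regular GUP interfaces; it can
 * no longer be obtained via follow_page().
 */
static int pin_one_user_page_sketch(unsigned long addr)
{
	struct page *page;
	int ret;

	ret = pin_user_pages_fast(addr, 1, FOLL_LONGTERM, &page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* ... hand the page to the device, do the I/O ... */

	unpin_user_page(page);
	return 0;
}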

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c     | 3 +++
 mm/hugetlb.c | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 271fbe8195d7..f96fc415ea6c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -787,6 +787,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	if (vma_is_secretmem(vma))
 		return NULL;
 
+	if (foll_flags & FOLL_PIN)
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6910545f028e..75b689ce21c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6698,9 +6698,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	spinlock_t *ptl;
 	pte_t pte;
 
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
-			 (FOLL_PIN | FOLL_GET)))
+	/*
+	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
+	 * follow_hugetlb_page().
+	 */
+	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
 retry:
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (12 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN) David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-14 17:15   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page David Hildenbrand
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.

The possible ways to deal with this situation are:
 (1) Ignore and pin -- what we do right now.
 (2) Fail to pin -- which would be rather surprising to callers and
     could break user space.
 (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
     pins.

We want to implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly
something went very wrong, possibly resulting in memory corruptions
that are hard to debug. So we better have a nice way to spot such
issues.

To implement 3), we need a way for GUP to trigger unsharing:
FAULT_FLAG_UNSHARE. FAULT_FLAG_UNSHARE is only applicable to R/O mapped
anonymous pages and resembles COW logic during a write fault. However, in
contrast to a write fault, GUP-triggered unsharing will, for example, still
maintain the write protection.
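
For illustration only (the GUP side is wired up in a later patch of this
series and the helper name is made up): requesting unsharing looks like an
ordinary fault, just without FAULT_FLAG_WRITE, so the now exclusive
anonymous page stays mapped R/O:

#include <linux/mm.h>

/*
 * Sketch: make the anonymous page mapped R/O at @address exclusive to
 * this MM without mapping it writable. Assumes the mmap_lock is held for
 * reading; the retry/killable handling of real callers is omitted.
 */
static vm_fault_t unshare_at_sketch(struct vm_area_struct *vma,
				    unsigned long address)
{
	return handle_mm_fault(vma, address, FAULT_FLAG_UNSHARE, NULL);
}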

Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write fault
handlers for all applicable anonymous page types: ordinary pages, THP and
hugetlb.

* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
  marked exclusive in the meantime by someone else, there is nothing to do.
* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
  marked exclusive, it will try detecting if the process is the exclusive
  owner. If exclusive, it can be set exclusive similar to reuse logic
  during write faults via page_move_anon_rmap() and there is nothing
  else to do; otherwise, we either have to copy and map a fresh,
  anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
  THP.

This commit is heavily based on patches by Andrea.
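
To make the above a bit more concrete, here is a heavily simplified sketch
(not the actual patch; the helper name is made up, it is written as if it
lived in mm/memory.c, and the real code additionally handles the swapcache,
the page lock and speculative references) of the unshare handling for an
ordinary R/O-mapped anonymous page, with the PT lock held and vmf->page set:

static vm_fault_t unshare_anon_pte_sketch(struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	if (PageAnonExclusive(page)) {
		/* Already marked exclusive: nothing to do. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;
	}
	if (page_count(page) == 1) {
		/* We are the exclusive owner: mark it, keep it R/O. */
		page_move_anon_rmap(page, vmf->vma);
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;
	}
	/* Possibly shared: map a fresh exclusive copy R/O instead. */
	get_page(page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return wp_page_copy(vmf);
}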

Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_types.h |   8 +++
 mm/huge_memory.c         |  10 +++-
 mm/hugetlb.c             |  56 ++++++++++++--------
 mm/memory.c              | 107 +++++++++++++++++++++++++++------------
 4 files changed, 126 insertions(+), 55 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..02019bc8f30e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -812,6 +812,9 @@ typedef struct {
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
+ *                      exclusive) a possibly shared anonymous page that is
+ *                      mapped R/O.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -831,6 +834,10 @@ typedef struct {
  * continuous faults with flags (b).  We should always try to detect pending
  * signals before a retry to make sure the continuous page faults can still be
  * interrupted if necessary.
+ *
+ * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal.
+ * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when
+ * no existing R/O-mapped anonymous page is encountered.
  */
 enum fault_flag {
 	FAULT_FLAG_WRITE =		1 << 0,
@@ -843,6 +850,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_UNSHARE =		1 << 10,
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a74a3c5ae3a6..8560e234ab4d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1271,6 +1271,7 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
 
 vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
+	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
@@ -1279,6 +1280,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
 
+	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
+	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
+
 	if (is_huge_zero_pmd(orig_pmd))
 		goto fallback;
 
@@ -1317,7 +1321,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * See do_wp_page(): we can only map the page writable if there are
+	 * See do_wp_page(): we can only reuse the page exclusively if there are
 	 * no additional references. Note that we always drain the LRU
 	 * pagevecs immediately after adding a THP.
 	 */
@@ -1331,6 +1335,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		page_move_anon_rmap(page, vma);
 		unlock_page(page);
 reuse:
+		if (unlikely(unshare)) {
+			spin_unlock(vmf->ptl);
+			return 0;
+		}
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 75b689ce21c5..366a1f704405 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5141,15 +5141,16 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * Hugetlb_cow() should be called with page lock of the original hugepage held.
+ * hugetlb_wp() should be called with page lock of the original hugepage held.
  * Called with hugetlb_fault_mutex_table held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
-static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
-		       unsigned long address, pte_t *ptep,
+static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, pte_t *ptep, unsigned int flags,
 		       struct page *pagecache_page, spinlock_t *ptl)
 {
+	const bool unshare = flags & FAULT_FLAG_UNSHARE;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
@@ -5158,15 +5159,22 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr = address & huge_page_mask(h);
 	struct mmu_notifier_range range;
 
+	VM_BUG_ON(unshare && (flags & FAULT_FLAG_WRITE));
+	VM_BUG_ON(!unshare && !(flags & FAULT_FLAG_WRITE));
+
 	pte = huge_ptep_get(ptep);
 	old_page = pte_page(pte);
 
 retry_avoidcopy:
-	/* If no-one else is actually using this page, avoid the copy
-	 * and just make the page writable */
+	/*
+	 * If no-one else is actually using this page, we're the exclusive
+	 * owner and can reuse this page.
+	 */
 	if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
-		page_move_anon_rmap(old_page, vma);
-		set_huge_ptep_writable(vma, haddr, ptep);
+		if (!PageAnonExclusive(old_page))
+			page_move_anon_rmap(old_page, vma);
+		if (likely(!unshare))
+			set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
 	VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
@@ -5269,13 +5277,13 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
 		ClearHPageRestoreReserve(new_page);
 
-		/* Break COW */
+		/* Break COW or unshare */
 		huge_ptep_clear_flush(vma, haddr, ptep);
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		page_remove_rmap(old_page, vma, true);
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		set_huge_pte_at(mm, haddr, ptep,
-				make_huge_pte(vma, new_page, 1));
+				make_huge_pte(vma, new_page, !unshare));
 		SetHPageMigratable(new_page);
 		/* Make the old page be freed below */
 		new_page = old_page;
@@ -5283,7 +5291,10 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 out_release_all:
-	/* No restore in case of successful pagetable update (Break COW) */
+	/*
+	 * No restore in case of successful pagetable update (Break COW or
+	 * unshare)
+	 */
 	if (new_page != old_page)
 		restore_reserve_on_error(h, vma, haddr, new_page);
 	put_page(new_page);
@@ -5408,7 +5419,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	/*
 	 * Currently, we are forced to kill the process in the event the
 	 * original mapper has unmapped pages from the child due to a failed
-	 * COW. Warn that such a situation has occurred as it may not be obvious
+	 * COW/unsharing. Warn that such a situation has occurred as it may not
+	 * be obvious.
 	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
 		pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
@@ -5534,7 +5546,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, page, ptl);
+		ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -5664,14 +5676,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_mutex;
 
 	/*
-	 * If we are going to COW the mapping later, we examine the pending
-	 * reservations for this page now. This will ensure that any
+	 * If we are going to COW/unshare the mapping later, we examine the
+	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
 	 * spinlock. For private mappings, we also lookup the pagecache
 	 * page now as it is used to determine if a reservation has been
 	 * consumed.
 	 */
-	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+	    !huge_pte_write(entry)) {
 		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
 			goto out_mutex;
@@ -5686,12 +5699,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 
-	/* Check for a racing update before calling hugetlb_cow */
+	/* Check for a racing update before calling hugetlb_wp() */
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
 	/*
-	 * hugetlb_cow() requires page locks of pte_page(entry) and
+	 * hugetlb_wp() requires page locks of pte_page(entry) and
 	 * pagecache_page, so here we need take the former one
 	 * when page != pagecache_page or !pagecache_page.
 	 */
@@ -5704,13 +5717,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	get_page(page);
 
-	if (flags & FAULT_FLAG_WRITE) {
+	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_cow(mm, vma, address, ptep,
-					  pagecache_page, ptl);
+			ret = hugetlb_wp(mm, vma, address, ptep, flags,
+					 pagecache_page, ptl);
 			goto out_put_page;
+		} else if (likely(flags & FAULT_FLAG_WRITE)) {
+			entry = huge_pte_mkdirty(entry);
 		}
-		entry = huge_pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
 	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
diff --git a/mm/memory.c b/mm/memory.c
index d3596e2eaee6..14618f446139 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2745,8 +2745,8 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
-static inline bool cow_user_page(struct page *dst, struct page *src,
-				 struct vm_fault *vmf)
+static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
+				       struct vm_fault *vmf)
 {
 	bool ret;
 	void *kaddr;
@@ -2954,6 +2954,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	pte_t entry;
 
+	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
 	VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page));
 
 	/*
@@ -2974,7 +2975,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 }
 
 /*
- * Handle the case of a page which we actually need to copy to a new page.
+ * Handle the case of a page which we actually need to copy to a new page,
+ * either due to COW or unsharing.
  *
  * Called with mmap_lock locked and the old page referenced, but
  * without the ptl held.
@@ -2991,6 +2993,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
  */
 static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 {
+	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *old_page = vmf->page;
@@ -3013,7 +3016,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 
-		if (!cow_user_page(new_page, old_page, vmf)) {
+		if (!__wp_page_copy_user(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,
 			 * it's fine. If not, userspace would re-fault on
@@ -3055,7 +3058,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (unlikely(unshare)) {
+			if (pte_soft_dirty(vmf->orig_pte))
+				entry = pte_mksoft_dirty(entry);
+			if (pte_uffd_wp(vmf->orig_pte))
+				entry = pte_mkuffd_wp(entry);
+		} else {
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		}
 
 		/*
 		 * Clear the pte entry and flush it first, before updating the
@@ -3072,6 +3082,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
+		BUG_ON(unshare && pte_write(entry));
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		if (old_page) {
@@ -3121,7 +3132,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			free_swap_cache(old_page);
 		put_page(old_page);
 	}
-	return page_copied ? VM_FAULT_WRITE : 0;
+	return page_copied && !unshare ? VM_FAULT_WRITE : 0;
 oom_free_new:
 	put_page(new_page);
 oom:
@@ -3221,18 +3232,22 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
 }
 
 /*
- * This routine handles present pages, when users try to write
- * to a shared page. It is done by copying the page to a new address
- * and decrementing the shared-page counter for the old page.
+ * This routine handles present pages, when
+ * * users try to write to a shared page (FAULT_FLAG_WRITE)
+ * * GUP wants to take a R/O pin on a possibly shared anonymous page
+ *   (FAULT_FLAG_UNSHARE)
+ *
+ * It is done by copying the page to a new address and decrementing the
+ * shared-page counter for the old page.
  *
  * Note that this routine assumes that the protection checks have been
  * done by the caller (the low-level page fault routine in most cases).
- * Thus we can safely just mark it writable once we've done any necessary
- * COW.
+ * Thus, with FAULT_FLAG_WRITE, we can safely just mark it writable once we've
+ * done any necessary COW.
  *
- * We also mark the page dirty at this point even though the page will
- * change only once the write actually happens. This avoids a few races,
- * and potentially makes it more efficient.
+ * In case of FAULT_FLAG_WRITE, we also mark the page dirty at this point even
+ * though the page will change only once the write actually happens. This
+ * avoids a few races, and potentially makes it more efficient.
  *
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), with pte both mapped and locked.
@@ -3241,23 +3256,35 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
 static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	__releases(vmf->ptl)
 {
+	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return handle_userfault(vmf, VM_UFFD_WP);
-	}
+	VM_BUG_ON(unshare && (vmf->flags & FAULT_FLAG_WRITE));
+	VM_BUG_ON(!unshare && !(vmf->flags & FAULT_FLAG_WRITE));
 
-	/*
-	 * Userfaultfd write-protect can defer flushes. Ensure the TLB
-	 * is flushed in this case before copying.
-	 */
-	if (unlikely(userfaultfd_wp(vmf->vma) &&
-		     mm_tlb_flush_pending(vmf->vma->vm_mm)))
-		flush_tlb_page(vmf->vma, vmf->address);
+	if (likely(!unshare)) {
+		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			return handle_userfault(vmf, VM_UFFD_WP);
+		}
+
+		/*
+		 * Userfaultfd write-protect can defer flushes. Ensure the TLB
+		 * is flushed in this case before copying.
+		 */
+		if (unlikely(userfaultfd_wp(vmf->vma) &&
+			     mm_tlb_flush_pending(vmf->vma->vm_mm)))
+			flush_tlb_page(vmf->vma, vmf->address);
+	}
 
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
+		if (unlikely(unshare)) {
+			/* No anonymous page -> nothing to do. */
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			return 0;
+		}
+
 		/*
 		 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
 		 * VM_PFNMAP VMA.
@@ -3320,8 +3347,16 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		page_move_anon_rmap(page, vma);
 		unlock_page(page);
 reuse:
+		if (unlikely(unshare)) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			return 0;
+		}
 		wp_page_reuse(vmf);
 		return VM_FAULT_WRITE;
+	} else if (unshare) {
+		/* No anonymous page -> nothing to do. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		return wp_page_shared(vmf);
@@ -4515,8 +4550,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
+	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
+
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
+		if (likely(!unshare) &&
+		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf);
 	}
@@ -4651,10 +4689,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
-		entry = pte_mkdirty(entry);
+		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
+			entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
@@ -4695,7 +4734,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4720,9 +4758,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		barrier();
 		if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
 
-			/* NUMA case for anonymous PUDs would go here */
-
-			if (dirty && !pud_write(orig_pud)) {
+			/*
+			 * TODO once we support anonymous PUDs: NUMA case and
+			 * FAULT_FLAG_UNSHARE handling.
+			 */
+			if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4760,7 +4800,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf);
 
-			if (dirty && !pmd_write(vmf.orig_pmd)) {
+			if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+			    !pmd_write(vmf.orig_pmd)) {
 				ret = wp_huge_pmd(&vmf);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (13 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-19 15:56   ` Vlastimil Babka
  2022-03-29 16:04 ` [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning David Hildenbrand
  2022-03-29 16:09 ` [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.

The possible ways to deal with this situation are:
 (1) Ignore and pin -- what we do right now.
 (2) Fail to pin -- which would be rather surprising to callers and
     could break user space.
 (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
     pins.

Let's implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly
something went very wrong, possibly resulting in memory corruptions
that are hard to debug. So we better have a nice way to spot such
issues.

This change implies that whenever user space *wrote* to a private
mapping (IOW, we have an anonymous page mapped), GUP pins will
always remain consistent: reliable R/O GUP pins of anonymous pages.
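
For illustration only (hypothetical in-kernel user, the helper name is made
up): a R/O pin taken like this now first triggers FAULT_FLAG_UNSHARE if the
anonymous page is possibly shared (e.g., after fork()), so the pinned page
is PageAnonExclusive() and cannot get lost to a later COW:

#include <linux/mm.h>

/*
 * Sketch: take a R/O pin (no FOLL_WRITE) on one page of a MAP_PRIVATE
 * anonymous mapping of the current process.
 */
static int pin_ro_user_page_sketch(unsigned long addr, struct page **page)
{
	long ret;

	mmap_read_lock(current->mm);
	ret = pin_user_pages(addr, 1, 0, page, NULL);
	mmap_read_unlock(current->mm);

	if (ret != 1)
		return ret < 0 ? (int)ret : -EFAULT;
	return 0;
}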

As a side note, this commit fixes the COW security issue for hugetlb with
FOLL_PIN as documented in:
  https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
instead of FOLL_PIN.

Note that follow_huge_pmd() doesn't apply because we cannot end up in
there with FOLL_PIN.

This commit is heavily based on prototype patches by Andrea.

Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
 mm/gup.c           | 42 +++++++++++++++++++++++++++++++++++++++---
 mm/huge_memory.c   |  3 +++
 mm/hugetlb.c       | 27 ++++++++++++++++++++++++---
 4 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dfc4ec83f76e..26428ff262fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3005,6 +3005,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 	return 0;
 }
 
+/*
+ * Indicates for which pages that are write-protected in the page table,
+ * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
+ * GUP pin will remain consistent with the pages mapped into the page tables
+ * of the MM.
+ *
+ * Temporary unmapping of PageAnonExclusive() pages or clearing of
+ * PageAnonExclusive() has to protect against concurrent GUP:
+ * * Ordinary GUP: Using the PT lock
+ * * GUP-fast and fork(): mm->write_protect_seq
+ * * GUP-fast and KSM or temporary unmapping (swap, migration):
+ *   clear/invalidate+flush of the page table entry
+ *
+ * Must be called with the (sub)page that's actually referenced via the
+ * page table entry, which might not necessarily be the head page for a
+ * PTE-mapped THP.
+ */
+static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+{
+	/*
+	 * FOLL_WRITE is implicitly handled correctly as the page table entry
+	 * has to be writable -- and if it references (part of) an anonymous
+	 * folio, that part is required to be marked exclusive.
+	 */
+	if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
+		return false;
+	/*
+	 * Note: PageAnon(page) is stable until the page is actually getting
+	 * freed.
+	 */
+	if (!PageAnon(page))
+		return false;
+	/*
+	 * Note that PageKsm() pages cannot be exclusive, and consequently,
+	 * cannot get pinned.
+	 */
+	return !PageAnonExclusive(page);
+}
+
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index f96fc415ea6c..6060823f9be8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -506,6 +506,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		}
 	}
 
+	if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+		page = ERR_PTR(-EMLINK);
+		goto out;
+	}
 	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
 	if (unlikely(!try_grab_page(page, flags))) {
 		page = ERR_PTR(-ENOMEM);
@@ -732,6 +736,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
  * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
  * the device's dev_pagemap metadata to avoid repeating expensive lookups.
  *
+ * When getting an anonymous page and the caller has to trigger unsharing
+ * of a shared anonymous page first, -EMLINK is returned. The caller should
+ * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
+ * relevant with FOLL_PIN and !FOLL_WRITE.
+ *
  * On output, the @ctx->page_mask is set according to the size of the page.
  *
  * Return: the mapped (struct page *), %NULL if no mapping exists, or
@@ -855,7 +864,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
  * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *locked)
+		unsigned long address, unsigned int *flags, bool unshare,
+		int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -877,6 +887,11 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (unshare) {
+		fault_flags |= FAULT_FLAG_UNSHARE;
+		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
+		VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
+	}
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
@@ -1098,8 +1113,9 @@ static long __get_user_pages(struct mm_struct *mm,
 		cond_resched();
 
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
-		if (!page) {
-			ret = faultin_page(vma, start, &foll_flags, locked);
+		if (!page || PTR_ERR(page) == -EMLINK) {
+			ret = faultin_page(vma, start, &foll_flags,
+					   PTR_ERR(page) == -EMLINK, locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -2195,6 +2211,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;
 		}
 
+		if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+			gup_put_folio(folio, 1, flags);
+			goto pte_unmap;
+		}
+
 		/*
 		 * We need to make the page accessible if and only if we are
 		 * going to access its content (the FOLL_PIN case).  Please
@@ -2375,6 +2396,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
+	if (!pte_write(pte) && gup_must_unshare(flags, &folio->page)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	folio_set_referenced(folio);
 	return 1;
@@ -2436,6 +2462,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
+	if (!pmd_write(orig) && gup_must_unshare(flags, &folio->page)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	folio_set_referenced(folio);
 	return 1;
@@ -2471,6 +2502,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
+	if (!pud_write(orig) && gup_must_unshare(flags, &folio->page)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	folio_set_referenced(folio);
 	return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8560e234ab4d..2dc820e8c873 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1389,6 +1389,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 
+	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
+		return ERR_PTR(-EMLINK);
+
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 366a1f704405..21f2ec446117 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5961,6 +5961,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
 	}
 }
 
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+					       bool *unshare)
+{
+	pte_t pteval = huge_ptep_get(pte);
+
+	*unshare = false;
+	if (is_swap_pte(pteval))
+		return true;
+	if (huge_pte_write(pteval))
+		return false;
+	if (flags & FOLL_WRITE)
+		return true;
+	if (gup_must_unshare(flags, pte_page(pteval))) {
+		*unshare = true;
+		return true;
+	}
+	return false;
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
@@ -5975,6 +5994,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
 		spinlock_t *ptl = NULL;
+		bool unshare = false;
 		int absent;
 		struct page *page;
 
@@ -6025,9 +6045,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * both cases, and because we can't follow correct pages
 		 * directly from any kind of swap entries.
 		 */
-		if (absent || is_swap_pte(huge_ptep_get(pte)) ||
-		    ((flags & FOLL_WRITE) &&
-		      !huge_pte_write(huge_ptep_get(pte)))) {
+		if (absent ||
+		    __follow_hugetlb_must_fault(flags, pte, &unshare)) {
 			vm_fault_t ret;
 			unsigned int fault_flags = 0;
 
@@ -6035,6 +6054,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
+			else if (unshare)
+				fault_flags |= FAULT_FLAG_UNSHARE;
 			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
 					FAULT_FLAG_KILLABLE;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (14 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page David Hildenbrand
@ 2022-03-29 16:04 ` David Hildenbrand
  2022-04-19 17:40   ` Vlastimil Babka
  2022-03-29 16:09 ` [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
  16 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm,
	David Hildenbrand

Let's verify when (un)pinning anonymous pages that we always deal with
exclusive anonymous pages, which guarantees that we'll have a reliable
pin, meaning that we cannot end up with the GUP pin being inconsistent
with the pages mapped into the page tables due to a COW triggered
by a write fault.

When pinning pages, after conditionally triggering GUP unsharing of
possibly shared anonymous pages, we should always only see exclusive
anonymous pages. Note that anonymous pages that are mapped writable
must be marked exclusive, otherwise we'd have a BUG.

When pinning during ordinary GUP, simply add a check after our
conditional GUP-triggered unsharing checks. As we know exactly how the
page is mapped, we know exactly in which page we have to check for
PageAnonExclusive().

When pinning via GUP-fast we have to be careful, because we can race with
fork(): verify only after the seqcount check made sure that we didn't
race with a concurrent fork(), because such a race might have left us
pinning a possibly shared anonymous page.

Similarly, when unpinning, verify that the pages are still marked as
exclusive: otherwise something turned the pages possibly shared, which
can result in random memory corruptions that we really want to catch.

With only the pinned pages at hand and not the actual page table entries
we have to be a bit careful: hugetlb pages are always mapped via a
single logical page table entry referencing the head page and
PG_anon_exclusive of the head page applies. Anon THP are a bit more
complicated, because we might have obtained the page reference either via
a PMD or a PTE -- depending on the mapping type, either PageAnonExclusive
of the head page (PMD-mapped THP) or of the tail page (PTE-mapped THP)
applies. As we don't know which, and to make our life easier, simply
check that either is set.
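
The THP rule above boils down to roughly the following predicate (sketch
only, helper name made up; the actual check is part of
sanity_check_pinned_pages() in the diff below):

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * A pinned page of an anonymous THP might have been pinned through a PMD
 * mapping (head page marked exclusive) or through a PTE mapping (tail
 * page marked exclusive), so accept either.
 */
static bool pinned_anon_thp_still_exclusive(struct page *page)
{
	struct folio *folio = page_folio(page);

	return PageAnonExclusive(&folio->page) ||
	       PageAnonExclusive(page);
}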

Take care not to verify in case we're unpinning during GUP-fast because
we detected concurrent fork(): we might stumble over an anonymous page
that is now shared.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c         | 61 +++++++++++++++++++++++++++++++++++++++++++++++-
 mm/huge_memory.c |  3 +++
 mm/hugetlb.c     |  3 +++
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6060823f9be8..2b4cd5fa7f51 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,39 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static inline void sanity_check_pinned_pages(struct page **pages,
+					     unsigned long npages)
+{
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+
+	/*
+	 * We only pin anonymous pages if they are exclusive. Once pinned, we
+	 * can no longer turn them possibly shared and PageAnonExclusive() will
+	 * stick around until the page is freed.
+	 *
+	 * We'd like to verify that our pinned anonymous pages are still mapped
+	 * exclusively. The issue with anon THP is that we don't know how
+	 * they are/were mapped when pinning them. However, for anon
+	 * THP we can assume that either the given page (PTE-mapped THP) or
+	 * the head page (PMD-mapped THP) should be PageAnonExclusive(). If
+	 * neither is the case, there is certainly something wrong.
+	 */
+	for (; npages; npages--, pages++) {
+		struct page *page = *pages;
+		struct folio *folio = page_folio(page);
+
+		if (!folio_test_anon(folio))
+			continue;
+		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
+			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
+		else
+			/* Either a PTE-mapped or a PMD-mapped THP. */
+			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
+				       !PageAnonExclusive(page), page);
+	}
+}
+
 /*
  * Return the folio with ref appropriately incremented,
  * or NULL if that failed.
@@ -204,6 +237,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
  */
 void unpin_user_page(struct page *page)
 {
+	sanity_check_pinned_pages(&page, 1);
 	gup_put_folio(page_folio(page), 1, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_page);
@@ -272,6 +306,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		return;
 	}
 
+	sanity_check_pinned_pages(pages, npages);
 	for (i = 0; i < npages; i += nr) {
 		folio = gup_folio_next(pages, npages, i, &nr);
 		/*
@@ -344,6 +379,23 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
 
+static void unpin_user_pages_lockless(struct page **pages, unsigned long npages)
+{
+	unsigned long i;
+	struct folio *folio;
+	unsigned int nr;
+
+	/*
+	 * Don't perform any sanity checks because we might have raced with
+	 * fork() and some anonymous pages might now actually be shared --
+	 * which is why we're unpinning after all.
+	 */
+	for (i = 0; i < npages; i += nr) {
+		folio = gup_folio_next(pages, npages, i, &nr);
+		gup_put_folio(folio, nr, FOLL_PIN);
+	}
+}
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
@@ -367,6 +419,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
 	if (WARN_ON(IS_ERR_VALUE(npages)))
 		return;
 
+	sanity_check_pinned_pages(pages, npages);
 	for (i = 0; i < npages; i += nr) {
 		folio = gup_folio_next(pages, npages, i, &nr);
 		gup_put_folio(folio, nr, FOLL_PIN);
@@ -510,6 +563,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		page = ERR_PTR(-EMLINK);
 		goto out;
 	}
+
+	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+		  !PageAnonExclusive(page));
+
 	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
 	if (unlikely(!try_grab_page(page, flags))) {
 		page = ERR_PTR(-ENOMEM);
@@ -2744,8 +2801,10 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 	 */
 	if (gup_flags & FOLL_PIN) {
 		if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
-			unpin_user_pages(pages, nr_pinned);
+			unpin_user_pages_lockless(pages, nr_pinned);
 			return 0;
+		} else {
+			sanity_check_pinned_pages(pages, nr_pinned);
 		}
 	}
 	return nr_pinned;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2dc820e8c873..b32774f289d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1392,6 +1392,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
 
+	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+		  !PageAnonExclusive(page));
+
 	if (!try_grab_page(page, flags))
 		return ERR_PTR(-ENOMEM);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 21f2ec446117..48740e6c3476 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6097,6 +6097,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 
+		VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+			  !PageAnonExclusive(page));
+
 		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages
  2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
                   ` (15 preceding siblings ...)
  2022-03-29 16:04 ` [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning David Hildenbrand
@ 2022-03-29 16:09 ` David Hildenbrand
  16 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm, Khalid Aziz

On 29.03.22 18:04, David Hildenbrand wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on latest linus/master (post v5.17, with relevant core-MM changes for
> v5.18-rc1).
> 
> v3 is located at:
> 	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v3
> 
> 
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the page
> to then replacing the anonymous page by a copy in the page table: The
> GUP pin lost synchronicity with the pages mapped into the page tables.
> 
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
>   3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
> 
>   page_maybe_dma_pinned() is used to check if a page may be pinned for
>   DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
>   tolerable, false negatives are problematic: pages that are pinned for
>   DMA must not be added to the swapcache. If it happens, the (now pinned)
>   page could be faulted back from the swapcache into page tables
>   read-only. Future write-access would detect the pinning and COW the
>   page, losing synchronicity. For the interested reader, this is nicely
>   documented in feb889fb40fa ("mm: don't put pinned pages into the swap
>   cache").
> 
>   Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
>   cases and can result in a violation of the documented semantics:
>   giving false negatives because of the race.
> 
>   There are cases where we call it without properly taking a per-process
>   sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
>   one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
>   handle, there is especially one rmap case (shrink_page_list) that's hard
>   to fix: in the rmap world, we're not limited to a single process.
> 
>   The shrink_page_list() issue is really subtle. If we race with
>   someone pinning a page, we can trigger the same issue as in the FOLL_GET
>   case. See the detail section at the end of this mail on a discussion how
>   bad this can bite us with VFIO or other FOLL_PIN user.
> 
>   It's harder to reproduce, but I managed to modify the O_DIRECT
>   reproducer to use io_uring fixed buffers [15] instead, which ends up
>   using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
>   similarly trigger a loss of synchronicity and consequently a memory
>   corruption.
> 
>   Again, the root issue is that a write-fault on a page that has
>   additional references results in a COW and thereby a loss of
>   synchronicity and consequently a memory corruption if two parties
>   believe they are referencing the same page.
> "
> 
> This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
> especially also taking care of concurrent pinning via GUP-fast,
> for example, also fully fixing an issue reported regarding NUMA
> balancing [4] recently. While doing that, it further reduces "unnecessary
> COWs", especially when we don't fork()/KSM and don't swapout, and fixes the
> COW security for hugetlb for FOLL_PIN.
> 
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault handler.
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have
> to be marked shared -- which will fail if there are GUP pins on the page.
> GUP is only allowed to take a pin on anonymous pages that are exclusive.
> The PT lock is the primary mechanism to synchronize modifications of
> PG_anon_exclusive. We synchronize against GUP-fast either via the
> src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of
> the relevant page table entry.
> 
> Special care has to be taken about swap, migration, and THPs (whereby a
> PMD-mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most
> magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE
> logic as part of our previous approach [2], however, it's now 100% mapcount
> free and I further simplified it a bit.
> 
>   #1 is a fix
>   #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
>   #11 introduces PG_anon_exclusive
>   #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
>    reliable
>   #13 is a preparation for reliable R/O pins
>   #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins to
>    make R/O pins of anonymous pages reliable
>   #16 adds sanity check when (un)pinning anonymous pages
> 
> 
> [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
> [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
> [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
> [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
> 
> 
> v2 -> v3:
> * Note 1: Left the terminology "unshare" in place for now instead of
>   switching to "make anon exclusive".
> * Note 2: We might have to tackle undoing effects of arch_unmap_one() on
>   sparc, to free some tag memory immediately instead of when tearing down
>   the vma/mm; looks like this needs more care either way, so I'll ignore it
>   for now.
> * Rebased on top of core MM changes for v5.18-rc1 (most conflicts were due
>   to folio and ZONE_DEVICE migration rework). No severe changes were
>   necessary -- mostly folio conversion and code movement.
> * Retested on aarch64, ppc64, s390x and x86_64
> * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
>   -> Missed converting one instance in restore_exclusive_pte()
> * "mm/rmap: pass rmap flags to hugepage_add_anon_rmap()"
>   -> Use "!!(flags & RMAP_EXCLUSIVE)" to avoid sparse warnings
> * "mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()"
>   -> Added, as we can trigger that now more frequently
> * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
>   -> Use subpage in VM_BUG_ON_PAGE() in try_to_migrate_one()
>   -> Move comment from folio_migrate_mapping() to folio_migrate_flags()
>      regarding PG_anon_exclusive/PG_mappedtodisk
>   -> s/int rmap_flags/rmap_t rmap_flags/ in remove_migration_pmd()
> * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
>    exclusive when (un)pinning"
>   -> Use IS_ENABLED(CONFIG_DEBUG_VM) instead of ifdef
> 
> v1 -> v2:
> * Tested on aarch64, ppc64, s390x and x86_64
> * "mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon()
>    pages"
>   -> Use PG_mappedtodisk instead of PG_slab (thanks Willy!), this simplifies
>      the patch and necessary handling a lot. Add safety BUG_ON's
>   -> Move most documentation to the patch description, to be placed in a
>      proper documentation doc in the future, once everything's in place
> * ""mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
>   -> Skip check+clearing in page_try_dup_anon_rmap(), otherwise we might
>      trigger a wrong VM_BUG_ON() for KSM pages in ClearPageAnonExclusive()
>   -> In __split_huge_pmd_locked(), call page_try_share_anon_rmap() only
>      for "anon_exclusive", otherwise we might trigger a wrong VM_BUG_ON()
>   -> In __split_huge_page_tail(), drop any remaining PG_anon_exclusive on
>      tail pages, and document why that is fine
> 
> RFC -> v1:
> * Rephrased/extended some patch descriptions+comments
> * Tested on aarch64, ppc64 and x86_64
> * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
>  -> Added
> * "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()"
>  -> Added
> * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
>  -> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily
>     dropping the PT lock.
>  -> Use "reuse" label in __do_huge_pmd_anonymous_page()
>  -> Slightly simplify logic in hugetlb_cow()
>  -> In remove_migration_pte(), remove unrelated changes around
>     page_remove_rmap()
> * "mm: support GUP-triggered unsharing of anonymous pages"
>  -> In handle_pte_fault(), trigger pte_mkdirty() only with
>     FAULT_FLAG_WRITE
>  -> In __handle_mm_fault(), extend comment regarding anonymous PUDs
> * "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
>    anonymous page"
>    -> Added unsharing logic to gup_hugepte() and gup_huge_pud()
>    -> Changed return logic in __follow_hugetlb_must_fault(), making sure
>       that "unshare" is always set
> * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
>    exclusive when (un)pinning"
>   -> Slightly simplified sanity_check_pinned_pages()
> 
> David Hildenbrand (16):
>   mm/rmap: fix missing swap_free() in try_to_unmap() after
>     arch_unmap_one() failed
>   mm/hugetlb: take src_mm->write_protect_seq in
>     copy_hugetlb_page_range()
>   mm/memory: slightly simplify copy_present_pte()
>   mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and
>     page_try_dup_anon_rmap()
>   mm/rmap: convert RMAP flags to a proper distinct rmap_t type
>   mm/rmap: remove do_page_add_anon_rmap()
>   mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
>   mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
>   mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon()
>     page exclusively
>   mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()
>   mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for
>     PageAnon() pages
>   mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
>   mm/gup: disallow follow_page(FOLL_PIN)
>   mm: support GUP-triggered unsharing of anonymous pages
>   mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
>     anonymous page
>   mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
>     exclusive when (un)pinning
> 
>  include/linux/mm.h         |  46 +++++++-
>  include/linux/mm_types.h   |   8 ++
>  include/linux/page-flags.h |  39 ++++++-
>  include/linux/rmap.h       | 118 +++++++++++++++++--
>  include/linux/swap.h       |  15 ++-
>  include/linux/swapops.h    |  25 ++++
>  kernel/events/uprobes.c    |   2 +-
>  mm/gup.c                   | 106 ++++++++++++++++-
>  mm/huge_memory.c           | 127 +++++++++++++++-----
>  mm/hugetlb.c               | 135 ++++++++++++++-------
>  mm/khugepaged.c            |   2 +-
>  mm/ksm.c                   |  15 ++-
>  mm/memory.c                | 234 +++++++++++++++++++++++--------------
>  mm/memremap.c              |   9 ++
>  mm/migrate.c               |  18 ++-
>  mm/migrate_device.c        |  23 +++-
>  mm/mprotect.c              |   8 +-
>  mm/rmap.c                  |  97 +++++++++++----
>  mm/swapfile.c              |   8 +-
>  mm/userfaultfd.c           |   2 +-
>  tools/vm/page-types.c      |   8 +-
>  21 files changed, 825 insertions(+), 220 deletions(-)
> 
> 
> base-commit: 1930a6e739c4b4a654a69164dbe39e554d228915


The effective change on v5.17 *before rebasing* compared to v2 is:


diff --git a/mm/gup.c b/mm/gup.c
index 72e39b77da10..e1de0104cb19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -48,7 +48,9 @@ static void hpage_pincount_sub(struct page *page, int refs)
 static inline void sanity_check_pinned_pages(struct page **pages,
 					     unsigned long npages)
 {
-#ifdef CONFIG_DEBUG_VM
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+
 	/*
 	 * We only pin anonymous pages if they are exclusive. Once pinned, we
 	 * can no longer turn them possibly shared and PageAnonExclusive() will
@@ -74,7 +76,6 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 			VM_BUG_ON_PAGE(!PageAnonExclusive(head) &&
 				       !PageAnonExclusive(page), page);
 	}
-#endif /* CONFIG_DEBUG_VM */
 }
 
 /* Equivalent to calling put_page() @refs times. */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0cc34addd911..c53764b0640f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2316,8 +2316,6 @@ static void unmap_page(struct page *page)
 		try_to_migrate(page, ttu_flags);
 	else
 		try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
-
-	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
 
 static void remap_page(struct page *page, unsigned int nr)
@@ -3187,7 +3185,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new)) {
-		int rmap_flags = RMAP_COMPOUND;
+		rmap_t rmap_flags = RMAP_COMPOUND;
 
 		if (!is_readable_migration_entry(entry))
 			rmap_flags |= RMAP_EXCLUSIVE;
diff --git a/mm/memory.c b/mm/memory.c
index b0b9d07a2850..1c8b52771799 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -727,7 +727,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 	 * created when the swap entry was made.
 	 */
 	if (PageAnon(page))
-		page_add_anon_rmap(page, vma, address, false);
+		page_add_anon_rmap(page, vma, address, RMAP_NONE);
 	else
 		/*
 		 * Currently device exclusive access only supports anonymous
diff --git a/mm/migrate.c b/mm/migrate.c
index 7f440d2103ce..013eb4f52fed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -384,10 +384,6 @@ int folio_migrate_mapping(struct address_space *mapping,
 		/* No turning back from here */
 		newfolio->index = folio->index;
 		newfolio->mapping = folio->mapping;
-		/*
-		 * Note: PG_anon_exclusive is always migrated via migration
-		 * entries.
-		 */
 		if (folio_test_swapbacked(folio))
 			__folio_set_swapbacked(newfolio);
 
@@ -540,6 +536,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
 		folio_set_workingset(newfolio);
 	if (folio_test_checked(folio))
 		folio_set_checked(newfolio);
+	/*
+	 * PG_anon_exclusive (-> PG_mappedtodisk) is always migrated via
+	 * migration entries. We can still have PG_anon_exclusive set on an
+	 * effectively unmapped and unreferenced first sub-pages of an
+	 * anonymous THP: we can simply copy it here via PG_mappedtodisk.
+	 */
 	if (folio_test_mappedtodisk(folio))
 		folio_set_mappedtodisk(newfolio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 9d2a7e11e8cc..c18f6d7891d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1939,7 +1939,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 				break;
 			}
 			VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
-				       !anon_exclusive, page);
+				       !anon_exclusive, subpage);
 			if (anon_exclusive &&
 			    page_try_share_anon_rmap(subpage)) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
@@ -2476,7 +2476,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
-				     flags & RMAP_EXCLUSIVE);
+				     !!(flags & RMAP_EXCLUSIVE));
 }
 
 void hugepage_add_new_anon_rmap(struct page *page,

-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 51+ messages in thread
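
A quick note on the sanity_check_pinned_pages() change above:
IS_ENABLED(CONFIG_DEBUG_VM) evaluates to a compile-time constant, so the early
return lets the compiler drop the debug-only body entirely when CONFIG_DEBUG_VM
is off, while the code stays visible and type-checked, unlike code inside an
#ifdef block. A minimal sketch of the pattern (the function and the check are
made up, not taken from the patch):

/* Sketch only: the IS_ENABLED() idiom replacing #ifdef CONFIG_DEBUG_VM. */
static inline void debug_only_checks(struct page *page)
{
	if (!IS_ENABLED(CONFIG_DEBUG_VM))
		return;		/* folded away when CONFIG_DEBUG_VM is not set */

	/* Debug-only sanity check; still parsed and type-checked. */
	WARN_ON_ONCE(PageAnon(page) && !PageAnonExclusive(page));
}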

* Re: [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed
  2022-03-29 16:04 ` [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
@ 2022-04-11 16:04   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-11 16:04 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm, Khalid Aziz

On 3/29/22 18:04, David Hildenbrand wrote:
> In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
> undo that properly via swap_free().
> 
> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/rmap.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5cb970d51f0a..07f59bc6ffc1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1637,6 +1637,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				break;
>  			}
>  			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> +				swap_free(entry);
>  				set_pte_at(mm, address, pvmw.pte, pteval);
>  				ret = false;
>  				page_vma_mapped_walk_done(&pvmw);


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range()
  2022-03-29 16:04 ` [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() David Hildenbrand
@ 2022-04-11 16:15   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-11 16:15 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's do it just like copy_page_range(), taking the seqlock and making
> sure the mmap_lock is held in write mode.
> 
> This allows adding a VM_BUG_ON to page_needs_cow_for_dma() and
> properly synchronizes cocnurrent fork() with GUP-fast of hugetlb pages,

			concurrent

> which will be relevant for further changes.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 51+ messages in thread
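
For context, the scheme being mirrored from copy_page_range() looks roughly
like the following; a simplified sketch of the assumed shape, not the actual
hunk from the patch (copy_the_range() is a placeholder for the real copy loop):

if (is_cow_mapping(src_vma->vm_flags)) {
	/*
	 * GUP-fast samples src_mm->write_protect_seq; taking the write
	 * side here lets it detect a concurrent fork() and fall back to
	 * the slow path instead of racing with the write-protection.
	 */
	mmap_assert_write_locked(src_mm);
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
}

ret = copy_the_range(dst_vma, src_vma);

if (is_cow_mapping(src_vma->vm_flags))
	raw_write_seqcount_end(&src_mm->write_protect_seq);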

* Re: [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte()
  2022-03-29 16:04 ` [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte() David Hildenbrand
@ 2022-04-11 16:38   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-11 16:38 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's move the pinning check into the caller, to simplify return code
> logic and prepare for further changes: relocating the
> page_needs_cow_for_dma() into rmap handling code.
> 
> While at it, remove the unused pte parameter and simplify the comments a
> bit.
> 
> No functional change intended.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Yeah, much better.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
  2022-03-29 16:04 ` [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() David Hildenbrand
@ 2022-04-11 18:18   ` Vlastimil Babka
  2022-04-12  8:06     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-11 18:18 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> ... and move the special check for pinned pages into
> page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous
> pages via a new pageflag, clearing it only after making sure that there
> are no GUP pins on the anonymous page.
> 
> We really only care about pins on anonymous pages, because they are
> prone to getting replaced in the COW handler once mapped R/O. For !anon
> pages in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really
> care about that, at least not that I could come up with an example.
> 
> Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
> know we're dealing with anonymous pages. Also, drop the handling of
> pinned pages from copy_huge_pud() and add a comment if ever supporting
> anonymous pages on the PUD level.
> 
> This is a preparation for tracking exclusivity of anonymous pages in
> the rmap code, and disallowing marking a page shared (-> failing to
> duplicate) if there are GUP pins on a page.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit:

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -825,7 +825,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		 */
>  		get_page(page);
>  		rss[mm_counter(page)]++;
> -		page_dup_rmap(page, false);
> +		/* Cannot fail as these pages cannot get pinned. */
> +		BUG_ON(page_try_dup_anon_rmap(page, false, src_vma));

Should we just call __page_dup_rmap() here? This block is for the
is_device_private_entry() condition, and page_try_dup_anon_rmap() can't return
-EBUSY for is_device_private_page().

>  
>  		/*
>  		 * We do not preserve soft-dirty information, because so
> @@ -921,18 +922,24 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	struct page *page;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
> -	if (page && unlikely(page_needs_cow_for_dma(src_vma, page))) {
> +	if (page && PageAnon(page)) {
>  		/*
>  		 * If this page may have been pinned by the parent process,
>  		 * copy the page immediately for the child so that we'll always
>  		 * guarantee the pinned page won't be randomly replaced in the
>  		 * future.
>  		 */
> -		return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -					 addr, rss, prealloc, page);
> +		get_page(page);
> +		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> +			/* Page maybe pinned, we have to copy. */
> +			put_page(page);
> +			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> +						 addr, rss, prealloc, page);
> +		}
> +		rss[mm_counter(page)]++;
>  	} else if (page) {
>  		get_page(page);
> -		page_dup_rmap(page, false);
> +		page_dup_file_rmap(page, false);
>  		rss[mm_counter(page)]++;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3d60823afd2d..97de2fc17f34 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -234,7 +234,7 @@ static bool remove_migration_pte(struct folio *folio,
>  			if (folio_test_anon(folio))
>  				hugepage_add_anon_rmap(new, vma, pvmw.address);
>  			else
> -				page_dup_rmap(new, true);
> +				page_dup_file_rmap(new, true);
>  			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
>  		} else
>  #endif


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
  2022-04-11 18:18   ` Vlastimil Babka
@ 2022-04-12  8:06     ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-04-12  8:06 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 11.04.22 20:18, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> ... and move the special check for pinned pages into
>> page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous
>> pages via a new pageflag, clearing it only after making sure that there
>> are no GUP pins on the anonymous page.
>>
>> We really only care about pins on anonymous pages, because they are
>> prone to getting replaced in the COW handler once mapped R/O. For !anon
>> pages in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really
>> care about that, at least not that I could come up with an example.
>>
>> Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
>> know we're dealing with anonymous pages. Also, drop the handling of
>> pinned pages from copy_huge_pud() and add a comment if ever supporting
>> anonymous pages on the PUD level.
>>
>> This is a preparation for tracking exclusivity of anonymous pages in
>> the rmap code, and disallowing marking a page shared (-> failing to
>> duplicate) if there are GUP pins on a page.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Nit:
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -825,7 +825,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>  		 */
>>  		get_page(page);
>>  		rss[mm_counter(page)]++;
>> -		page_dup_rmap(page, false);
>> +		/* Cannot fail as these pages cannot get pinned. */
>> +		BUG_ON(page_try_dup_anon_rmap(page, false, src_vma));
> 
> Should we just call __page_dup_rmap() here? This block is for the
> is_device_private_entry() condition, and page_try_dup_anon_rmap() can't return
> -EBUSY for is_device_private_page().


Hi Vlastimil,

thanks for your review!

We want to keep page_try_dup_anon_rmap() here, because we extend
page_try_dup_anon_rmap() in patch #12 to properly clear
PageAnonExclusive() if there are no GUP pins. Just like with the current
page_try_dup_anon_rmap(), that can't fail for device private pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread
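
For readers following along, the semantics relied upon in the exchange above
are roughly these; a simplified sketch of the intended behaviour of
page_try_dup_anon_rmap() (patches #4 and #12 carry the real implementation),
not a copy of the code:

/* Sketch only: fork-side duplication of an anon rmap may fail with -EBUSY. */
static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
					 struct vm_area_struct *vma)
{
	VM_BUG_ON_PAGE(!PageAnon(page), page);

	if (!PageAnonExclusive(page))
		goto dup;	/* already shared (e.g., KSM): nothing to clear */

	/* Device-private pages can never carry a GUP pin, so they never fail. */
	if (!is_device_private_page(page) && page_needs_cow_for_dma(vma, page))
		return -EBUSY;	/* maybe pinned: caller must copy the page */

	ClearPageAnonExclusive(page);
dup:
	__page_dup_rmap(page, compound);
	return 0;
}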

* Re: [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type
  2022-03-29 16:04 ` [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type David Hildenbrand
@ 2022-04-12  8:11   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  8:11 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> We want to pass the flags to more than one anon rmap function, getting
> rid of special "do_page_add_anon_rmap()". So let's pass around a distinct
> __bitwise type and refine documentation.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap()
  2022-03-29 16:04 ` [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap() David Hildenbrand
@ 2022-04-12  8:13   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  8:13 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> ... and instead convert page_add_anon_rmap() to accept flags.
> 
> Passing flags instead of bools is usually nicer either way, and we want
> to more often also pass RMAP_EXCLUSIVE in follow up patches when
> detecting that an anonymous page is exclusive: for example, when
> restoring an anonymous page from a writable migration entry.
> 
> This is a preparation for marking an anonymous page inside
> page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
  2022-03-29 16:04 ` [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap() David Hildenbrand
@ 2022-04-12  8:37   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  8:37 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for
> page_add_anon_rmap() now. RMAP_COMPOUND is implicit for hugetlb
> pages and ignored.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/rmap.h | 2 +-
>  mm/migrate.c         | 3 ++-
>  mm/rmap.c            | 9 ++++++---
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index aa734d2e2b01..f47bc937c383 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -191,7 +191,7 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
>  void page_remove_rmap(struct page *, struct vm_area_struct *,
>  		bool compound);
>  void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> -		unsigned long address);
> +		unsigned long address, rmap_t flags);
>  void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  		unsigned long address);
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 436f0ec2da03..48db9500d20e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -232,7 +232,8 @@ static bool remove_migration_pte(struct folio *folio,
>  			pte = pte_mkhuge(pte);
>  			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
>  			if (folio_test_anon(folio))
> -				hugepage_add_anon_rmap(new, vma, pvmw.address);
> +				hugepage_add_anon_rmap(new, vma, pvmw.address,
> +						       RMAP_NONE);
>  			else
>  				page_dup_file_rmap(new, true);
>  			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 71bf881da2a6..b972eb8f351b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2347,9 +2347,11 @@ void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc)
>   * The following two functions are for anonymous (private mapped) hugepages.
>   * Unlike common anonymous pages, anonymous hugepages have no accounting code
>   * and no lru code, because we handle hugepages differently from common pages.
> + *
> + * RMAP_COMPOUND is ignored.
>   */
> -void hugepage_add_anon_rmap(struct page *page,
> -			    struct vm_area_struct *vma, unsigned long address)
> +void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
> +			    unsigned long address, rmap_t flags)
>  {
>  	struct anon_vma *anon_vma = vma->anon_vma;
>  	int first;
> @@ -2359,7 +2361,8 @@ void hugepage_add_anon_rmap(struct page *page,
>  	/* address might be in next vma when migration races vma_adjust */
>  	first = atomic_inc_and_test(compound_mapcount_ptr(page));
>  	if (first)
> -		__page_set_anon_rmap(page, vma, address, 0);
> +		__page_set_anon_rmap(page, vma, address,
> +				     !!(flags & RMAP_EXCLUSIVE));
>  }
>  
>  void hugepage_add_new_anon_rmap(struct page *page,


^ permalink raw reply	[flat|nested] 51+ messages in thread
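
The "!!(flags & RMAP_EXCLUSIVE)" conversion noted in the v3 changelog is easier
to see with the type definitions in front of you; a sketch assuming rmap_t is
declared as a sparse __bitwise type along the lines of patch #5:

typedef int __bitwise rmap_t;

#define RMAP_NONE		((__force rmap_t)0)
#define RMAP_EXCLUSIVE		((__force rmap_t)BIT(0))
#define RMAP_COMPOUND		((__force rmap_t)BIT(1))

/*
 * "flags & RMAP_EXCLUSIVE" still has the restricted type rmap_t, so passing
 * it straight into a plain integer "exclusive" argument makes sparse warn
 * about mixing restricted and plain types. "!!" collapses it to a plain
 * 0/1 int, which is what __page_set_anon_rmap() expects.
 */
__page_set_anon_rmap(page, vma, address, !!(flags & RMAP_EXCLUSIVE));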

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-03-29 16:04 ` [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() David Hildenbrand
@ 2022-04-12  8:47   ` Vlastimil Babka
  2022-04-12  9:37     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  8:47 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> New anonymous pages are always mapped natively: only THP/khugepagd code

						khugepaged ^

> maps a new compound anonymous page and passes "true". Otherwise, we're
> just dealing with simple, non-compound pages.
> 
> Let's give the interface clearer semantics and document these.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit:

> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1182,19 +1182,22 @@ void page_add_anon_rmap(struct page *page,
>  }
>  
>  /**
> - * page_add_new_anon_rmap - add pte mapping to a new anonymous page
> + * page_add_new_anon_rmap - add mapping to a new anonymous page
>   * @page:	the page to add the mapping to
>   * @vma:	the vm area in which the mapping is added
>   * @address:	the user virtual address mapped
> - * @compound:	charge the page as compound or small page
> + *
> + * If it's a compound page, it is accounted as a compound page. As the page
> + * is new, it's assumed to get mapped exclusively by a single process.
>   *
>   * Same as page_add_anon_rmap but must only be called on *new* pages.
>   * This means the inc-and-test can be bypassed.
>   * Page does not have to be locked.
>   */
>  void page_add_new_anon_rmap(struct page *page,
> -	struct vm_area_struct *vma, unsigned long address, bool compound)
> +	struct vm_area_struct *vma, unsigned long address)
>  {
> +	const bool compound = PageCompound(page);
>  	int nr = compound ? thp_nr_pages(page) : 1;
>  
>  	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);

There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
!compound branch. Since compound is now determined by the same check, could
be deleted.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively
  2022-03-29 16:04 ` [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively David Hildenbrand
@ 2022-04-12  9:26   ` Vlastimil Babka
  2022-04-12  9:28     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  9:26 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> We want to mark anonymous pages exclusive, and when using
> page_move_anon_rmap() we know that we are the exclusive user, as
> properly documented. This is a preparation for marking anonymous pages
> exclusive in page_move_anon_rmap().
> 
> In both instances, we're holding page lock and are sure that we're the
> exclusive owner (page_count() == 1). hugetlb already properly uses
> page_move_anon_rmap() in the write fault handler.

Yeah, note that do_wp_page() used to call page_move_anon_rmap() always since
the latter was introduced, until commit 09854ba94c6a ("mm: do_wp_page()
simplification"). Probably not intended.

> Note that in case of a PTE-mapped THP, we'll only end up calling this
> function if the whole THP is only referenced by the single PTE mapping
> a single subpage (page_count() == 1); consequently, it's fine to modify
> the compound page mapping inside page_move_anon_rmap().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/huge_memory.c | 2 ++
>  mm/memory.c      | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4526343565a..dd16819c5edc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1317,6 +1317,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>  		try_to_free_swap(page);
>  	if (page_count(page) == 1) {
>  		pmd_t entry;
> +
> +		page_move_anon_rmap(page, vma);
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
> diff --git a/mm/memory.c b/mm/memory.c
> index 03e29c9614e0..4303c0fdcf17 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3303,6 +3303,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  		 * and the page is locked, it's dark out, and we're wearing
>  		 * sunglasses. Hit it.
>  		 */
> +		page_move_anon_rmap(page, vma);
>  		unlock_page(page);
>  		wp_page_reuse(vmf);
>  		return VM_FAULT_WRITE;


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively
  2022-04-12  9:26   ` Vlastimil Babka
@ 2022-04-12  9:28     ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-04-12  9:28 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 12.04.22 11:26, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> We want to mark anonymous pages exclusive, and when using
>> page_move_anon_rmap() we know that we are the exclusive user, as
>> properly documented. This is a preparation for marking anonymous pages
>> exclusive in page_move_anon_rmap().
>>
>> In both instances, we're holding page lock and are sure that we're the
>> exclusive owner (page_count() == 1). hugetlb already properly uses
>> page_move_anon_rmap() in the write fault handler.
> 
> Yeah, note that do_wp_page() used to call page_move_anon_rmap() always since
> the latter was introduced, until commit 09854ba94c6a ("mm: do_wp_page()
> simplification"). Probably not intended.

Yeah, it was buried underneath all that reuse_swap_page() complexity.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread
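
For reference, the callee being discussed does roughly the following once the
series is applied; a simplified sketch, not the verbatim kernel source:

void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;
	struct page *subpage = page;

	page = compound_head(page);

	/* Caller holds the page lock and proved exclusivity (page_count() == 1). */
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	VM_BUG_ON_VMA(!anon_vma, vma);

	/* Rebind the page to this VMA's anon_vma ... */
	anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
	WRITE_ONCE(page->mapping, (struct address_space *)anon_vma);
	/* ... and remember that it is exclusive to this process. */
	SetPageAnonExclusive(subpage);
}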

* Re: [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()
  2022-03-29 16:04 ` [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() David Hildenbrand
@ 2022-04-12  9:37   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-12  9:37 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> We can already theoretically fail to unmap (still having page_mapped()) in
> case arch_unmap_one() fails, which can happen on sparc. Failures to
> unmap are handled gracefully, just as if there are other references on
> the target page: freezing the refcount in split_huge_page_to_list()
> will fail if still mapped and we'll simply remap.
> 
> In commit 504e070dc08f ("mm: thp: replace DEBUG_VM BUG with VM_WARN when
> unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE,
> let's get rid of it completely now.
> 
> This is a preparation for making try_to_migrate() fail on anonymous pages
> with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more
> frequently.
> 
> Reported-by: Yang Shi <shy828301@gmail.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dd16819c5edc..70298431e128 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2241,8 +2241,6 @@ static void unmap_page(struct page *page)
>  		try_to_migrate(folio, ttu_flags);
>  	else
>  		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
> -
> -	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
>  }
>  
>  static void remap_page(struct folio *folio, unsigned long nr)


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-04-12  8:47   ` Vlastimil Babka
@ 2022-04-12  9:37     ` David Hildenbrand
  2022-04-13 12:26       ` Matthew Wilcox
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-12  9:37 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 12.04.22 10:47, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> New anonymous pages are always mapped natively: only THP/khugepagd code
> 
> 						khugepaged ^
> 
>> maps a new compound anonymous page and passes "true". Otherwise, we're
>> just dealing with simple, non-compound pages.
>>
>> Let's give the interface clearer semantics and document these.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Nit:
> 
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1182,19 +1182,22 @@ void page_add_anon_rmap(struct page *page,
>>  }
>>  
>>  /**
>> - * page_add_new_anon_rmap - add pte mapping to a new anonymous page
>> + * page_add_new_anon_rmap - add mapping to a new anonymous page
>>   * @page:	the page to add the mapping to
>>   * @vma:	the vm area in which the mapping is added
>>   * @address:	the user virtual address mapped
>> - * @compound:	charge the page as compound or small page
>> + *
>> + * If it's a compound page, it is accounted as a compound page. As the page
>> + * is new, it's assumed to get mapped exclusively by a single process.
>>   *
>>   * Same as page_add_anon_rmap but must only be called on *new* pages.
>>   * This means the inc-and-test can be bypassed.
>>   * Page does not have to be locked.
>>   */
>>  void page_add_new_anon_rmap(struct page *page,
>> -	struct vm_area_struct *vma, unsigned long address, bool compound)
>> +	struct vm_area_struct *vma, unsigned long address)
>>  {
>> +	const bool compound = PageCompound(page);
>>  	int nr = compound ? thp_nr_pages(page) : 1;
>>  
>>  	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> 
> There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
> !compound branch. Since compound is now determined by the same check, could
> be deleted.
> 

Yes, eventually we could get rid of both VM_BUG_ON_PAGE() on both
branches and add a single VM_BUG_ON_PAGE(PageTail(page), page) check on
the compound branch. (we could also make sure that we're not given a
hugetlb page)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
  2022-03-29 16:04 ` [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages David Hildenbrand
@ 2022-04-13  8:25   ` Vlastimil Babka
  2022-04-13 10:28     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-13  8:25 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared? We need that information for ordinary/single pages, hugetlb
> pages, and possibly each subpage of a THP.
> 
> Introduce a way to mark an anonymous page as exclusive, with the
> ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
> GUP pins lose consistency with the pages mapped into the page table,
> resulting in reported memory corruptions.
> 
> Most pageflags already have semantics for anonymous pages, however,
> PG_mappedtodisk should never apply to pages in the swapcache, so let's
> reuse that flag.
> 
> As PG_has_hwpoisoned also uses that flag on the second tail page of a
> compound page, convert it to PG_error instead, which is marked as
> PF_NO_TAIL, so never used for tail pages.
> 
> Use custom page flag modification functions such that we can do
> additional sanity checks. The semantics we'll put into some kernel doc
> in the future are:
> 
> "
>   PG_anon_exclusive is *usually* only expressive in combination with a
>   page table entry. Depending on the page table entry type it might
>   store the following information:
> 
>        Is what's mapped via this page table entry exclusive to the
>        single process and can be mapped writable without further
>        checks? If not, it might be shared and we might have to COW.
> 
>   For now, we only expect PTE-mapped THPs to make use of
>   PG_anon_exclusive in subpages. For other anonymous compound
>   folios (i.e., hugetlb), only the head page is logically mapped and
>   holds this information.
> 
>   For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
>   set on the head page. When replacing the PMD by a page table full
>   of PTEs, PG_anon_exclusive, if set on the head page, will be set on
>   all tail pages accordingly. Note that converting from a PTE-mapping
>   to a PMD mapping using the same compound page is currently not
>   possible and consequently doesn't require care.
> 
>   If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
>   it should only pin if the relevant PG_anon_bit is set. In that case,

					^ PG_anon_exclusive bit ?

>   the pin will be fully reliable and stay consistent with the pages
>   mapped into the page table, as the bit cannot get cleared (e.g., by
>   fork(), KSM) while the page is pinned. For anonymous pages that
>   are mapped R/W, PG_anon_exclusive can be assumed to always be set
>   because such pages cannot possibly be shared.
> 
>   The page table lock protecting the page table entry is the primary
>   synchronization mechanism for PG_anon_exclusive; GUP-fast that does
>   not take the PT lock needs special care when trying to clear the
>   flag.
> 
>   Page table entry types and PG_anon_exclusive:
>   * Present: PG_anon_exclusive applies.
>   * Swap: the information is lost. PG_anon_exclusive was cleared.
>   * Migration: the entry holds this information instead.
>                PG_anon_exclusive was cleared.
>   * Device private: PG_anon_exclusive applies.
>   * Device exclusive: PG_anon_exclusive applies.
>   * HW Poison: PG_anon_exclusive is stale and not changed.
> 
>   If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>   not allowed and the flag will stick around until the page is freed
>   and folio->mapping is cleared.

Or also if it's unpinned?

> "
> 
> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
> zapping) of page table entries, page freeing code will handle that when
> it also invalidates page->mapping to not indicate PageAnon() anymore.
> Letting information about exclusivity stick around will be an important
> property when adding sanity checks to unpinning code.
> 
> Note that we properly clear the flag in free_pages_prepare() via
> PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
> so there is no need to manually clear the flag.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3663,6 +3663,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		goto out_nomap;
>  	}
>  
> +	/*
> +	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
> +	 * must never point at an anonymous page in the swapcache that is
> +	 * PG_anon_exclusive. Sanity check that this holds and especially, that
> +	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
> +	 * check after taking the PT lock and making sure that nobody
> +	 * concurrently faulted in this page and set PG_anon_exclusive.
> +	 */
> +	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
> +	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
> +

Hmm, dunno why not VM_BUG_ON?

>  	/*
>  	 * Remove the swap entry and conditionally try to free up the swapcache.
>  	 * We're already holding a reference on the page but haven't mapped it
> diff --git a/mm/memremap.c b/mm/memremap.c
> index af0223605e69..4264f78299a8 100644

^ permalink raw reply	[flat|nested] 51+ messages in thread
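
The "custom page flag modification functions" mentioned in the description take
roughly the following shape; a sketch assuming PG_anon_exclusive aliases
PG_mappedtodisk as described, not the exact kernel source:

/* Assumed alias in enum pageflags: PG_anon_exclusive = PG_mappedtodisk. */
static __always_inline void SetPageAnonExclusive(struct page *page)
{
	/* Only anonymous, non-KSM pages; for hugetlb only the head page. */
	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
	set_bit(PG_anon_exclusive, &page->flags);
}

static __always_inline int PageAnonExclusive(struct page *page)
{
	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
	return test_bit(PG_anon_exclusive, &page->flags);
}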

* Re: [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
  2022-04-13  8:25   ` Vlastimil Babka
@ 2022-04-13 10:28     ` David Hildenbrand
  2022-04-13 14:55       ` Vlastimil Babka
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-13 10:28 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 13.04.22 10:25, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> The basic question we would like to have a reliable and efficient answer
>> to is: is this anonymous page exclusive to a single process or might it
>> be shared? We need that information for ordinary/single pages, hugetlb
>> pages, and possibly each subpage of a THP.
>>
>> Introduce a way to mark an anonymous page as exclusive, with the
>> ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
>> GUP pins lose consistency with the pages mapped into the page table,
>> resulting in reported memory corruptions.
>>
>> Most pageflags already have semantics for anonymous pages, however,
>> PG_mappedtodisk should never apply to pages in the swapcache, so let's
>> reuse that flag.
>>
>> As PG_has_hwpoisoned also uses that flag on the second tail page of a
>> compound page, convert it to PG_error instead, which is marked as
>> PF_NO_TAIL, so never used for tail pages.
>>
>> Use custom page flag modification functions such that we can do
>> additional sanity checks. The semantics we'll put into some kernel doc
>> in the future are:
>>
>> "
>>   PG_anon_exclusive is *usually* only expressive in combination with a
>>   page table entry. Depending on the page table entry type it might
>>   store the following information:
>>
>>        Is what's mapped via this page table entry exclusive to the
>>        single process and can be mapped writable without further
>>        checks? If not, it might be shared and we might have to COW.
>>
>>   For now, we only expect PTE-mapped THPs to make use of
>>   PG_anon_exclusive in subpages. For other anonymous compound
>>   folios (i.e., hugetlb), only the head page is logically mapped and
>>   holds this information.
>>
>>   For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
>>   set on the head page. When replacing the PMD by a page table full
>>   of PTEs, PG_anon_exclusive, if set on the head page, will be set on
>>   all tail pages accordingly. Note that converting from a PTE-mapping
>>   to a PMD mapping using the same compound page is currently not
>>   possible and consequently doesn't require care.
>>
>>   If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
>>   it should only pin if the relevant PG_anon_bit is set. In that case,
> 
> 					^ PG_anon_exclusive bit ?
> 
>>   the pin will be fully reliable and stay consistent with the pages
>>   mapped into the page table, as the bit cannot get cleared (e.g., by
>>   fork(), KSM) while the page is pinned. For anonymous pages that
>>   are mapped R/W, PG_anon_exclusive can be assumed to always be set
>>   because such pages cannot possibly be shared.
>>
>>   The page table lock protecting the page table entry is the primary
>>   synchronization mechanism for PG_anon_exclusive; GUP-fast that does
>>   not take the PT lock needs special care when trying to clear the
>>   flag.
>>
>>   Page table entry types and PG_anon_exclusive:
>>   * Present: PG_anon_exclusive applies.
>>   * Swap: the information is lost. PG_anon_exclusive was cleared.
>>   * Migration: the entry holds this information instead.
>>                PG_anon_exclusive was cleared.
>>   * Device private: PG_anon_exclusive applies.
>>   * Device exclusive: PG_anon_exclusive applies.
>>   * HW Poison: PG_anon_exclusive is stale and not changed.
>>
>>   If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>>   not allowed and the flag will stick around until the page is freed
>>   and folio->mapping is cleared.
> 
> Or also if it's unpinned?

I'm afraid I didn't get your question. Once the page is no longer
pinned, we can succeed in clearing PG_anon_exclusive (just like pinning
never happened). Does that answer your question?

> 
>> "
>>
>> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
>> zapping) of page table entries, page freeing code will handle that when
>> it also invalidates page->mapping to not indicate PageAnon() anymore.
>> Letting information about exclusivity stick around will be an important
>> property when adding sanity checks to unpinning code.
>>
>> Note that we properly clear the flag in free_pages_prepare() via
>> PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
>> so there is no need to manually clear the flag.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3663,6 +3663,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  		goto out_nomap;
>>  	}
>>  
>> +	/*
>> +	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>> +	 * must never point at an anonymous page in the swapcache that is
>> +	 * PG_anon_exclusive. Sanity check that this holds and especially, that
>> +	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
>> +	 * check after taking the PT lock and making sure that nobody
>> +	 * concurrently faulted in this page and set PG_anon_exclusive.
>> +	 */
>> +	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>> +	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>> +
> 
> Hmm, dunno why not VM_BUG_ON?

Getting PageAnonExclusive accidentally set by a file system would result
in an extremely unpleasant security issue. I most surely want to catch
something like that in any case, especially in the foreseeable future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread
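
The rule that the flag must not be cleared while the page may be pinned is
enforced at the points where exclusivity could be given up (e.g. KSM, or
converting a present PTE into a swap or migration entry); again only a
simplified sketch of the assumed semantics, not the code from the series:

static inline int page_try_share_anon_rmap(struct page *page)
{
	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);

	/* Device-private pages cannot be GUP-pinned; everything else might be. */
	if (!is_device_private_page(page) && page_maybe_dma_pinned(page))
		return -EBUSY;	/* keep it exclusive, caller must back off */

	ClearPageAnonExclusive(page);
	return 0;
}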

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-04-12  9:37     ` David Hildenbrand
@ 2022-04-13 12:26       ` Matthew Wilcox
  2022-04-13 12:28         ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2022-04-13 12:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, linux-kernel, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport, Yang Shi,
	Kirill A . Shutemov, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm

On Tue, Apr 12, 2022 at 11:37:09AM +0200, David Hildenbrand wrote:
> On 12.04.22 10:47, Vlastimil Babka wrote:
> > There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
> > !compound branch. Since compound is now determined by the same check, could
> > be deleted.
> 
> Yes, eventually we could get rid of both VM_BUG_ON_PAGE() on both
> branches and add a single VM_BUG_ON_PAGE(PageTail(page), page) check on
> the compound branch. (we could also make sure that we're not given a
> hugetlb page)

As a rule of thumb, if you find yourself wanting to add
VM_BUG_ON_PAGE(PageTail(page), page), you probably want to change the
interface to take a folio.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-04-13 12:26       ` Matthew Wilcox
@ 2022-04-13 12:28         ` David Hildenbrand
  2022-04-13 12:48           ` Matthew Wilcox
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-13 12:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, linux-kernel, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport, Yang Shi,
	Kirill A . Shutemov, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm

On 13.04.22 14:26, Matthew Wilcox wrote:
> On Tue, Apr 12, 2022 at 11:37:09AM +0200, David Hildenbrand wrote:
>> On 12.04.22 10:47, Vlastimil Babka wrote:
>>> There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
>>> !compound branch. Since compound is now determined by the same check, could
>>> be deleted.
>>
>> Yes, eventually we could get rid of both VM_BUG_ON_PAGE() on both
>> branches and add a single VM_BUG_ON_PAGE(PageTail(page), page) check on
>> the compound branch. (we could also make sure that we're not given a
>> hugetlb page)
> 
> As a rule of thumb, if you find yourself wanting to add
> VM_BUG_ON_PAGE(PageTail(page), page), you probably want to change the
> interface to take a folio.

Yeah, I had the same in mind. Might be a reasonable addon on top --
although it would stick out in the rmap code a bit because most
functions deal with both, folios and subpages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-04-13 12:28         ` David Hildenbrand
@ 2022-04-13 12:48           ` Matthew Wilcox
  2022-04-13 16:20             ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2022-04-13 12:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, linux-kernel, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport, Yang Shi,
	Kirill A . Shutemov, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm

On Wed, Apr 13, 2022 at 02:28:38PM +0200, David Hildenbrand wrote:
> On 13.04.22 14:26, Matthew Wilcox wrote:
> > On Tue, Apr 12, 2022 at 11:37:09AM +0200, David Hildenbrand wrote:
> >> On 12.04.22 10:47, Vlastimil Babka wrote:
> >>> There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
> >>> !compound branch. Since compound is now determined by the same check, could
> >>> be deleted.
> >>
> >> Yes, eventually we could get rid of both VM_BUG_ON_PAGE() on both
> >> branches and add a single VM_BUG_ON_PAGE(PageTail(page), page) check on
> >> the compound branch. (we could also make sure that we're not given a
> >> hugetlb page)
> > 
> > As a rule of thumb, if you find yourself wanting to add
> > VM_BUG_ON_PAGE(PageTail(page), page), you probably want to change the
> > interface to take a folio.
> 
> Yeah, I had the same in mind. Might be a reasonable addon on top --
> although it would stick out in the rmap code a bit because most
> functions deal with both, folios and subpages.

I have the start of a series which starts looking at the fault path
to see where it makes sense to use folios and where it makes sense to
use pages.

We're (generally) faulting on a PTE, so we need the precise page to
be returned in vmf->page.  However vmf->cow_page can/should be a
folio (because it's definitely not a tail page).  That trickles
down into copy_present_page() (new_page and prealloc both become folios)
and so page_add_new_anon_rmap() then looks like a good target to
take a folio.

The finish_fault() -> do_set_pte() -> page_add_new_anon_rmap() looks
like the only kind of strange place where we don't necessarily have a
folio (all the others we just allocated it).

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
  2022-04-13 10:28     ` David Hildenbrand
@ 2022-04-13 14:55       ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-13 14:55 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 4/13/22 12:28, David Hildenbrand wrote:
> On 13.04.22 10:25, Vlastimil Babka wrote:
>> On 3/29/22 18:04, David Hildenbrand wrote:
>>>   the pin will be fully reliable and stay consistent with the pages
>>>   mapped into the page table, as the bit cannot get cleared (e.g., by
>>>   fork(), KSM) while the page is pinned. For anonymous pages that
>>>   are mapped R/W, PG_anon_exclusive can be assumed to always be set
>>>   because such pages cannot possibly be shared.
>>>
>>>   The page table lock protecting the page table entry is the primary
>>>   synchronization mechanism for PG_anon_exclusive; GUP-fast that does
>>>   not take the PT lock needs special care when trying to clear the
>>>   flag.
>>>
>>>   Page table entry types and PG_anon_exclusive:
>>>   * Present: PG_anon_exclusive applies.
>>>   * Swap: the information is lost. PG_anon_exclusive was cleared.
>>>   * Migration: the entry holds this information instead.
>>>                PG_anon_exclusive was cleared.
>>>   * Device private: PG_anon_exclusive applies.
>>>   * Device exclusive: PG_anon_exclusive applies.
>>>   * HW Poison: PG_anon_exclusive is stale and not changed.
>>>
>>>   If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>>>   not allowed and the flag will stick around until the page is freed
>>>   and folio->mapping is cleared.
>> 
>> Or also if it's unpinned?
> 
> I'm afraid I didn't get your question. Once the page is no longer
> pinned, we can succeed in clearing PG_anon_exclusive (just like pinning
> never happened). Does that answer your question?

Yeah it looked like a scenario that's oddly missing in that description, yet
probably obvious. Now I feel it's indeed obvious, so nevermind :)

>>> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
>>> zapping) of page table entries; page freeing code will handle that when
>>> it also invalidates page->mapping to not indicate PageAnon() anymore.
>>> Letting information about exclusivity stick around will be an important
>>> property when adding sanity checks to unpinning code.
>>>
>>> Note that we properly clear the flag in free_pages_prepare() via
>>> PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
>>> so there is no need to manually clear the flag.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> 
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks!
> 
>> 
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3663,6 +3663,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  		goto out_nomap;
>>>  	}
>>>  
>>> +	/*
>>> +	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>> +	 * must never point at an anonymous page in the swapcache that is
>>> +	 * PG_anon_exclusive. Sanity check that this holds and especially, that
>>> +	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
>>> +	 * check after taking the PT lock and making sure that nobody
>>> +	 * concurrently faulted in this page and set PG_anon_exclusive.
>>> +	 */
>>> +	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>> +	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>> +
>> 
>> Hmm, dunno why not VM_BUG_ON?
> 
> Getting PageAnonExclusive accidentally set by a file system would result
> in an extremely unpleasant security issue. I most surely want to catch
> something like that in any case, especially in the foreseeable future.

OK then.
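
(For context, the aliasing that makes this check matter, as a simplified
sketch of the idea rather than the literal patch:)

/* Simplified sketch -- not the literal patch. */
#define PG_anon_exclusive	PG_mappedtodisk

static inline bool sketch_PageAnonExclusive(struct page *page)
{
	/* Only meaningful for anonymous pages; fs code must never set it. */
	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
	return test_bit(PG_anon_exclusive, &page->flags);
}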


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
  2022-04-13 12:48           ` Matthew Wilcox
@ 2022-04-13 16:20             ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-04-13 16:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, linux-kernel, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport, Yang Shi,
	Kirill A . Shutemov, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm

On 13.04.22 14:48, Matthew Wilcox wrote:
> On Wed, Apr 13, 2022 at 02:28:38PM +0200, David Hildenbrand wrote:
>> On 13.04.22 14:26, Matthew Wilcox wrote:
>>> On Tue, Apr 12, 2022 at 11:37:09AM +0200, David Hildenbrand wrote:
>>>> On 12.04.22 10:47, Vlastimil Babka wrote:
>>>>> There's a VM_BUG_ON_PAGE(PageTransCompound(page), page); later in a
>>>>> !compound branch. Since compound is now determined by the same check, could
>>>>> be deleted.
>>>>
>>>> Yes, eventually we could get rid of both VM_BUG_ON_PAGE() on both
>>>> branches and add a single VM_BUG_ON_PAGE(PageTail(page), page) check on
>>>> the compound branch. (we could also make sure that we're not given a
>>>> hugetlb page)
>>>
>>> As a rule of thumb, if you find yourself wanting to add
>>> VM_BUG_ON_PAGE(PageTail(page), page), you probably want to change the
>>> interface to take a folio.
>>
>> Yeah, I had the same in mind. Might be a reasonable add-on on top --
>> although it would stick out in the rmap code a bit because most
>> functions deal with both folios and subpages.
> 
> I have the start of a series which starts looking at the fault path
> to see where it makes sense to use folios and where it makes sense to
> use pages.
> 
> We're (generally) faulting on a PTE, so we need the precise page to
> be returned in vmf->page.  However vmf->cow_page can/should be a
> folio (because it's definitely not a tail page).  That trickles
> down into copy_present_page() (new_page and prealloc both become folios)
> and so page_add_new_anon_rmap() then looks like a good target to
> take a folio.
> 
> The finish_fault() -> do_set_pte() -> page_add_new_anon_rmap() looks
> like the only kind of strange place where we don't necessarily have a
> folio (all the others we just allocated it).
> 

That's an interesting point. In this patch I'm assuming that we don't
have a compound page here (see below).

Which makes sense, because as the interface states "Same as
page_add_anon_rmap but must only be called on *new* pages.".

At least to me it would be weird to allocate a new compound page to then
pass a subpage to do_set_pte() -> page_add_new_anon_rmap().


And in fact, inside page_add_new_anon_rmap(compound=false) we have

/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);


which makes sure that we cannot have a compound page here, but in fact
have a folio.

So unless I am missing something, do_set_pte() should in fact have a
folio here unless BUG?
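
(For reference, with the "compound" parameter gone the function just derives
it itself -- roughly along these lines; a sketch from the discussion, not a
quote of the patch:)

void page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma,
			    unsigned long address)
{
	const bool compound = PageCompound(page);
	int nr = compound ? thp_nr_pages(page) : 1;

	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
	__SetPageSwapBacked(page);
	/* ... compound vs. !compound mapcount handling as before ... */
	__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
	__page_set_anon_rmap(page, vma, address, 1);
}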

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-03-29 16:04 ` [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
@ 2022-04-13 16:28   ` Vlastimil Babka
  2022-04-13 16:39     ` David Hildenbrand
  2022-04-13 18:29   ` Vlastimil Babka
  1 sibling, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-13 16:28 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
> exclusive, and use that information to make GUP pins reliable and stay
> consistent with the page mapped into the page table even if the
> page table entry gets write-protected.
> 
> With that information at hand, we can extend our COW logic to always
> reuse anonymous pages that are exclusive. For anonymous pages that
> might be shared, the existing logic applies.
> 
> As already documented, PG_anon_exclusive is usually only expressive in
> combination with a page table entry. Especially PTE vs. PMD-mapped
> anonymous pages require more thought, some examples: due to mremap() we
> can easily have a single compound page PTE-mapped into multiple page tables
> exclusively in a single process -- multiple page table locks apply.
> Further, due to MADV_WIPEONFORK we might not necessarily write-protect
> all PTEs, and only some subpages might be pinned. Long story short: once
> PTE-mapped, we have to track information about exclusivity per sub-page,
> but until then, we can just track it for the compound page in the head
> page without having to update a whole bunch of subpages all of the time
> for a simple PMD mapping of a THP.
> 
> For simplicity, this commit mostly talks about "anonymous pages", while
> it's for THP actually "the part of an anonymous folio referenced via
> a page table entry".
> 
> To not spill PG_anon_exclusive code all over the mm code-base, we let
> the anon rmap code handle all PG_anon_exclusive logic it can easily
> handle.
> 
> If a writable, present page table entry points at an anonymous (sub)page,
> that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
> pin (FOLL_PIN) on an anonymous page referenced via a present
> page table entry, it must only pin if PG_anon_exclusive is set for the
> mapped (sub)page.
> 
> This commit doesn't adjust GUP, so this is only implicitly handled for
> FOLL_WRITE, follow-up commits will teach GUP to also respect it for
> FOLL_PIN without !FOLL_WRITE, to make all GUP pins of anonymous pages

	   without FOLL_WRITE ?

> fully reliable.

<snip>

> @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
>  	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
>  }
>  
> +static inline int is_readable_migration_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
> +}
> +
> +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
> +}

This one seems to be missing a !CONFIG_MIGRATION counterpart. Although the
only caller __split_huge_pmd_locked() probably indirectly only exists with
CONFIG_MIGRATION so it's not an immediate issue.  (THP selects COMPACTION
selects MIGRATION)
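
(If it ever becomes necessary, the counterpart would presumably just be the
usual dummy next to the existing !CONFIG_MIGRATION stubs -- a sketch:)

/* Sketch of the missing !CONFIG_MIGRATION dummy, mirroring the existing
 * stubs in <linux/swapops.h>. */
static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
{
	return 0;
}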

<snip>

> @@ -3035,10 +3083,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  
>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>  	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
> +
> +	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> +	if (anon_exclusive && page_try_share_anon_rmap(page)) {
> +		set_pmd_at(mm, address, pvmw->pmd, pmdval);
> +		return;

I am admittedly not too familiar with this code, but looks like this means
we fail to migrate the THP, right? But we don't seem to be telling the
caller, which is try_to_migrate_one(), so it will continue and not terminate
the walk and return false?

> +	}
> +
>  	if (pmd_dirty(pmdval))
>  		set_page_dirty(page);
>  	if (pmd_write(pmdval))
>  		entry = make_writable_migration_entry(page_to_pfn(page));
> +	else if (anon_exclusive)
> +		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
>  	else
>  		entry = make_readable_migration_entry(page_to_pfn(page));
>  	pmdswp = swp_entry_to_pmd(entry);

<snip>

> @@ -1918,6 +1955,15 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> +			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
> +				       !anon_exclusive, subpage);
> +			if (anon_exclusive &&
> +			    page_try_share_anon_rmap(subpage)) {
> +				set_pte_at(mm, address, pvmw.pte, pteval);
> +				ret = false;
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
> +			}

Yeah for the PTE version it seems to do what I'd expect.

>  			/*
>  			 * Store the pfn of the page in a special migration

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-04-13 16:28   ` Vlastimil Babka
@ 2022-04-13 16:39     ` David Hildenbrand
  2022-04-13 18:28       ` Vlastimil Babka
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-13 16:39 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 13.04.22 18:28, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
>> exclusive, and use that information to make GUP pins reliable and stay
>> consistent with the page mapped into the page table even if the
>> page table entry gets write-protected.
>>
>> With that information at hand, we can extend our COW logic to always
>> reuse anonymous pages that are exclusive. For anonymous pages that
>> might be shared, the existing logic applies.
>>
>> As already documented, PG_anon_exclusive is usually only expressive in
>> combination with a page table entry. Especially PTE vs. PMD-mapped
>> anonymous pages require more thought, some examples: due to mremap() we
>> can easily have a single compound page PTE-mapped into multiple page tables
>> exclusively in a single process -- multiple page table locks apply.
>> Further, due to MADV_WIPEONFORK we might not necessarily write-protect
>> all PTEs, and only some subpages might be pinned. Long story short: once
>> PTE-mapped, we have to track information about exclusivity per sub-page,
>> but until then, we can just track it for the compound page in the head
>> page without having to update a whole bunch of subpages all of the time
>> for a simple PMD mapping of a THP.
>>
>> For simplicity, this commit mostly talks about "anonymous pages", while
>> it's for THP actually "the part of an anonymous folio referenced via
>> a page table entry".
>>
>> To not spill PG_anon_exclusive code all over the mm code-base, we let
>> the anon rmap code handle all PG_anon_exclusive logic it can easily
>> handle.
>>
>> If a writable, present page table entry points at an anonymous (sub)page,
>> that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
>> pin (FOLL_PIN) on an anonymous page referenced via a present
>> page table entry, it must only pin if PG_anon_exclusive is set for the
>> mapped (sub)page.
>>
>> This commit doesn't adjust GUP, so this is only implicitly handled for
>> FOLL_WRITE, follow-up commits will teach GUP to also respect it for
>> FOLL_PIN without !FOLL_WRITE, to make all GUP pins of anonymous pages
> 
> 	   without FOLL_WRITE ?

Indeed, thanks.

> 
>> fully reliable.
> 
> <snip>
> 
>> @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
>>  	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
>>  }
>>  
>> +static inline int is_readable_migration_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
>> +}
>> +
>> +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
>> +}
> 
> This one seems to be missing a !CONFIG_MIGRATION counterpart. Although the
> only caller __split_huge_pmd_locked() probably indirectly only exists with
> CONFIG_MIGRATION so it's not an immediate issue.  (THP selects COMPACTION
> selects MIGRATION)

So far no builds bailed out. And yes, I think it's for the reason
stated. THP without compaction would be a lost bet.

> 
> <snip>
> 
>> @@ -3035,10 +3083,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>  
>>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>>  	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> +
>> +	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>> +	if (anon_exclusive && page_try_share_anon_rmap(page)) {
>> +		set_pmd_at(mm, address, pvmw->pmd, pmdval);
>> +		return;
> 
> I am admittedly not too familiar with this code, but looks like this means
> we fail to migrate the THP, right? But we don't seem to be telling the
> caller, which is try_to_migrate_one(), so it will continue and not terminate
> the walk and return false?

Right, we're not returning "false". Returning "false" would be an
optimization to make rmap_walk_anon() fail faster.

But, after all, the THP is exclusive (-> single mapping), so
anon_vma_interval_tree_foreach() would most probably not have a lot of work
to do either way, I'd assume?

In any case, once we return from try_to_migrate(), the page will still
be mapped.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-04-13 16:39     ` David Hildenbrand
@ 2022-04-13 18:28       ` Vlastimil Babka
  2022-04-19 16:46         ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-13 18:28 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 4/13/22 18:39, David Hildenbrand wrote:
>>> @@ -3035,10 +3083,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>>  
>>>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>>>  	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>>> +
>>> +	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>> +	if (anon_exclusive && page_try_share_anon_rmap(page)) {
>>> +		set_pmd_at(mm, address, pvmw->pmd, pmdval);
>>> +		return;
>> 
>> I am admittedly not too familiar with this code, but looks like this means
>> we fail to migrate the THP, right? But we don't seem to be telling the
>> caller, which is try_to_migrate_one(), so it will continue and not terminate
>> the walk and return false?
> 
> Right, we're not returning "false". Returning "false" would be an
> optimization to make rmap_walk_anon() fail faster.

Ah right, that's what I missed, it's an optimization and we will realize
elsewhere afterwards that the page still has mappings and we can't migrate...

> But, after all, the THP is exclusive (-> single mapping), so
> anon_vma_interval_tree_foreach() would most probably not have a lot of work
> to do either way, I'd assume?
> 
> In any case, once we return from try_to_migrate(), the page will still
> be mapped.
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-03-29 16:04 ` [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
  2022-04-13 16:28   ` Vlastimil Babka
@ 2022-04-13 18:29   ` Vlastimil Babka
  1 sibling, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-13 18:29 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
> exclusive, and use that information to make GUP pins reliable and stay
> consistent with the page mapped into the page table even if the
> page table entry gets write-protected.
> 
> With that information at hand, we can extend our COW logic to always
> reuse anonymous pages that are exclusive. For anonymous pages that
> might be shared, the existing logic applies.
> 
> As already documented, PG_anon_exclusive is usually only expressive in
> combination with a page table entry. Especially PTE vs. PMD-mapped
> anonymous pages require more thought, some examples: due to mremap() we
> can easily have a single compound page PTE-mapped into multiple page tables
> exclusively in a single process -- multiple page table locks apply.
> Further, due to MADV_WIPEONFORK we might not necessarily write-protect
> all PTEs, and only some subpages might be pinned. Long story short: once
> PTE-mapped, we have to track information about exclusivity per sub-page,
> but until then, we can just track it for the compound page in the head
> page without having to update a whole bunch of subpages all of the time
> for a simple PMD mapping of a THP.
> 
> For simplicity, this commit mostly talks about "anonymous pages", while
> it's for THP actually "the part of an anonymous folio referenced via
> a page table entry".
> 
> To not spill PG_anon_exclusive code all over the mm code-base, we let
> the anon rmap code handle all PG_anon_exclusive logic it can easily
> handle.
> 
> If a writable, present page table entry points at an anonymous (sub)page,
> that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
> pin (FOLL_PIN) on an anonymous page referenced via a present
> page table entry, it must only pin if PG_anon_exclusive is set for the
> mapped (sub)page.
> 
> This commit doesn't adjust GUP, so this is only implicitly handled for
> FOLL_WRITE, follow-up commits will teach GUP to also respect it for
> FOLL_PIN without !FOLL_WRITE, to make all GUP pins of anonymous pages
> fully reliable.
> 
> Whenever an anonymous page is to be shared (fork(), KSM), or when
> temporarily unmapping an anonymous page (swap, migration), the relevant
> PG_anon_exclusive bit has to be cleared to mark the anonymous page
> possibly shared. Clearing will fail if there are GUP pins on the page:
> * For fork(), this means having to copy the page and not being able to
>   share it. fork() protects against concurrent GUP using the PT lock and
>   the src_mm->write_protect_seq.
> * For KSM, this means sharing will fail. For swap, this means unmapping
>   will fail. For migration, this means migration will fail early. All
>   three cases protect against concurrent GUP using the PT lock and a
>   proper clear/invalidate+flush of the relevant page table entry.
> 
> This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
> pinned page gets mapped R/O and the successive write fault ends up
> replacing the page instead of reusing it. It improves the situation for
> O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN,
> if fork() is *not* involved; however, swapout and fork() are still
> problematic. Properly using FOLL_PIN instead of FOLL_GET for these
> GUP users will fix the issue for them.
> 
> I. Details about basic handling
> 
> I.1. Fresh anonymous pages
> 
> page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
> given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
> the mechanism fresh anonymous pages come into life (besides migration
> code where we copy the page->mapping), all fresh anonymous pages will
> start out as exclusive.
> 
> I.2. COW reuse handling of anonymous pages
> 
> When a COW handler stumbles over a (sub)page that's marked exclusive, it
> simply reuses it. Otherwise, the handler tries harder under page lock to
> detect if the (sub)page is exclusive and can be reused. If exclusive,
> page_move_anon_rmap() will mark the given (sub)page exclusive.
> 
> Note that hugetlb code does not yet check for PageAnonExclusive(), as it
> still uses the old COW logic that is prone to the COW security issue
> because hugetlb code cannot really tolerate unnecessary/wrong COW as
> huge pages are a scarce resource.
> 
> I.3. Migration handling
> 
> try_to_migrate() has to try marking an exclusive anonymous page shared
> via page_try_share_anon_rmap(). If it fails because there are GUP pins
> on the page, unmap fails. migrate_vma_collect_pmd() and
> __split_huge_pmd_locked() are handled similarly.
> 
> Writable migration entries implicitly point at shared anonymous pages.
> For readable migration entries that information is stored via a new
> "readable-exclusive" migration entry, specific to anonymous pages.
> 
> When restoring a migration entry in remove_migration_pte(), information
> about exclusivity is detected via the migration entry type, and
> RMAP_EXCLUSIVE is set accordingly for
> page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
> information.
> 
> I.4. Swapout handling
> 
> try_to_unmap() has to try marking the mapped page possibly shared via
> page_try_share_anon_rmap(). If it fails because there are GUP pins on the
> page, unmap fails. For now, information about exclusivity is lost. In the
> future, we might want to remember that information in the swap entry in
> some cases, however, it requires more thought, care, and a way to store
> that information in swap entries.
> 
> I.5. Swapin handling
> 
> do_swap_page() will never stumble over exclusive anonymous pages in the
> swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
> to detect manually if an anonymous page is exclusive and has to set
> RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
> 
> I.6. THP handling
> 
> __split_huge_pmd_locked() has to move the information about exclusivity
> from the PMD to the PTEs.
> 
> a) In case we have a readable-exclusive PMD migration entry, simply insert
> readable-exclusive PTE migration entries.
> 
> b) In case we have a present PMD entry and we don't want to freeze
> ("convert to migration entries"), simply forward PG_anon_exclusive to
> all sub-pages, no need to temporarily clear the bit.
> 
> c) In case we have a present PMD entry and want to freeze, handle it
> similar to try_to_migrate(): try marking the page shared first. In case
> we fail, we ignore the "freeze" instruction and simply split ordinarily.
> try_to_migrate() will properly fail because the THP is still mapped via
> PTEs.
> 
> When splitting a compound anonymous folio (THP), the information about
> exclusivity is implicitly handled via the migration entries: no need to
> replicate PG_anon_exclusive manually.
> 
> I.7. fork() handling
> 
> fork() handling is relatively easy, because PG_anon_exclusive is only
> expressive for some page table entry types.
> 
> a) Present anonymous pages
> 
> page_try_dup_anon_rmap() will mark the given subpage shared -- which
> will fail if the page is pinned. If it failed, we have to copy (or
> PTE-map a PMD to handle it on the PTE level).
> 
> Note that device exclusive entries are just a pointer at a PageAnon()
> page. fork() will first convert a device exclusive entry to a present
> page table and handle it just like present anonymous pages.
> 
> b) Device private entry
> 
> Device private entries point at PageAnon() pages that cannot be mapped
> directly and, therefore, cannot get pinned.
> 
> page_try_dup_anon_rmap() will mark the given subpage shared, which
> cannot fail because they cannot get pinned.
> 
> c) HW poison entries
> 
> PG_anon_exclusive will remain untouched and is stale -- the page table
> entry is just a placeholder after all.
> 
> d) Migration entries
> 
> Writable and readable-exclusive entries are converted to readable
> entries: possibly shared.
> 
> I.8. mprotect() handling
> 
> mprotect() only has to properly handle the new readable-exclusive
> migration entry:
> 
> When write-protecting a migration entry that points at an anonymous
> page, remember the information about exclusivity via the
> "readable-exclusive" migration entry type.
> 
> II. Migration and GUP-fast
> 
> Whenever replacing a present page table entry that maps an exclusive
> anonymous page by a migration entry, we have to mark the page possibly
> shared and synchronize against GUP-fast by a proper
> clear/invalidate+flush to make the following scenario impossible:
> 
> 1. try_to_migrate() places a migration entry after checking for GUP pins
>    and marks the page possibly shared.
> 2. GUP-fast pins the page due to lack of synchronization
> 3. fork() converts the "writable/readable-exclusive" migration entry into a
>    readable migration entry
> 4. Migration fails due to the GUP pin (failing to freeze the refcount)
> 5. Migration entries are restored. PG_anon_exclusive is lost
> 
> -> We have a pinned page that is not marked exclusive anymore.
> 
> Note that we move information about exclusivity from the page to the
> migration entry as it otherwise highly overcomplicates fork() and
> PTE-mapping a THP.
> 
> III. Swapout and GUP-fast
> 
> Whenever replacing a present page table entry that maps an exclusive
> anonymous page by a swap entry, we have to mark the page possibly
> shared and synchronize against GUP-fast by a proper
> clear/invalidate+flush to make the following scenario impossible:
> 
> 1. try_to_unmap() places a swap entry after checking for GUP pins and
>    clears exclusivity information on the page.
> 2. GUP-fast pins the page due to lack of synchronization.
> 
> -> We have a pinned page that is not marked exclusive anymore.
> 
> If we'd ever store information about exclusivity in the swap entry,
> similar to migration handling, the same considerations as in II would
> apply. This is future work.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
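
(The "clearing fails if there are GUP pins" rule above essentially boils down
to something like the following -- a simplified sketch from reading the
series, not the literal hunk:)

/* Simplified sketch -- not the literal patch. */
static inline int page_try_share_anon_rmap(struct page *page)
{
	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);

	/* A GUP pin means we must not mark the page possibly shared. */
	if (unlikely(page_maybe_dma_pinned(page)))
		return -EBUSY;

	ClearPageAnonExclusive(page);
	return 0;
}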

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN)
  2022-03-29 16:04 ` [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN) David Hildenbrand
@ 2022-04-14 15:18   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-14 15:18 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> We want to change the way we handle R/O pins on anonymous pages that
> might be shared: if we detect a possibly shared anonymous page --
> mapped R/O and not !PageAnonExclusive() -- we want to trigger unsharing
> via a page fault, resulting in an exclusive anonymous page that can be
> pinned reliably without getting replaced via COW on the next write
> fault.
> 
> However, the required page fault will be problematic for follow_page():
> in contrast to ordinary GUP, follow_page() doesn't trigger faults
> internally. So we would have to end up failing a R/O pin via
> follow_page(), although there is something mapped R/O into the page
> table, which might be rather surprising.
> 
> We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
> internal MM function. Let's just make our life easier and the semantics of
> follow_page() clearer by just disallowing FOLL_PIN for follow_page()
> completely.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
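
(The change presumably amounts to an early bail-out in follow_page() itself;
a sketch of the idea, with the exact placement and form assumed rather than
quoted from the patch:)

struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
			 unsigned int foll_flags)
{
	struct follow_page_context ctx = { NULL };
	struct page *page;

	/* follow_page() is an internal helper: simply refuse FOLL_PIN. */
	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
		return NULL;

	page = follow_page_mask(vma, address, foll_flags, &ctx);
	if (ctx.pgmap)
		put_dev_pagemap(ctx.pgmap);
	return page;
}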

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages
  2022-03-29 16:04 ` [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages David Hildenbrand
@ 2022-04-14 17:15   ` Vlastimil Babka
  2022-04-19 16:29     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-14 17:15 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Whenever GUP currently ends up taking a R/O pin on an anonymous page that
> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
> on the page table entry will end up replacing the mapped anonymous page
> due to COW, resulting in the GUP pin no longer being consistent with the
> page actually mapped into the page table.
> 
> The possible ways to deal with this situation are:
>  (1) Ignore and pin -- what we do right now.
>  (2) Fail to pin -- which would be rather surprising to callers and
>      could break user space.
>  (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
>      pins.
> 
> We want to implement 3) because it provides the clearest semantics and
> allows for checking in unpin_user_pages() and friends for possible BUGs:
> when trying to unpin a page that's no longer exclusive, clearly
> something went very wrong and might result in memory corruptions that
> might be hard to debug. So we better have a nice way to spot such
> issues.
> 
> To implement 3), we need a way for GUP to trigger unsharing:
> FAULT_FLAG_UNSHARE. FAULT_FLAG_UNSHARE is only applicable to R/O mapped
> anonymous pages and resembles COW logic during a write fault. However, in
> contrast to a write fault, GUP-triggered unsharing will, for example, still
> maintain the write protection.
> 
> Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write fault
> handlers for all applicable anonymous page types: ordinary pages, THP and
> hugetlb.
> 
> * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
>   marked exclusive in the meantime by someone else, there is nothing to do.
> * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
>   marked exclusive, it will try detecting if the process is the exclusive
>   owner. If exclusive, it can be set exclusive similar to reuse logic
>   during write faults via page_move_anon_rmap() and there is nothing
>   else to do; otherwise, we either have to copy and map a fresh,
>   anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
>   THP.
> 
> This commit is heavily based on patches by Andrea.
> 
> Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Modulo a nit and suspected logical bug below.

<snip>

> @@ -3072,6 +3082,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		 * mmu page tables (such as kvm shadow page tables), we want the
>  		 * new page to be mapped directly into the secondary page table.
>  		 */
> +		BUG_ON(unshare && pte_write(entry));
>  		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
>  		update_mmu_cache(vma, vmf->address, vmf->pte);
>  		if (old_page) {
> @@ -3121,7 +3132,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  			free_swap_cache(old_page);
>  		put_page(old_page);
>  	}
> -	return page_copied ? VM_FAULT_WRITE : 0;
> +	return page_copied && !unshare ? VM_FAULT_WRITE : 0;

Could be just me but I would prefer (page_copied && !unshare) as I rarely
see these operators together like this to remember their relative priority
very well.

>  oom_free_new:
>  	put_page(new_page);
>  oom:

<snip>

> @@ -4515,8 +4550,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  /* `inline' is required to avoid gcc 4.1.2 build error */
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>  {
> +	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> +
>  	if (vma_is_anonymous(vmf->vma)) {
> -		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> +		if (unlikely(unshare) &&

Is this condition flipped, should it be "likely(!unshare)"? As the similar
code in do_wp_page() does.

> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>  			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf);
>  	}
> @@ -4651,10 +4689,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>  		goto unlock;
>  	}
> -	if (vmf->flags & FAULT_FLAG_WRITE) {
> +	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>  		if (!pte_write(entry))
>  			return do_wp_page(vmf);
> -		entry = pte_mkdirty(entry);
> +		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
> +			entry = pte_mkdirty(entry);
>  	}
>  	entry = pte_mkyoung(entry);
>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page
  2022-03-29 16:04 ` [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page David Hildenbrand
@ 2022-04-19 15:56   ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-19 15:56 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Whenever GUP currently ends up taking a R/O pin on an anonymous page that
> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
> on the page table entry will end up replacing the mapped anonymous page
> due to COW, resulting in the GUP pin no longer being consistent with the
> page actually mapped into the page table.
> 
> The possible ways to deal with this situation are:
>  (1) Ignore and pin -- what we do right now.
>  (2) Fail to pin -- which would be rather surprising to callers and
>      could break user space.
>  (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
>      pins.
> 
> Let's implement 3) because it provides the clearest semantics and
> allows for checking in unpin_user_pages() and friends for possible BUGs:
> when trying to unpin a page that's no longer exclusive, clearly
> something went very wrong and might result in memory corruptions that
> might be hard to debug. So we better have a nice way to spot such
> issues.
> 
> This change implies that whenever user space *wrote* to a private
> mapping (IOW, we have an anonymous page mapped), GUP pins will
> always remain consistent: reliable R/O GUP pins of anonymous pages.
> 
> As a side note, this commit fixes the COW security issue for hugetlb with
> FOLL_PIN as documented in:
>   https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
> The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
> instead of FOLL_PIN.
> 
> Note that follow_huge_pmd() doesn't apply because we cannot end up in
> there with FOLL_PIN.
> 
> This commit is heavily based on prototype patches by Andrea.
> 
> Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
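
(From my reading, the per-page decision GUP makes before falling back to
FAULT_FLAG_UNSHARE is roughly the following -- a simplified sketch, not the
literal helper from the series:)

static inline bool gup_must_unshare(unsigned int flags, struct page *page)
{
	/* Only R/O FOLL_PIN is affected; FOLL_WRITE already faults and COWs. */
	if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
		return false;
	/* Only anonymous pages can be "possibly shared" in the COW sense. */
	if (!PageAnon(page))
		return false;
	/* Exclusive anonymous pages can be pinned reliably as they are. */
	return !PageAnonExclusive(page);
}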

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages
  2022-04-14 17:15   ` Vlastimil Babka
@ 2022-04-19 16:29     ` David Hildenbrand
  2022-04-19 16:31       ` Vlastimil Babka
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-19 16:29 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 14.04.22 19:15, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> Whenever GUP currently ends up taking a R/O pin on an anonymous page that
>> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
>> on the page table entry will end up replacing the mapped anonymous page
>> due to COW, resulting in the GUP pin no longer being consistent with the
>> page actually mapped into the page table.
>>
>> The possible ways to deal with this situation are:
>>  (1) Ignore and pin -- what we do right now.
>>  (2) Fail to pin -- which would be rather surprising to callers and
>>      could break user space.
>>  (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
>>      pins.
>>
>> We want to implement 3) because it provides the clearest semantics and
>> allows for checking in unpin_user_pages() and friends for possible BUGs:
>> when trying to unpin a page that's no longer exclusive, clearly
>> something went very wrong and might result in memory corruptions that
>> might be hard to debug. So we better have a nice way to spot such
>> issues.
>>
>> To implement 3), we need a way for GUP to trigger unsharing:
>> FAULT_FLAG_UNSHARE. FAULT_FLAG_UNSHARE is only applicable to R/O mapped
>> anonymous pages and resembles COW logic during a write fault. However, in
>> contrast to a write fault, GUP-triggered unsharing will, for example, still
>> maintain the write protection.
>>
>> Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write fault
>> handlers for all applicable anonymous page types: ordinary pages, THP and
>> hugetlb.
>>
>> * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
>>   marked exclusive in the meantime by someone else, there is nothing to do.
>> * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
>>   marked exclusive, it will try detecting if the process is the exclusive
>>   owner. If exclusive, it can be set exclusive similar to reuse logic
>>   during write faults via page_move_anon_rmap() and there is nothing
>>   else to do; otherwise, we either have to copy and map a fresh,
>>   anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
>>   THP.
>>
>> This commit is heavily based on patches by Andrea.
>>
>> Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Modulo a nit and suspected logical bug below.

Thanks!

>> @@ -4515,8 +4550,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>>  /* `inline' is required to avoid gcc 4.1.2 build error */
>>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>  {
>> +	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>> +
>>  	if (vma_is_anonymous(vmf->vma)) {
>> -		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>> +		if (unlikely(unshare) &&
> 
> Is this condition flipped, should it be "likely(!unshare)"? As the similar
> code in do_wp_page() does.

Good catch, this should affect uffd-wp on THP -- it wouldn't trigger as expected. Thanks a lot for finding that!

> 
>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>  		return do_huge_pmd_wp_page(vmf);
>>  	}
>> @@ -4651,10 +4689,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>  		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>  		goto unlock;
>>  	}
>> -	if (vmf->flags & FAULT_FLAG_WRITE) {
>> +	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>  		if (!pte_write(entry))
>>  			return do_wp_page(vmf);
>> -		entry = pte_mkdirty(entry);
>> +		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
>> +			entry = pte_mkdirty(entry);
>>  	}
>>  	entry = pte_mkyoung(entry);
>>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
> 


So the following on top, right?


diff --git a/mm/memory.c b/mm/memory.c
index 8b3cb73f5e44..4584c7e87a70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3137,7 +3137,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
                        free_swap_cache(old_page);
                put_page(old_page);
        }
-       return page_copied && !unshare ? VM_FAULT_WRITE : 0;
+       return (page_copied && !unshare) ? VM_FAULT_WRITE : 0;
 oom_free_new:
        put_page(new_page);
 oom:
@@ -4604,7 +4604,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
        const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 
        if (vma_is_anonymous(vmf->vma)) {
-               if (unlikely(unshare) &&
+               if (likely(!unshare) &&
                    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
                        return handle_userfault(vmf, VM_UFFD_WP);
                return do_huge_pmd_wp_page(vmf);


-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages
  2022-04-19 16:29     ` David Hildenbrand
@ 2022-04-19 16:31       ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-19 16:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 4/19/22 18:29, David Hildenbrand wrote:
>>> @@ -4515,8 +4550,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>>>  /* `inline' is required to avoid gcc 4.1.2 build error */
>>>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>>  {
>>> +	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>>> +
>>>  	if (vma_is_anonymous(vmf->vma)) {
>>> -		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>>> +		if (unlikely(unshare) &&
>> 
>> Is this condition flipped, should it be "likely(!unshare)"? As the similar
>> code in do_wp_page() does.
> 
> Good catch, this should affect uffd-wp on THP -- it wouldn't trigger as expected. Thanks a lot for finding that!

Yay, glad I was right this time.

>> 
>>> +		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>>>  			return handle_userfault(vmf, VM_UFFD_WP);
>>>  		return do_huge_pmd_wp_page(vmf);
>>>  	}
>>> @@ -4651,10 +4689,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>  		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>>  		goto unlock;
>>>  	}
>>> -	if (vmf->flags & FAULT_FLAG_WRITE) {
>>> +	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>>  		if (!pte_write(entry))
>>>  			return do_wp_page(vmf);
>>> -		entry = pte_mkdirty(entry);
>>> +		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
>>> +			entry = pte_mkdirty(entry);
>>>  	}
>>>  	entry = pte_mkyoung(entry);
>>>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>> 
> 
> 
> So the following on top, right?

Looks good!

> diff --git a/mm/memory.c b/mm/memory.c
> index 8b3cb73f5e44..4584c7e87a70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3137,7 +3137,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>                         free_swap_cache(old_page);
>                 put_page(old_page);
>         }
> -       return page_copied && !unshare ? VM_FAULT_WRITE : 0;
> +       return (page_copied && !unshare) ? VM_FAULT_WRITE : 0;
>  oom_free_new:
>         put_page(new_page);
>  oom:
> @@ -4604,7 +4604,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>         const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>  
>         if (vma_is_anonymous(vmf->vma)) {
> -               if (unlikely(unshare) &&
> +               if (likely(!unshare) &&
>                     userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>                         return handle_userfault(vmf, VM_UFFD_WP);
>                 return do_huge_pmd_wp_page(vmf);
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
  2022-04-13 18:28       ` Vlastimil Babka
@ 2022-04-19 16:46         ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2022-04-19 16:46 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 13.04.22 20:28, Vlastimil Babka wrote:
> On 4/13/22 18:39, David Hildenbrand wrote:
>>>> @@ -3035,10 +3083,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>>>>  
>>>>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>>>>  	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>>>> +
>>>> +	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>>> +	if (anon_exclusive && page_try_share_anon_rmap(page)) {
>>>> +		set_pmd_at(mm, address, pvmw->pmd, pmdval);
>>>> +		return;
>>>
>>> I am admittedly not too familiar with this code, but looks like this means
>>> we fail to migrate the THP, right? But we don't seem to be telling the
>>> caller, which is try_to_migrate_one(), so it will continue and not terminate
>>> the walk and return false?
>>
>> Right, we're not returning "false". Returning "false" would be an
>> optimization to make rmap_walk_anon() fail faster.
> 
> Ah right, that's what I missed, it's an optimization and we will realize
> elsewhere afterwards that the page still has mappings and we can't migrate...

I'll include that patch in v4 (to be tested):


From 08fb0e45404e3d0f85c2ad23a473e95053396376 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 19 Apr 2022 18:39:23 +0200
Subject: [PATCH] mm/rmap: fail try_to_migrate() early when setting a PMD
 migration entry fails

Let's fail right away in case we cannot clear PG_anon_exclusive because
the anon THP may be pinned. Right now, we continue trying to
install migration entries and the caller of try_to_migrate() will
realize that the page is still mapped and has to restore the migration
entries. Let's just fail fast just like for PTE migration entries.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swapops.h | 4 ++--
 mm/huge_memory.c        | 8 +++++---
 mm/rmap.c               | 6 +++++-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 06280fc1c99b..8b6e4cd1fab8 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -299,7 +299,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
 struct page_vma_mapped_walk;
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+extern int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page);
 
 extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
@@ -332,7 +332,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
 }
 #else
-static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
 {
 	BUILD_BUG();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c7ac1b462543..390f22334ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3080,7 +3080,7 @@ late_initcall(split_huge_pages_debugfs);
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
 {
 	struct vm_area_struct *vma = pvmw->vma;
@@ -3092,7 +3092,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	pmd_t pmdswp;
 
 	if (!(pvmw->pmd && !pvmw->pte))
-		return;
+		return 0;
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
 	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
@@ -3100,7 +3100,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 	if (anon_exclusive && page_try_share_anon_rmap(page)) {
 		set_pmd_at(mm, address, pvmw->pmd, pmdval);
-		return;
+		return -EBUSY;
 	}
 
 	if (pmd_dirty(pmdval))
@@ -3118,6 +3118,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	page_remove_rmap(page, vma, true);
 	put_page(page);
 	trace_set_migration_pmd(address, pmd_val(pmdswp));
+
+	return 0;
 }
 
 void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
diff --git a/mm/rmap.c b/mm/rmap.c
index 00418faaf4ce..68c2f61bf212 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1814,7 +1814,11 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
 
-			set_pmd_migration_entry(&pvmw, subpage);
+			if (set_pmd_migration_entry(&pvmw, subpage)) {
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
 			continue;
 		}
 #endif
-- 
2.35.1



-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
  2022-03-29 16:04 ` [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning David Hildenbrand
@ 2022-04-19 17:40   ` Vlastimil Babka
  2022-04-21  9:15     ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-19 17:40 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 3/29/22 18:04, David Hildenbrand wrote:
> Let's verify when (un)pinning anonymous pages that we always deal with
> exclusive anonymous pages, which guarantees that we'll have a reliable
> PIN, meaning that we cannot end up with the GUP pin being inconsistent
> with the pages mapped into the page tables due to a COW triggered
> by a write fault.
> 
> When pinning pages, after conditionally triggering GUP unsharing of
> possibly shared anonymous pages, we should always only see exclusive
> anonymous pages. Note that anonymous pages that are mapped writable
> must be marked exclusive, otherwise we'd have a BUG.
> 
> When pinning during ordinary GUP, simply add a check after our
> conditional GUP-triggered unsharing checks. As we know exactly how the
> page is mapped, we know exactly in which page we have to check for
> PageAnonExclusive().
> 
> When pinning via GUP-fast we have to be careful, because we can race with
> fork(): verify only after we made sure via the seqcount that we didn't
> race with a concurrent fork(), and thus didn't end up pinning a possibly
> shared anonymous page.
> 
> Similarly, when unpinning, verify that the pages are still marked as
> exclusive: otherwise something turned the pages possibly shared, which
> can result in random memory corruptions, which we really want to catch.
> 
> With only the pinned pages at hand and not the actual page table entries
> we have to be a bit careful: hugetlb pages are always mapped via a
> single logical page table entry referencing the head page and
> PG_anon_exclusive of the head page applies. Anon THP are a bit more
> complicated, because we might have obtained the page reference either via
> a PMD or a PTE -- depending on the mapping type, either PageAnonExclusive
> of the head page (PMD-mapped THP) or of the tail page (PTE-mapped THP)
> applies: as we don't know which and to make our life easier, we simply
> check that either is set.
> 
> Take care to not verify in case we're unpinning during GUP-fast because
> we detected concurrent fork(): we might stumble over an anonymous page
> that is now shared.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nits:

> @@ -510,6 +563,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		page = ERR_PTR(-EMLINK);
>  		goto out;
>  	}
> +
> +	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> +		  !PageAnonExclusive(page));

Do we rather want VM_BUG_ON_PAGE? Also for the same tests in mm/huge*.c below.

> +
>  	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
>  	if (unlikely(!try_grab_page(page, flags))) {
>  		page = ERR_PTR(-ENOMEM);
> @@ -2744,8 +2801,10 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
>  	 */
>  	if (gup_flags & FOLL_PIN) {
>  		if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
> -			unpin_user_pages(pages, nr_pinned);
> +			unpin_user_pages_lockless(pages, nr_pinned);
>  			return 0;
> +		} else {
> +			sanity_check_pinned_pages(pages, nr_pinned);
>  		}
>  	}
>  	return nr_pinned;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2dc820e8c873..b32774f289d6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1392,6 +1392,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
>  		return ERR_PTR(-EMLINK);
>  
> +	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> +		  !PageAnonExclusive(page));
> +
>  	if (!try_grab_page(page, flags))
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 21f2ec446117..48740e6c3476 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6097,6 +6097,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
>  		page = pte_page(huge_ptep_get(pte));
>  
> +		VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> +			  !PageAnonExclusive(page));
> +
>  		/*
>  		 * If subpage information not requested, update counters
>  		 * and skip the same_page loop below.


^ permalink raw reply	[flat|nested] 51+ messages in thread
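
The gup.c hunk quoted above only shows the call site of sanity_check_pinned_pages();
the helper's body is not part of the quoted context. Going purely by the description
in the commit message (skip non-anonymous pages; for hugetlb and order-0 pages the
head page's PG_anon_exclusive decides; for anon THP either the head page or the given
tail page may carry the bit), a minimal sketch of such a helper, in the style of an
mm/gup.c function, could look like the following -- an illustration of the described
check, not necessarily the exact code that was merged:

static void sanity_check_pinned_pages(struct page **pages,
				      unsigned long npages)
{
	if (!IS_ENABLED(CONFIG_DEBUG_VM))
		return;

	for (; npages; npages--, pages++) {
		struct page *page = *pages;
		struct page *head = compound_head(page);

		/* Only anonymous pages carry PG_anon_exclusive. */
		if (!PageAnon(head))
			continue;
		if (!PageCompound(head) || PageHuge(head))
			/* Order-0 page or hugetlb: the head page decides. */
			VM_BUG_ON_PAGE(!PageAnonExclusive(head), page);
		else
			/*
			 * Anon THP: pinned either via a PMD (head page) or a
			 * PTE (tail page) -- accept either flag being set.
			 */
			VM_BUG_ON_PAGE(!PageAnonExclusive(head) &&
				       !PageAnonExclusive(page), page);
	}
}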

* Re: [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
  2022-04-19 17:40   ` Vlastimil Babka
@ 2022-04-21  9:15     ` David Hildenbrand
  2022-04-22  6:54       ` Vlastimil Babka
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2022-04-21  9:15 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 19.04.22 19:40, Vlastimil Babka wrote:
> On 3/29/22 18:04, David Hildenbrand wrote:
>> Let's verify when (un)pinning anonymous pages that we always deal with
>> exclusive anonymous pages, which guarantees that we'll have a reliable
>> PIN, meaning that we cannot end up with the GUP pin being inconsistent
>> with the pages mapped into the page tables due to a COW triggered
>> by a write fault.
>>
>> When pinning pages, after conditionally triggering GUP unsharing of
>> possibly shared anonymous pages, we should always only see exclusive
>> anonymous pages. Note that anonymous pages that are mapped writable
>> must be marked exclusive, otherwise we'd have a BUG.
>>
>> When pinning during ordinary GUP, simply add a check after our
>> conditional GUP-triggered unsharing checks. As we know exactly how the
>> page is mapped, we know exactly in which page we have to check for
>> PageAnonExclusive().
>>
>> When pinning via GUP-fast we have to be careful, because we can race with
>> fork(): verify only after we made sure via the seqcount that we didn't
>> race with concurrent fork() that we didn't end up pinning a possibly
>> shared anonymous page.
>>
>> Similarly, when unpinning, verify that the pages are still marked as
>> exclusive: otherwise something turned the pages possibly shared, which
>> can result in random memory corruptions, which we really want to catch.
>>
>> With only the pinned pages at hand and not the actual page table entries
>> we have to be a bit careful: hugetlb pages are always mapped via a
>> single logical page table entry referencing the head page and
>> PG_anon_exclusive of the head page applies. Anon THPs are a bit more
>> complicated, because we might have obtained the page reference either via
>> a PMD or a PTE -- depending on the mapping type, either PageAnonExclusive
>> of the head page (PMD-mapped THP) or of the tail page (PTE-mapped THP)
>> applies: as we don't know which, and to make our life easier, check that
>> either is set.
>>
>> Take care to not verify in case we're unpinning during GUP-fast because
>> we detected concurrent fork(): we might stumble over an anonymous page
>> that is now shared.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Nits:
> 
>> @@ -510,6 +563,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>  		page = ERR_PTR(-EMLINK);
>>  		goto out;
>>  	}
>> +
>> +	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
>> +		  !PageAnonExclusive(page));
> 
> Do we rather want VM_BUG_ON_PAGE? Also for the same tests in mm/huge*.c below.

Makes sense, thanks:

diff --git a/mm/gup.c b/mm/gup.c
index 5c17d4816441..46ffd8c51c6e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -564,8 +564,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
                goto out;
        }
 
-       VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
-                 !PageAnonExclusive(page));
+       VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+                      !PageAnonExclusive(page), page);
 
        /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
        if (unlikely(!try_grab_page(page, flags))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 390f22334ee9..a2f44d8d3d47 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1392,8 +1392,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
        if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
                return ERR_PTR(-EMLINK);
 
-       VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
-                 !PageAnonExclusive(page));
+       VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+                       !PageAnonExclusive(page), page);
 
        if (!try_grab_page(page, flags))
                return ERR_PTR(-ENOMEM);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8a635b5b5270..0ba2b1930b21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6100,8 +6100,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
                page = pte_page(huge_ptep_get(pte));
 
-               VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
-                         !PageAnonExclusive(page));
+               VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+                              !PageAnonExclusive(page), page);
 
                /*
                 * If subpage information not requested, update counters



-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 51+ messages in thread
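
For context on the VM_BUG_ON_PAGE() switch above: what it buys over plain
VM_BUG_ON() is a dump_page() of the offending page before the BUG(), which is
exactly the state needed to debug a tripped exclusivity check. With
CONFIG_DEBUG_VM enabled it expands to roughly the following (simplified; the
exact definition lives in include/linux/mmdebug.h):

#define VM_BUG_ON_PAGE(cond, page)					\
	do {								\
		if (unlikely(cond)) {					\
			dump_page(page, "VM_BUG_ON_PAGE(" __stringify(cond) ")"); \
			BUG();						\
		}							\
	} while (0)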

* Re: [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning
  2022-04-21  9:15     ` David Hildenbrand
@ 2022-04-22  6:54       ` Vlastimil Babka
  0 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2022-04-22  6:54 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, linux-mm

On 4/21/22 11:15, David Hildenbrand wrote:
> On 19.04.22 19:40, Vlastimil Babka wrote:
>> On 3/29/22 18:04, David Hildenbrand wrote:
>>> Let's verify when (un)pinning anonymous pages that we always deal with
>>> exclusive anonymous pages, which guarantees that we'll have a reliable
>>> PIN, meaning that we cannot end up with the GUP pin being inconsistent
>>> with the pages mapped into the page tables due to a COW triggered
>>> by a write fault.
>>>
>>> When pinning pages, after conditionally triggering GUP unsharing of
>>> possibly shared anonymous pages, we should always only see exclusive
>>> anonymous pages. Note that anonymous pages that are mapped writable
>>> must be marked exclusive, otherwise we'd have a BUG.
>>>
>>> When pinning during ordinary GUP, simply add a check after our
>>> conditional GUP-triggered unsharing checks. As we know exactly how the
>>> page is mapped, we know exactly in which page we have to check for
>>> PageAnonExclusive().
>>>
>>> When pinning via GUP-fast we have to be careful, because we can race with
>>> fork(): verify only after we made sure via the seqcount that we didn't
>>> race with concurrent fork() that we didn't end up pinning a possibly
>>> shared anonymous page.
>>>
>>> Similarly, when unpinning, verify that the pages are still marked as
>>> exclusive: otherwise something turned the pages possibly shared, which
>>> can result in random memory corruptions, which we really want to catch.
>>>
>>> With only the pinned pages at hand and not the actual page table entries
>>> we have to be a bit careful: hugetlb pages are always mapped via a
>>> single logical page table entry referencing the head page and
>>> PG_anon_exclusive of the head page applies. Anon THPs are a bit more
>>> complicated, because we might have obtained the page reference either via
>>> a PMD or a PTE -- depending on the mapping type, either PageAnonExclusive
>>> of the head page (PMD-mapped THP) or of the tail page (PTE-mapped THP)
>>> applies: as we don't know which, and to make our life easier, check that
>>> either is set.
>>>
>>> Take care to not verify in case we're unpinning during GUP-fast because
>>> we detected concurrent fork(): we might stumble over an anonymous page
>>> that is now shared.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> 
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> 
>> Nits:
>> 
>>> @@ -510,6 +563,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>  		page = ERR_PTR(-EMLINK);
>>>  		goto out;
>>>  	}
>>> +
>>> +	VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
>>> +		  !PageAnonExclusive(page));
>> 
>> Do we rather want VM_BUG_ON_PAGE? Also for the same tests in mm/huge*.c below.
> 
> Makes sense, thanks:

LGTM

> diff --git a/mm/gup.c b/mm/gup.c
> index 5c17d4816441..46ffd8c51c6e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -564,8 +564,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>                 goto out;
>         }
>  
> -       VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> -                 !PageAnonExclusive(page));
> +       VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                      !PageAnonExclusive(page), page);
>  
>         /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
>         if (unlikely(!try_grab_page(page, flags))) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 390f22334ee9..a2f44d8d3d47 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1392,8 +1392,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>         if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
>                 return ERR_PTR(-EMLINK);
>  
> -       VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> -                 !PageAnonExclusive(page));
> +       VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                       !PageAnonExclusive(page), page);
>  
>         if (!try_grab_page(page, flags))
>                 return ERR_PTR(-ENOMEM);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8a635b5b5270..0ba2b1930b21 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6100,8 +6100,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                 pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
>                 page = pte_page(huge_ptep_get(pte));
>  
> -               VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
> -                         !PageAnonExclusive(page));
> +               VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                              !PageAnonExclusive(page), page);
>  
>                 /*
>                  * If subpage information not requested, update counters
> 
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2022-04-22  6:54 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-29 16:04 [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
2022-03-29 16:04 ` [PATCH v3 01/16] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
2022-04-11 16:04   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 02/16] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() David Hildenbrand
2022-04-11 16:15   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 03/16] mm/memory: slightly simplify copy_present_pte() David Hildenbrand
2022-04-11 16:38   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 04/16] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() David Hildenbrand
2022-04-11 18:18   ` Vlastimil Babka
2022-04-12  8:06     ` David Hildenbrand
2022-03-29 16:04 ` [PATCH v3 05/16] mm/rmap: convert RMAP flags to a proper distinct rmap_t type David Hildenbrand
2022-04-12  8:11   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 06/16] mm/rmap: remove do_page_add_anon_rmap() David Hildenbrand
2022-04-12  8:13   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 07/16] mm/rmap: pass rmap flags to hugepage_add_anon_rmap() David Hildenbrand
2022-04-12  8:37   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 08/16] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() David Hildenbrand
2022-04-12  8:47   ` Vlastimil Babka
2022-04-12  9:37     ` David Hildenbrand
2022-04-13 12:26       ` Matthew Wilcox
2022-04-13 12:28         ` David Hildenbrand
2022-04-13 12:48           ` Matthew Wilcox
2022-04-13 16:20             ` David Hildenbrand
2022-03-29 16:04 ` [PATCH v3 09/16] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively David Hildenbrand
2022-04-12  9:26   ` Vlastimil Babka
2022-04-12  9:28     ` David Hildenbrand
2022-03-29 16:04 ` [PATCH v3 10/16] mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() David Hildenbrand
2022-04-12  9:37   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 11/16] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages David Hildenbrand
2022-04-13  8:25   ` Vlastimil Babka
2022-04-13 10:28     ` David Hildenbrand
2022-04-13 14:55       ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 12/16] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
2022-04-13 16:28   ` Vlastimil Babka
2022-04-13 16:39     ` David Hildenbrand
2022-04-13 18:28       ` Vlastimil Babka
2022-04-19 16:46         ` David Hildenbrand
2022-04-13 18:29   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 13/16] mm/gup: disallow follow_page(FOLL_PIN) David Hildenbrand
2022-04-14 15:18   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 14/16] mm: support GUP-triggered unsharing of anonymous pages David Hildenbrand
2022-04-14 17:15   ` Vlastimil Babka
2022-04-19 16:29     ` David Hildenbrand
2022-04-19 16:31       ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 15/16] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page David Hildenbrand
2022-04-19 15:56   ` Vlastimil Babka
2022-03-29 16:04 ` [PATCH v3 16/16] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning David Hildenbrand
2022-04-19 17:40   ` Vlastimil Babka
2022-04-21  9:15     ` David Hildenbrand
2022-04-22  6:54       ` Vlastimil Babka
2022-03-29 16:09 ` [PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).