* [PATCH v1 0/6] mm/autonuma: replace savedwrite infrastructure
@ 2022-11-02 19:12 ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

This series is based on mm-unstable.

As discussed in my talk at LPC, we can reuse the same mechanism for
deciding whether to map a pte writable when upgrading permissions via
mprotect() -- e.g., PROT_READ -> PROT_READ|PROT_WRITE -- to replace the
savedwrite infrastructure used for NUMA hinting faults (e.g., PROT_NONE
-> PROT_READ|PROT_WRITE).

Instead of maintaining previous write permissions for a pte/pmd, we
re-determine if the pte/pmd can be writable. The big benefit is that we
have a common logic for deciding whether we can map a pte/pmd writable on
protection changes.

For private mappings, there should be no difference -- from
what I understand, that is what autonuma benchmarks care about.

I ran autonumabench on a system with 2 NUMA nodes, 96 GiB each via:
	perf stat --null --repeat 10
The numa01 benchmark is quite noisy in my environment, and I have not yet
managed to reduce the noise.

numa01:
	mm-unstable:   146.88 +- 6.54 seconds time elapsed  ( +-  4.45% )
	mm-unstable++: 147.45 +- 13.39 seconds time elapsed  ( +-  9.08% )

numa02:
	mm-unstable:   16.0300 +- 0.0624 seconds time elapsed  ( +-  0.39% )
	mm-unstable++: 16.1281 +- 0.0945 seconds time elapsed  ( +-  0.59% )

It is worth noting that for shared writable mappings that require
writenotify, we will only avoid write faults if the pte/pmd is dirty
(inherited from the older mprotect logic). If we ever care about optimizing
that further, we'd need a different mechanism to identify whether the FS
still needs to get notified on the next write access.

In any case, such an optimization would then not be autonuma-specific;
mprotect() permission upgrades would similarly benefit from it.
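The common decision this series converges on can be sketched as a small
userspace model. This is a simplification, not the kernel code: all names
(struct model_pte, model_can_map_writable) are invented for illustration, and
the real can_change_pte_writable()/can_change_pmd_writable() additionally
check softdirty and uffd-wp tracking, omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for the pte/pmd + VMA state the kernel consults. */
struct model_pte {
	bool vm_write;       /* VM_WRITE set on the VMA */
	bool vm_shared;      /* VM_SHARED mapping */
	bool protnone;       /* PROT_NONE / NUMA-hinting entry */
	bool dirty;          /* pte/pmd dirty bit */
	bool anon_exclusive; /* exclusive anonymous page */
};

bool model_can_map_writable(const struct model_pte *p)
{
	if (!p->vm_write)
		return false;
	/* Don't touch entries that are not even readable. */
	if (p->protnone)
		return false;
	/* Private: only exclusive anonymous pages are safely writable. */
	if (!p->vm_shared)
		return p->anon_exclusive;
	/* Shared: "dirty" means the FS was already write-notified. */
	return p->dirty;
}
```

For private mappings the dirty bit plays no role, which is exactly the
no-difference-for-autonuma-benchmarks claim above; for shared writenotify
mappings, a clean entry still takes a write fault.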

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>

RFC -> v1:
* "mm/mprotect: allow clean exclusive anon pages to be writable"
 -> Move comment change to patch #2
* "mm/mprotect: minor can_change_pte_writable() cleanups"
 -> Adjust comments
* "mm/huge_memory: try avoiding write faults when changing PMD protection"
 -> Fix wrong check
* "selftests/vm: anon_cow: add mprotect() optimization tests"
 -> Add basic tests for the mprotect() optimization


David Hildenbrand (5):
  mm/mprotect: minor can_change_pte_writable() cleanups
  mm/huge_memory: try avoiding write faults when changing PMD protection
  mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
  mm: remove unused savedwrite infrastructure
  selftests/vm: anon_cow: add mprotect() optimization tests

Nadav Amit (1):
  mm/mprotect: allow clean exclusive anon pages to be writable

 arch/powerpc/include/asm/book3s/64/pgtable.h | 80 +-------------------
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  2 +-
 include/linux/mm.h                           |  2 +
 include/linux/pgtable.h                      | 24 ------
 mm/debug_vm_pgtable.c                        | 32 --------
 mm/huge_memory.c                             | 66 ++++++++++++----
 mm/ksm.c                                     |  9 +--
 mm/memory.c                                  | 19 ++++-
 mm/mprotect.c                                | 33 ++++----
 tools/testing/selftests/vm/anon_cow.c        | 49 +++++++++++-
 10 files changed, 145 insertions(+), 171 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 20+ messages in thread


* [PATCH v1 1/6] mm/mprotect: allow clean exclusive anon pages to be writable
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

From: Nadav Amit <namit@vmware.com>

Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect from making them writable if they are exclusive.
Therefore, skip the dirty-bit test in this case.

Note that there are already other ways to get a writable PTE mapping an
anonymous page that is clean: for example, via MADV_FREE. In an ideal
world, we'd have a different indication from the FS whether writenotify
is still required.

Signed-off-by: Nadav Amit <namit@vmware.com>
[ return directly; update description ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/mprotect.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8d770855b591..86a28c0e190f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -46,7 +46,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 
 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
 
-	if (pte_protnone(pte) || !pte_dirty(pte))
+	if (pte_protnone(pte))
 		return false;
 
 	/* Do we need write faults for softdirty tracking? */
@@ -65,11 +65,10 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 		 * the PT lock.
 		 */
 		page = vm_normal_page(vma, addr, pte);
-		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
-			return false;
+		return page && PageAnon(page) && PageAnonExclusive(page);
 	}
 
-	return true;
+	return pte_dirty(pte);
 }
 
 static unsigned long change_pte_range(struct mmu_gather *tlb,
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread
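The behavioral change of this patch can be checked with a small userspace
truth-table model of the private-mapping branch (function names invented for
illustration; the real code operates on ptes, not bools):

```c
#include <assert.h>
#include <stdbool.h>

/* Decision before this patch: a clean pte is never upgraded. */
bool old_logic(bool protnone, bool dirty, bool shared, bool anon_excl)
{
	if (protnone || !dirty)
		return false;
	if (!shared)
		return anon_excl;
	return true;
}

/* Decision after this patch: only shared mappings require "dirty". */
bool new_logic(bool protnone, bool dirty, bool shared, bool anon_excl)
{
	if (protnone)
		return false;
	if (!shared)
		return anon_excl;
	return dirty;
}
```

The two functions agree everywhere except for clean exclusive anonymous pages
in private mappings, which the new logic now maps writable.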


* [PATCH v1 2/6] mm/mprotect: minor can_change_pte_writable() cleanups
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

We want to replicate this code for handling PMDs soon.

(1) No need to crash the kernel; warning and rejecting are good enough. As
    this will no longer get optimized out, drop the pte_write() check: no
    harm would be done.

(2) Add a comment why PROT_NONE mapped pages are excluded.

(3) Add a comment regarding MAP_SHARED handling and why we rely on the
    dirty bit in the PTE.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/mprotect.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 86a28c0e190f..72aabffb7871 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -44,8 +44,10 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 {
 	struct page *page;
 
-	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
+	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
+		return false;
 
+	/* Don't touch entries that are not even readable. */
 	if (pte_protnone(pte))
 		return false;
 
@@ -59,15 +61,22 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 
 	if (!(vma->vm_flags & VM_SHARED)) {
 		/*
-		 * We can only special-case on exclusive anonymous pages,
-		 * because we know that our write-fault handler similarly would
-		 * map them writable without any additional checks while holding
-		 * the PT lock.
+		 * Writable MAP_PRIVATE mapping: We can only special-case on
+		 * exclusive anonymous pages, because we know that our
+		 * write-fault handler similarly would map them writable without
+		 * any additional checks while holding the PT lock.
 		 */
 		page = vm_normal_page(vma, addr, pte);
 		return page && PageAnon(page) && PageAnonExclusive(page);
 	}
 
+	/*
+	 * Writable MAP_SHARED mapping: "clean" might indicate that the FS still
+	 * needs a real write-fault for writenotify
+	 * (see vma_wants_writenotify()). If "dirty", the assumption is that the
+	 * FS was already notified and we can simply mark the PTE writable
+	 * just like the write-fault handler would do.
+	 */
 	return pte_dirty(pte);
 }
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread


* [PATCH v1 3/6] mm/huge_memory: try avoiding write faults when changing PMD protection
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

Let's replicate what we have for PTEs in can_change_pte_writable() also
for PMDs.

While this might look like a pure performance improvement, we'll use this to
get rid of savedwrite handling in do_huge_pmd_numa_page() next. The new
helper is placed strategically for that purpose.

Note that MM_CP_TRY_CHANGE_WRITABLE is currently only set when we come
via mprotect_fixup().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a524db74e9e6..2ad68e91896a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1341,6 +1341,36 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
+static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
+					   unsigned long addr, pmd_t pmd)
+{
+	struct page *page;
+
+	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
+		return false;
+
+	/* Don't touch entries that are not even readable (NUMA hinting). */
+	if (pmd_protnone(pmd))
+		return false;
+
+	/* Do we need write faults for softdirty tracking? */
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+
+	/* Do we need write faults for uffd-wp tracking? */
+	if (userfaultfd_huge_pmd_wp(vma, pmd))
+		return false;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		/* See can_change_pte_writable(). */
+		page = vm_normal_page_pmd(vma, addr, pmd);
+		return page && PageAnon(page) && PageAnonExclusive(page);
+	}
+
+	/* See can_change_pte_writable(). */
+	return pmd_dirty(pmd);
+}
+
 /* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
 static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
 					struct vm_area_struct *vma,
@@ -1844,13 +1874,17 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 */
 		entry = pmd_clear_uffd_wp(entry);
 	}
+
+	/* See change_pte_range(). */
+	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
+	    can_change_pmd_writable(vma, addr, entry))
+		entry = pmd_mkwrite(entry);
+
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
 	if (huge_pmd_needs_flush(oldpmd, entry))
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
-
-	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
 	spin_unlock(ptl);
 	return ret;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread


* [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
NUMA hinting fault") added remembering write permissions using ordinary
pte_write() for PROT_NONE mapped pages to avoid write faults when
remapping the page !PROT_NONE on NUMA hinting faults.

That commit noted:

    The patch looks hacky but the alternatives looked worse. The tidest was
    to rewalk the page tables after a hinting fault but it was more complex
    than this approach and the performance was worse. It's not generally
    safe to just mark the page writable during the fault if it's a write
    fault as it may have been read-only for COW so that approach was
    discarded.

Later, commit 288bc54949fc ("mm/autonuma: let architecture override how
the write bit should be stashed in a protnone pte.") introduced a family
of savedwrite PTE functions that didn't necessarily improve the whole
situation.

One confusing aspect is that nowadays, if a page is pte_protnone()
and pte_savedwrite(), then pte_write() is also true. Another source of
confusion is that there is only a single pte_mk_savedwrite() call in the
kernel. All other write-protection code seems to silently rely on
pte_wrprotect().

Ever since PageAnonExclusive was introduced and we started using it in
mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write
faults for exclusive anonymous pages when changing protection"), we do
have machinery in place to avoid write faults when changing protection,
which is exactly what we want to do here.

Let's similarly do what ordinary mprotect() does nowadays when upgrading
write permissions and reuse can_change_pte_writable() and
can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
writable.

For anonymous pages there should be absolutely no change: if an
anonymous page is not exclusive, it could not have been mapped writable --
because only exclusive anonymous pages can be mapped writable.

However, there *might* be a change for writable shared mappings that
require writenotify: if they are not dirty, we cannot map them writable.
While it might not matter in practice, we'd need a different way to
identify whether writenotify is actually required -- and ordinary mprotect
would benefit from that as well.

We'll remove all savedwrite leftovers next.
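The way the NUMA fault paths in the hunks below decide whether to even
attempt the upgrade (mirroring mprotect_fixup()) can be sketched as follows.
The function name is invented for illustration; in the kernel this is the
try_change_writable assignment in do_numa_page()/do_huge_pmd_numa_page().

```c
#include <assert.h>
#include <stdbool.h>

bool model_try_change_writable(bool vm_shared, bool wants_writenotify,
			       bool vm_write)
{
	/*
	 * A shared mapping that does NOT need writenotify already has the
	 * write bit in vma->vm_page_prot, so pte_modify()/pmd_modify()
	 * restores writability directly and no extra attempt is needed.
	 */
	if (vm_shared)
		return wants_writenotify;
	/* Private mappings: attempt whenever the VMA is writable at all. */
	return vm_write;
}
```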

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  2 ++
 mm/huge_memory.c   | 28 +++++++++++++++++-----------
 mm/ksm.c           |  9 ++++-----
 mm/memory.c        | 19 ++++++++++++++++---
 mm/mprotect.c      |  7 ++-----
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25ff9a14a777..a0deeece5e87 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte);
 extern unsigned long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ad68e91896a..45abd27d75a0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	int page_nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
-	bool migrated = false;
-	bool was_writable = pmd_savedwrite(oldpmd);
+	bool try_change_writable, migrated = false;
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
+	/* See mprotect_fixup(). */
+	if (vma->vm_flags & VM_SHARED)
+		try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
+	else
+		try_change_writable = !!(vma->vm_flags & VM_WRITE);
+
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	page = vm_normal_page_pmd(vma, haddr, pmd);
 	if (!page)
 		goto out_map;
 
 	/* See similar comment in do_numa_page for explanation */
-	if (!was_writable)
+	if (try_change_writable && !pmd_write(pmd) &&
+	     can_change_pmd_writable(vma, vmf->address, pmd))
+		pmd = pmd_mkwrite(pmd);
+	if (!pmd_write(pmd))
 		flags |= TNF_NO_GROUP;
 
 	page_nid = page_to_nid(page);
@@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	/* Restore the PMD */
 	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
-	if (was_writable)
+
+	/* Similar to mprotect() protection updates, avoid write faults. */
+	if (try_change_writable && !pmd_write(pmd) &&
+	     can_change_pmd_writable(vma, vmf->address, pmd))
 		pmd = pmd_mkwrite(pmd);
+
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
@@ -1764,11 +1776,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	pmd_t oldpmd, entry;
-	bool preserve_write;
-	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	int ret = 1;
 
 	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 
@@ -1779,9 +1790,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (!ptl)
 		return 0;
 
-	preserve_write = prot_numa && pmd_write(*pmd);
-	ret = 1;
-
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (is_swap_pmd(*pmd)) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
@@ -1861,8 +1869,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(oldpmd, newprot);
-	if (preserve_write)
-		entry = pmd_mk_savedwrite(entry);
 	if (uffd_wp) {
 		entry = pmd_wrprotect(entry);
 		entry = pmd_mkuffd_wp(entry);
diff --git a/mm/ksm.c b/mm/ksm.c
index dc15c4a2a6ff..dd02780c387f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1069,7 +1069,6 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	anon_exclusive = PageAnonExclusive(page);
 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
-	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
 	    anon_exclusive || mm_tlb_flush_pending(mm)) {
 		pte_t entry;
 
@@ -1107,11 +1106,11 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 		if (pte_dirty(entry))
 			set_page_dirty(page);
+		entry = pte_mkclean(entry);
+
+		if (pte_write(entry))
+			entry = pte_wrprotect(entry);
 
-		if (pte_protnone(entry))
-			entry = pte_mkclean(pte_clear_savedwrite(entry));
-		else
-			entry = pte_mkclean(pte_wrprotect(entry));
 		set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
 	}
 	*orig_pte = *pvmw.pte;
diff --git a/mm/memory.c b/mm/memory.c
index c5599a9279b1..286c29ee3aba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4672,12 +4672,12 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool try_change_writable;
 	struct page *page = NULL;
 	int page_nid = NUMA_NO_NODE;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	bool was_writable = pte_savedwrite(vmf->orig_pte);
 	int flags = 0;
 
 	/*
@@ -4692,6 +4692,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
+	/* See mprotect_fixup(). */
+	if (vma->vm_flags & VM_SHARED)
+		try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
+	else
+		try_change_writable = !!(vma->vm_flags & VM_WRITE);
+
 	/* Get the normal PTE  */
 	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
@@ -4712,7 +4718,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!was_writable)
+	if (try_change_writable && !pte_write(pte) &&
+	     can_change_pte_writable(vma, vmf->address, pte))
+		pte = pte_mkwrite(pte);
+	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*
@@ -4767,8 +4776,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
-	if (was_writable)
+
+	/* Similar to mprotect() protection updates, avoid write faults. */
+	if (try_change_writable && !pte_write(pte) &&
+	     can_change_pte_writable(vma, vmf->address, pte))
 		pte = pte_mkwrite(pte);
+
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 72aabffb7871..6c6248b65fd5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -39,8 +39,8 @@
 
 #include "internal.h"
 
-static inline bool can_change_pte_writable(struct vm_area_struct *vma,
-					   unsigned long addr, pte_t pte)
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte)
 {
 	struct page *page;
 
@@ -121,7 +121,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -177,8 +176,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
-			if (preserve_write)
-				ptent = pte_mk_savedwrite(ptent);
 
 			if (uffd_wp) {
 				ptent = pte_wrprotect(ptent);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v1 5/6] mm: remove unused savedwrite infrastructure
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

NUMA hinting no longer uses savedwrite; let's rip it out.

... and while at it, drop __pte_write() and __pmd_write() on ppc64.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 80 +-------------------
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  2 +-
 include/linux/pgtable.h                      | 24 ------
 mm/debug_vm_pgtable.c                        | 32 --------
 4 files changed, 5 insertions(+), 133 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c436d8422654..cb4c67bf45d7 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -401,35 +401,9 @@ static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
 #define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
 #define pmdp_clear_flush_young pmdp_test_and_clear_young
 
-static inline int __pte_write(pte_t pte)
-{
-	return !!(pte_raw(pte) & cpu_to_be64(_PAGE_WRITE));
-}
-
-#ifdef CONFIG_NUMA_BALANCING
-#define pte_savedwrite pte_savedwrite
-static inline bool pte_savedwrite(pte_t pte)
-{
-	/*
-	 * Saved write ptes are prot none ptes that doesn't have
-	 * privileged bit sit. We mark prot none as one which has
-	 * present and pviliged bit set and RWX cleared. To mark
-	 * protnone which used to have _PAGE_WRITE set we clear
-	 * the privileged bit.
-	 */
-	return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED));
-}
-#else
-#define pte_savedwrite pte_savedwrite
-static inline bool pte_savedwrite(pte_t pte)
-{
-	return false;
-}
-#endif
-
 static inline int pte_write(pte_t pte)
 {
-	return __pte_write(pte) || pte_savedwrite(pte);
+	return !!(pte_raw(pte) & cpu_to_be64(_PAGE_WRITE));
 }
 
 static inline int pte_read(pte_t pte)
@@ -441,24 +415,16 @@ static inline int pte_read(pte_t pte)
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 				      pte_t *ptep)
 {
-	if (__pte_write(*ptep))
+	if (pte_write(*ptep))
 		pte_update(mm, addr, ptep, _PAGE_WRITE, 0, 0);
-	else if (unlikely(pte_savedwrite(*ptep)))
-		pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 0);
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
 static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 					   unsigned long addr, pte_t *ptep)
 {
-	/*
-	 * We should not find protnone for hugetlb, but this complete the
-	 * interface.
-	 */
-	if (__pte_write(*ptep))
+	if (pte_write(*ptep))
 		pte_update(mm, addr, ptep, _PAGE_WRITE, 0, 1);
-	else if (unlikely(pte_savedwrite(*ptep)))
-		pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
@@ -535,36 +501,6 @@ static inline int pte_protnone(pte_t pte)
 	return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) ==
 		cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
 }
-
-#define pte_mk_savedwrite pte_mk_savedwrite
-static inline pte_t pte_mk_savedwrite(pte_t pte)
-{
-	/*
-	 * Used by Autonuma subsystem to preserve the write bit
-	 * while marking the pte PROT_NONE. Only allow this
-	 * on PROT_NONE pte
-	 */
-	VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX | _PAGE_PRIVILEGED)) !=
-		  cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED));
-	return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_PRIVILEGED));
-}
-
-#define pte_clear_savedwrite pte_clear_savedwrite
-static inline pte_t pte_clear_savedwrite(pte_t pte)
-{
-	/*
-	 * Used by KSM subsystem to make a protnone pte readonly.
-	 */
-	VM_BUG_ON(!pte_protnone(pte));
-	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PRIVILEGED));
-}
-#else
-#define pte_clear_savedwrite pte_clear_savedwrite
-static inline pte_t pte_clear_savedwrite(pte_t pte)
-{
-	VM_WARN_ON(1);
-	return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_WRITE));
-}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static inline bool pte_hw_valid(pte_t pte)
@@ -641,8 +577,6 @@ static inline unsigned long pte_pfn(pte_t pte)
 /* Generic modifiers for PTE bits */
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	if (unlikely(pte_savedwrite(pte)))
-		return pte_clear_savedwrite(pte);
 	return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_WRITE));
 }
 
@@ -1139,8 +1073,6 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
-#define pmd_mk_savedwrite(pmd)	pte_pmd(pte_mk_savedwrite(pmd_pte(pmd)))
-#define pmd_clear_savedwrite(pmd)	pte_pmd(pte_clear_savedwrite(pmd_pte(pmd)))
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
@@ -1162,8 +1094,6 @@ static inline int pmd_protnone(pmd_t pmd)
 #endif /* CONFIG_NUMA_BALANCING */
 
 #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
-#define __pmd_write(pmd)	__pte_write(pmd_pte(pmd))
-#define pmd_savedwrite(pmd)	pte_savedwrite(pmd_pte(pmd))
 
 #define pmd_access_permitted pmd_access_permitted
 static inline bool pmd_access_permitted(pmd_t pmd, bool write)
@@ -1241,10 +1171,8 @@ static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 				      pmd_t *pmdp)
 {
-	if (__pmd_write((*pmdp)))
+	if (pmd_write(*pmdp))
 		pmd_hugepage_update(mm, addr, pmdp, _PAGE_WRITE, 0);
-	else if (unlikely(pmd_savedwrite(*pmdp)))
-		pmd_hugepage_update(mm, addr, pmdp, 0, _PAGE_PRIVILEGED);
 }
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 5a05953ae13f..9182324dbef9 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -265,7 +265,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		}
 		pte = kvmppc_read_update_linux_pte(ptep, writing);
 		if (pte_present(pte) && !pte_protnone(pte)) {
-			if (writing && !__pte_write(pte))
+			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
 				ptel = hpte_make_readonly(ptel);
 			is_ci = pte_ci(pte);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a108b60a6962..35c80f4f06ec 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -485,30 +485,6 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
 #define pte_sw_mkyoung	pte_sw_mkyoung
 #endif
 
-#ifndef pte_savedwrite
-#define pte_savedwrite pte_write
-#endif
-
-#ifndef pte_mk_savedwrite
-#define pte_mk_savedwrite pte_mkwrite
-#endif
-
-#ifndef pte_clear_savedwrite
-#define pte_clear_savedwrite pte_wrprotect
-#endif
-
-#ifndef pmd_savedwrite
-#define pmd_savedwrite pmd_write
-#endif
-
-#ifndef pmd_mk_savedwrite
-#define pmd_mk_savedwrite pmd_mkwrite
-#endif
-
-#ifndef pmd_clear_savedwrite
-#define pmd_clear_savedwrite pmd_wrprotect
-#endif
-
 #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 2b61fde8c38c..c631ade3f1d2 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -171,18 +171,6 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
 	ptep_get_and_clear_full(args->mm, args->vaddr, args->ptep, 1);
 }
 
-static void __init pte_savedwrite_tests(struct pgtable_debug_args *args)
-{
-	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none);
-
-	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
-		return;
-
-	pr_debug("Validating PTE saved write\n");
-	WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte))));
-	WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte))));
-}
-
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx)
 {
@@ -302,22 +290,6 @@ static void __init pmd_leaf_tests(struct pgtable_debug_args *args)
 	WARN_ON(!pmd_leaf(pmd));
 }
 
-static void __init pmd_savedwrite_tests(struct pgtable_debug_args *args)
-{
-	pmd_t pmd;
-
-	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
-		return;
-
-	if (!has_transparent_hugepage())
-		return;
-
-	pr_debug("Validating PMD saved write\n");
-	pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot_none);
-	WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd))));
-	WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd))));
-}
-
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static void __init pud_basic_tests(struct pgtable_debug_args *args, int idx)
 {
@@ -451,7 +423,6 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args) { }
 static void __init pud_advanced_tests(struct pgtable_debug_args *args) { }
 static void __init pmd_leaf_tests(struct pgtable_debug_args *args) { }
 static void __init pud_leaf_tests(struct pgtable_debug_args *args) { }
-static void __init pmd_savedwrite_tests(struct pgtable_debug_args *args) { }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
@@ -1288,9 +1259,6 @@ static int __init debug_vm_pgtable(void)
 	pmd_leaf_tests(&args);
 	pud_leaf_tests(&args);
 
-	pte_savedwrite_tests(&args);
-	pmd_savedwrite_tests(&args);
-
 	pte_special_tests(&args);
 	pte_protnone_tests(&args);
 	pmd_protnone_tests(&args);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v1 6/6] selftests/vm: anon_cow: add mprotect() optimization tests
  2022-11-02 19:12 ` David Hildenbrand
@ 2022-11-02 19:12   ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-02 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linuxppc-dev, David Hildenbrand, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Nadav Amit, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

Let's extend the test to cover the possible mprotect() optimization when
removing write-protection. mprotect() must not allow write-access to a
COW-shared page by accident.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 tools/testing/selftests/vm/anon_cow.c | 49 +++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/anon_cow.c b/tools/testing/selftests/vm/anon_cow.c
index 705bd0b3db11..bbb251eb5025 100644
--- a/tools/testing/selftests/vm/anon_cow.c
+++ b/tools/testing/selftests/vm/anon_cow.c
@@ -190,7 +190,8 @@ static int child_vmsplice_memcmp_fn(char *mem, size_t size,
 
 typedef int (*child_fn)(char *mem, size_t size, struct comm_pipes *comm_pipes);
 
-static void do_test_cow_in_parent(char *mem, size_t size, child_fn fn)
+static void do_test_cow_in_parent(char *mem, size_t size, bool do_mprotect,
+				  child_fn fn)
 {
 	struct comm_pipes comm_pipes;
 	char buf;
@@ -212,6 +213,22 @@ static void do_test_cow_in_parent(char *mem, size_t size, child_fn fn)
 
 	while (read(comm_pipes.child_ready[0], &buf, 1) != 1)
 		;
+
+	if (do_mprotect) {
+		/*
+		 * mprotect() optimizations might try avoiding
+		 * write-faults by directly mapping pages writable.
+		 */
+		ret = mprotect(mem, size, PROT_READ);
+		ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
+		if (ret) {
+			ksft_test_result_fail("mprotect() failed\n");
+			write(comm_pipes.parent_ready[1], "0", 1);
+			wait(&ret);
+			goto close_comm_pipes;
+		}
+	}
+
 	/* Modify the page. */
 	memset(mem, 0xff, size);
 	write(comm_pipes.parent_ready[1], "0", 1);
@@ -229,12 +246,22 @@ static void do_test_cow_in_parent(char *mem, size_t size, child_fn fn)
 
 static void test_cow_in_parent(char *mem, size_t size)
 {
-	do_test_cow_in_parent(mem, size, child_memcmp_fn);
+	do_test_cow_in_parent(mem, size, false, child_memcmp_fn);
+}
+
+static void test_cow_in_parent_mprotect(char *mem, size_t size)
+{
+	do_test_cow_in_parent(mem, size, true, child_memcmp_fn);
 }
 
 static void test_vmsplice_in_child(char *mem, size_t size)
 {
-	do_test_cow_in_parent(mem, size, child_vmsplice_memcmp_fn);
+	do_test_cow_in_parent(mem, size, false, child_vmsplice_memcmp_fn);
+}
+
+static void test_vmsplice_in_child_mprotect(char *mem, size_t size)
+{
+	do_test_cow_in_parent(mem, size, true, child_vmsplice_memcmp_fn);
 }
 
 static void do_test_vmsplice_in_parent(char *mem, size_t size,
@@ -969,6 +996,14 @@ static const struct test_case test_cases[] = {
 		"Basic COW after fork()",
 		test_cow_in_parent,
 	},
+	/*
+	 * Basic test, but do an additional mprotect(PROT_READ)+
+	 * mprotect(PROT_READ|PROT_WRITE) in the parent before write access.
+	 */
+	{
+		"Basic COW after fork() with mprotect() optimization",
+		test_cow_in_parent_mprotect,
+	},
 	/*
 	 * vmsplice() [R/O GUP] + unmap in the child; modify in the parent. If
 	 * we miss to break COW, the child observes modifications by the parent.
@@ -978,6 +1013,14 @@ static const struct test_case test_cases[] = {
 		"vmsplice() + unmap in child",
 		test_vmsplice_in_child
 	},
+	/*
+	 * vmsplice() test, but do an additional mprotect(PROT_READ)+
+	 * mprotect(PROT_READ|PROT_WRITE) in the parent before write access.
+	 */
+	{
+		"vmsplice() + unmap in child with mprotect() optimization",
+		test_vmsplice_in_child_mprotect
+	},
 	/*
 	 * vmsplice() [R/O GUP] in parent before fork(), unmap in parent after
 	 * fork(); modify in the child. If we miss to break COW, the parent
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
  2022-11-02 19:12   ` David Hildenbrand
@ 2022-11-02 21:22     ` Nadav Amit
  -1 siblings, 0 replies; 20+ messages in thread
From: Nadav Amit @ 2022-11-02 21:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kernel list, Linux-MM, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@redhat.com> wrote:

> !! External Email
> 
> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> NUMA hinting fault") added remembering write permissions using ordinary
> pte_write() for PROT_NONE mapped pages to avoid write faults when
> remapping the page !PROT_NONE on NUMA hinting faults.
> 

[ snip ]

Here’s a very shallow review with some minor points...

> ---
> include/linux/mm.h |  2 ++
> mm/huge_memory.c   | 28 +++++++++++++++++-----------
> mm/ksm.c           |  9 ++++-----
> mm/memory.c        | 19 ++++++++++++++++---
> mm/mprotect.c      |  7 ++-----
> 5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25ff9a14a777..a0deeece5e87 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>                                            MM_CP_UFFD_WP_RESOLVE)
> 
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +                            pte_t pte);

It might not be customary, but how about marking it as __pure?

> extern unsigned long change_protection(struct mmu_gather *tlb,
>                              struct vm_area_struct *vma, unsigned long start,
>                              unsigned long end, pgprot_t newprot,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2ad68e91896a..45abd27d75a0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>        unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>        int page_nid = NUMA_NO_NODE;
>        int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> -       bool migrated = false;
> -       bool was_writable = pmd_savedwrite(oldpmd);
> +       bool try_change_writable, migrated = false;
>        int flags = 0;
> 
>        vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>                goto out;
>        }
> 
> +       /* See mprotect_fixup(). */
> +       if (vma->vm_flags & VM_SHARED)
> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
> +       else
> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);

Do you find it better to copy the code instead of extracting it to a
separate function?

> +
>        pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>        page = vm_normal_page_pmd(vma, haddr, pmd);
>        if (!page)
>                goto out_map;
> 
>        /* See similar comment in do_numa_page for explanation */
> -       if (!was_writable)
> +       if (try_change_writable && !pmd_write(pmd) &&
> +            can_change_pmd_writable(vma, vmf->address, pmd))
> +               pmd = pmd_mkwrite(pmd);
> +       if (!pmd_write(pmd))
>                flags |= TNF_NO_GROUP;
> 
>        page_nid = page_to_nid(page);
> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>        /* Restore the PMD */
>        pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>        pmd = pmd_mkyoung(pmd);
> -       if (was_writable)
> +
> +       /* Similar to mprotect() protection updates, avoid write faults. */
> +       if (try_change_writable && !pmd_write(pmd) &&
> +            can_change_pmd_writable(vma, vmf->address, pmd))

Why do I have a deja-vu? :)

There must be a way to avoid the redundant code and specifically the call to
can_change_pmd_writable(), no?

>                pmd = pmd_mkwrite(pmd);
> +
>        set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
>        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
>        spin_unlock(vmf->ptl);
> @@ -1764,11 +1776,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>        struct mm_struct *mm = vma->vm_mm;
>        spinlock_t *ptl;
>        pmd_t oldpmd, entry;
> -       bool preserve_write;
> -       int ret;
>        bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>        bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>        bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +       int ret = 1;
> 
>        tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> 
> @@ -1779,9 +1790,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>        if (!ptl)
>                return 0;
> 
> -       preserve_write = prot_numa && pmd_write(*pmd);
> -       ret = 1;
> -
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>        if (is_swap_pmd(*pmd)) {
>                swp_entry_t entry = pmd_to_swp_entry(*pmd);
> @@ -1861,8 +1869,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>        oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
> 
>        entry = pmd_modify(oldpmd, newprot);
> -       if (preserve_write)
> -               entry = pmd_mk_savedwrite(entry);
>        if (uffd_wp) {
>                entry = pmd_wrprotect(entry);
>                entry = pmd_mkuffd_wp(entry);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index dc15c4a2a6ff..dd02780c387f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1069,7 +1069,6 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> 
>        anon_exclusive = PageAnonExclusive(page);
>        if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> -           (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||

Not related to your code, but it does not make me comfortable that PTE’s
status bits (which are volatile) are accessed in this manner.

Especially since the PTE is later saved into orig_pte. It would feel safer
to do READ_ONCE(*pvmw.pte) and work on it (probably in a separate patch).

>            anon_exclusive || mm_tlb_flush_pending(mm)) {
>                pte_t entry;
> 
> @@ -1107,11 +1106,11 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> 
>                if (pte_dirty(entry))
>                        set_page_dirty(page);
> +               entry = pte_mkclean(entry);
> +
> +               if (pte_write(entry))
> +                       entry = pte_wrprotect(entry);
> 
> -               if (pte_protnone(entry))
> -                       entry = pte_mkclean(pte_clear_savedwrite(entry));
> -               else
> -                       entry = pte_mkclean(pte_wrprotect(entry));
>                set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
>        }
>        *orig_pte = *pvmw.pte;
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..286c29ee3aba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4672,12 +4672,12 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> static vm_fault_t do_numa_page(struct vm_fault *vmf)
> {
>        struct vm_area_struct *vma = vmf->vma;
> +       bool try_change_writable;
>        struct page *page = NULL;
>        int page_nid = NUMA_NO_NODE;
>        int last_cpupid;
>        int target_nid;
>        pte_t pte, old_pte;
> -       bool was_writable = pte_savedwrite(vmf->orig_pte);
>        int flags = 0;
> 
>        /*
> @@ -4692,6 +4692,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>                goto out;
>        }
> 
> +       /* See mprotect_fixup(). */
> +       if (vma->vm_flags & VM_SHARED)
> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
> +       else
> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);

It really cannot be extracted into a separate function?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
  2022-11-02 21:22     ` Nadav Amit
@ 2022-11-03 10:45       ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-03 10:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: kernel list, Linux-MM, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

On 02.11.22 22:22, Nadav Amit wrote:
> On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@redhat.com> wrote:
> 
>> !! External Email
>>
>> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
>> NUMA hinting fault") added remembering write permissions using ordinary
>> pte_write() for PROT_NONE mapped pages to avoid write faults when
>> remapping the page !PROT_NONE on NUMA hinting faults.
>>
> 
> [ snip ]
> 
> Here’s a very shallow review with some minor points...

Appreciated.

> 
>> ---
>> include/linux/mm.h |  2 ++
>> mm/huge_memory.c   | 28 +++++++++++++++++-----------
>> mm/ksm.c           |  9 ++++-----
>> mm/memory.c        | 19 ++++++++++++++++---
>> mm/mprotect.c      |  7 ++-----
>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 25ff9a14a777..a0deeece5e87 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>> #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>                                             MM_CP_UFFD_WP_RESOLVE)
>>
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                            pte_t pte);
> 
> It might not be customary, but how about marking it as __pure?

Right, there is not a single use of __pure in the mm domain.

> 
>> extern unsigned long change_protection(struct mmu_gather *tlb,
>>                               struct vm_area_struct *vma, unsigned long start,
>>                               unsigned long end, pgprot_t newprot,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2ad68e91896a..45abd27d75a0 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>         int page_nid = NUMA_NO_NODE;
>>         int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>> -       bool migrated = false;
>> -       bool was_writable = pmd_savedwrite(oldpmd);
>> +       bool try_change_writable, migrated = false;
>>         int flags = 0;
>>
>>         vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>                 goto out;
>>         }
>>
>> +       /* See mprotect_fixup(). */
>> +       if (vma->vm_flags & VM_SHARED)
>> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>> +       else
>> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);
> 
> Do you find it better to copy the code instead of extracting it to a
> separate function?

Yeah, you're right ;) usually the issue is coming up with a suitable name. Let me try.

vma_wants_manual_writability_change() hmm ...

> 
>> +
>>         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>         page = vm_normal_page_pmd(vma, haddr, pmd);
>>         if (!page)
>>                 goto out_map;
>>
>>         /* See similar comment in do_numa_page for explanation */
>> -       if (!was_writable)
>> +       if (try_change_writable && !pmd_write(pmd) &&
>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>> +               pmd = pmd_mkwrite(pmd);
>> +       if (!pmd_write(pmd))
>>                 flags |= TNF_NO_GROUP;
>>
>>         page_nid = page_to_nid(page);
>> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>         /* Restore the PMD */
>>         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>         pmd = pmd_mkyoung(pmd);
>> -       if (was_writable)
>> +
>> +       /* Similar to mprotect() protection updates, avoid write faults. */
>> +       if (try_change_writable && !pmd_write(pmd) &&
>> +            can_change_pmd_writable(vma, vmf->address, pmd))
> 
> Why do I have a deja-vu? :)
> 
> There must be a way to avoid the redundant code and specifically the call to
> can_change_pmd_writable(), no?

The issue is that as soon as we drop the page table lock, that information is stale.
Especially, after we fail migration.

So the following should work, however, if we fail migration we wouldn't map the
page writable and would have to re-calculate:

diff --git a/mm/memory.c b/mm/memory.c
index c5599a9279b1..a997625641e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4674,10 +4674,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         struct vm_area_struct *vma = vmf->vma;
         struct page *page = NULL;
         int page_nid = NUMA_NO_NODE;
+       bool writable = false;
         int last_cpupid;
         int target_nid;
         pte_t pte, old_pte;
-       bool was_writable = pte_savedwrite(vmf->orig_pte);
         int flags = 0;
  
         /*
@@ -4696,6 +4696,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         old_pte = ptep_get(vmf->pte);
         pte = pte_modify(old_pte, vma->vm_page_prot);
  
+       /*
+        * Detect now whether the PTE is or can be writable. Note that this
+        * information is valid as long as we're holding the PT lock, so also on
+        * the remap path below.
+        */
+       writable = pte_write(pte);
+       if (!writable && vma_wants_manual_writability_change(vma) &&
+           can_change_pte_writable(vma, vmf->address, pte)) {
+           writable = true;
+       }
+
         page = vm_normal_page(vma, vmf->address, pte);
         if (!page || is_zone_device_page(page))
                 goto out_map;
@@ -4712,7 +4723,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
          * pte_dirty has unpredictable behaviour between PTE scan updates,
          * background writeback, dirty balancing and application behaviour.
          */
-       if (!was_writable)
+       if (!writable)
                 flags |= TNF_NO_GROUP;
  
         /*
@@ -4738,6 +4749,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
                 put_page(page);
                 goto out_map;
         }
+       writable = false;
         pte_unmap_unlock(vmf->pte, vmf->ptl);
  
         /* Migrate to the requested node */
@@ -4767,7 +4779,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
         pte = pte_modify(old_pte, vma->vm_page_prot);
         pte = pte_mkyoung(pte);
-       if (was_writable)
+       if (writable)
                 pte = pte_mkwrite(pte);
         ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
         update_mmu_cache(vma, vmf->address, vmf->pte);


To me, the less error-prone approach is to re-calculate.
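For reference, a sketch of what such an extracted helper could look like (the name follows the one floated earlier in the thread; this is not code from the posted series):

```c
/*
 * Hypothetical helper consolidating the writability-check logic that is
 * currently duplicated in do_huge_pmd_numa_page() and do_numa_page().
 */
static bool vma_wants_manual_writability_change(struct vm_area_struct *vma)
{
	/* See mprotect_fixup(): shared mappings may require writenotify. */
	if (vma->vm_flags & VM_SHARED)
		return vma_wants_writenotify(vma, vma->vm_page_prot);
	return !!(vma->vm_flags & VM_WRITE);
}
```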


[...]
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1069,7 +1069,6 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>>
>>         anon_exclusive = PageAnonExclusive(page);
>>         if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
>> -           (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
> 
> Not related to your code, but it does not make me comfortable that PTE’s
> status bits (which are volatile) are not accessed in this manner.
> 
> Especially since the PTE is later saved into orig_pte. It would feel safer
> to do READ_ONCE(*pvmw.pte) and work on it (probably in a separate patch).

I assume you are talking about the dirty bit. I don't immediately see how something
could go wrong here, but I agree that it might look cleaner that way.

Anyhow, independent of this series, so I'll leave that alone for now but add a
note for the future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
@ 2022-11-03 10:45       ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-03 10:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, Mel Gorman, Anshuman Khandual, linuxppc-dev,
	Dave Chinner, kernel list, Peter Xu, Linux-MM, Hugh Dickins,
	Nicholas Piggin, Mike Rapoport, Andrew Morton, Linus Torvalds,
	Vlastimil Babka

On 02.11.22 22:22, Nadav Amit wrote:
> On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@redhat.com> wrote:
> 
>> !! External Email
>>
>> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
>> NUMA hinting fault") added remembering write permissions using ordinary
>> pte_write() for PROT_NONE mapped pages to avoid write faults when
>> remapping the page !PROT_NONE on NUMA hinting faults.
>>
> 
> [ snip ]
> 
> Here’s a very shallow review with some minor points...

Appreciated.

> 
>> ---
>> include/linux/mm.h |  2 ++
>> mm/huge_memory.c   | 28 +++++++++++++++++-----------
>> mm/ksm.c           |  9 ++++-----
>> mm/memory.c        | 19 ++++++++++++++++---
>> mm/mprotect.c      |  7 ++-----
>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 25ff9a14a777..a0deeece5e87 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>> #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>                                             MM_CP_UFFD_WP_RESOLVE)
>>
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                            pte_t pte);
> 
> It might not be customary, but how about marking it as __pure?

Right, there is not a single use of __pure in the mm domain.

> 
>> extern unsigned long change_protection(struct mmu_gather *tlb,
>>                               struct vm_area_struct *vma, unsigned long start,
>>                               unsigned long end, pgprot_t newprot,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2ad68e91896a..45abd27d75a0 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>         int page_nid = NUMA_NO_NODE;
>>         int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>> -       bool migrated = false;
>> -       bool was_writable = pmd_savedwrite(oldpmd);
>> +       bool try_change_writable, migrated = false;
>>         int flags = 0;
>>
>>         vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>                 goto out;
>>         }
>>
>> +       /* See mprotect_fixup(). */
>> +       if (vma->vm_flags & VM_SHARED)
>> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>> +       else
>> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);
> 
> Do you find it better to copy the code instead of extracting it to a
> separate function?

Yeah, you're right ;) usually the issue is coming up with a suitable name. Let me try.

vma_wants_manual_writability_change() hmm ...

> 
>> +
>>         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>         page = vm_normal_page_pmd(vma, haddr, pmd);
>>         if (!page)
>>                 goto out_map;
>>
>>         /* See similar comment in do_numa_page for explanation */
>> -       if (!was_writable)
>> +       if (try_change_writable && !pmd_write(pmd) &&
>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>> +               pmd = pmd_mkwrite(pmd);
>> +       if (!pmd_write(pmd))
>>                 flags |= TNF_NO_GROUP;
>>
>>         page_nid = page_to_nid(page);
>> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>         /* Restore the PMD */
>>         pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>         pmd = pmd_mkyoung(pmd);
>> -       if (was_writable)
>> +
>> +       /* Similar to mprotect() protection updates, avoid write faults. */
>> +       if (try_change_writable && !pmd_write(pmd) &&
>> +            can_change_pmd_writable(vma, vmf->address, pmd))
> 
> Why do I have a deja-vu? :)
> 
> There must be a way to avoid the redundant code and specifically the call to
> can_change_pmd_writable(), no?

The issue is that as soon as we drop the page table lock, that information is stale.
Especially after we fail migration.

So the following should work; however, if we fail migration we wouldn't map the
page writable and would have to re-calculate:

diff --git a/mm/memory.c b/mm/memory.c
index c5599a9279b1..a997625641e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4674,10 +4674,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         struct vm_area_struct *vma = vmf->vma;
         struct page *page = NULL;
         int page_nid = NUMA_NO_NODE;
+       bool writable = false;
         int last_cpupid;
         int target_nid;
         pte_t pte, old_pte;
-       bool was_writable = pte_savedwrite(vmf->orig_pte);
         int flags = 0;
  
         /*
@@ -4696,6 +4696,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         old_pte = ptep_get(vmf->pte);
         pte = pte_modify(old_pte, vma->vm_page_prot);
  
+       /*
+        * Detect now whether the PTE is or can be writable. Note that this
+        * information is valid as long as we're holding the PT lock, so also on
+        * the remap path below.
+        */
+       writable = pte_write(pte);
+       if (!writable && vma_wants_manual_writability_change(vma) &&
+           can_change_pte_writable(vma, vmf->address, pte)) {
+           writable = true;
+       }
+
         page = vm_normal_page(vma, vmf->address, pte);
         if (!page || is_zone_device_page(page))
                 goto out_map;
@@ -4712,7 +4723,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
          * pte_dirty has unpredictable behaviour between PTE scan updates,
          * background writeback, dirty balancing and application behaviour.
          */
-       if (!was_writable)
+       if (!writable)
                 flags |= TNF_NO_GROUP;
  
         /*
@@ -4738,6 +4749,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
                 put_page(page);
                 goto out_map;
         }
+       writable = false;
         pte_unmap_unlock(vmf->pte, vmf->ptl);
  
         /* Migrate to the requested node */
@@ -4767,7 +4779,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
         pte = pte_modify(old_pte, vma->vm_page_prot);
         pte = pte_mkyoung(pte);
-       if (was_writable)
+       if (writable)
                 pte = pte_mkwrite(pte);
         ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
         update_mmu_cache(vma, vmf->address, vmf->pte);


To me, the less error-prone approach is to re-calculate.


[...]
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1069,7 +1069,6 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>>
>>         anon_exclusive = PageAnonExclusive(page);
>>         if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
>> -           (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
> 
> Not related to your code, but it does not make me comfortable that PTE’s
> status bits (which are volatile) are not accessed in this manner.
> 
> Especially since the PTE is later saved into orig_pte. It would feel safer
> to do READ_ONCE(*pvmw.pte) and work on it (probably in a separate patch).

I assume you are talking about the dirty bit. I don't immediately see how something
could go wrong here, but I agree that it might look cleaner that way.

Anyhow, independent of this series, so I'll leave that alone for now but add a
note for the future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite
  2022-11-03 10:45       ` David Hildenbrand
@ 2022-11-03 10:51         ` David Hildenbrand
  -1 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-11-03 10:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: kernel list, Linux-MM, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Mel Gorman, Dave Chinner, Peter Xu,
	Andrea Arcangeli, Hugh Dickins, Vlastimil Babka,
	Michael Ellerman, Nicholas Piggin, Mike Rapoport,
	Anshuman Khandual

On 03.11.22 11:45, David Hildenbrand wrote:
> On 02.11.22 22:22, Nadav Amit wrote:
>> On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@redhat.com> wrote:
>>
>>> !! External Email
>>>
>>> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
>>> NUMA hinting fault") added remembering write permissions using ordinary
>>> pte_write() for PROT_NONE mapped pages to avoid write faults when
>>> remapping the page !PROT_NONE on NUMA hinting faults.
>>>
>>
>> [ snip ]
>>
>> Here’s a very shallow review with some minor points...
> 
> Appreciated.
> 
>>
>>> ---
>>> include/linux/mm.h |  2 ++
>>> mm/huge_memory.c   | 28 +++++++++++++++++-----------
>>> mm/ksm.c           |  9 ++++-----
>>> mm/memory.c        | 19 ++++++++++++++++---
>>> mm/mprotect.c      |  7 ++-----
>>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 25ff9a14a777..a0deeece5e87 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>> #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>                                              MM_CP_UFFD_WP_RESOLVE)
>>>
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                            pte_t pte);
>>
>> It might not be customary, but how about marking it as __pure?
> 
> Right, there is not a single use of __pure in the mm domain.
> 
>>
>>> extern unsigned long change_protection(struct mmu_gather *tlb,
>>>                                struct vm_area_struct *vma, unsigned long start,
>>>                                unsigned long end, pgprot_t newprot,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 2ad68e91896a..45abd27d75a0 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>          unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>          int page_nid = NUMA_NO_NODE;
>>>          int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> -       bool migrated = false;
>>> -       bool was_writable = pmd_savedwrite(oldpmd);
>>> +       bool try_change_writable, migrated = false;
>>>          int flags = 0;
>>>
>>>          vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>                  goto out;
>>>          }
>>>
>>> +       /* See mprotect_fixup(). */
>>> +       if (vma->vm_flags & VM_SHARED)
>>> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> +       else
>>> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);
>>
>> Do you find it better to copy the code instead of extracting it to a
>> separate function?
> 
> Yeah, you're right ;) usually the issue is coming up with a suitable name. Let me try.
> 
> vma_wants_manual_writability_change() hmm ...
> 
>>
>>> +
>>>          pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>>          page = vm_normal_page_pmd(vma, haddr, pmd);
>>>          if (!page)
>>>                  goto out_map;
>>>
>>>          /* See similar comment in do_numa_page for explanation */
>>> -       if (!was_writable)
>>> +       if (try_change_writable && !pmd_write(pmd) &&
>>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>>> +               pmd = pmd_mkwrite(pmd);
>>> +       if (!pmd_write(pmd))
>>>                  flags |= TNF_NO_GROUP;
>>>
>>>          page_nid = page_to_nid(page);
>>> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>          /* Restore the PMD */
>>>          pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>>          pmd = pmd_mkyoung(pmd);
>>> -       if (was_writable)
>>> +
>>> +       /* Similar to mprotect() protection updates, avoid write faults. */
>>> +       if (try_change_writable && !pmd_write(pmd) &&
>>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>>
>> Why do I have a deja-vu? :)
>>
>> There must be a way to avoid the redundant code and specifically the call to
>> can_change_pmd_writable(), no?
> 
> The issue is that as soon as we drop the page table lock, that information is stale.
> Especially after we fail migration.
> 
> So the following should work; however, if we fail migration we wouldn't map the
> page writable and would have to re-calculate:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..a997625641e4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4674,10 +4674,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           struct vm_area_struct *vma = vmf->vma;
>           struct page *page = NULL;
>           int page_nid = NUMA_NO_NODE;
> +       bool writable = false;
>           int last_cpupid;
>           int target_nid;
>           pte_t pte, old_pte;
> -       bool was_writable = pte_savedwrite(vmf->orig_pte);
>           int flags = 0;
>    
>           /*
> @@ -4696,6 +4696,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           old_pte = ptep_get(vmf->pte);
>           pte = pte_modify(old_pte, vma->vm_page_prot);
>    
> +       /*
> +        * Detect now whether the PTE is or can be writable. Note that this
> +        * information is valid as long as we're holding the PT lock, so also on
> +        * the remap path below.
> +        */
> +       writable = pte_write(pte);
> +       if (!writable && vma_wants_manual_writability_change(vma) &&
> +           can_change_pte_writable(vma, vmf->address, pte)) {
> +           writable = true;
> +       }
> +
>           page = vm_normal_page(vma, vmf->address, pte);
>           if (!page || is_zone_device_page(page))
>                   goto out_map;
> @@ -4712,7 +4723,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>            * pte_dirty has unpredictable behaviour between PTE scan updates,
>            * background writeback, dirty balancing and application behaviour.
>            */
> -       if (!was_writable)
> +       if (!writable)
>                   flags |= TNF_NO_GROUP;
>    
>           /*
> @@ -4738,6 +4749,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>                   put_page(page);
>                   goto out_map;
>           }
> +       writable = false;
>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>    
>           /* Migrate to the requested node */
> @@ -4767,7 +4779,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
>           pte = pte_modify(old_pte, vma->vm_page_prot);
>           pte = pte_mkyoung(pte);
> -       if (was_writable)
> +       if (writable)
>                   pte = pte_mkwrite(pte);
>           ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
>           update_mmu_cache(vma, vmf->address, vmf->pte);
> 
> 
> To me, the less error-prone approach is to re-calculate.

Hmm, thinking again, the "if (unlikely(!pte_same(*vmf->pte, 
vmf->orig_pte))) {" check might actually not require us to recalculate.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-11-03 10:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-02 19:12 [PATCH v1 0/6] mm/autonuma: replace savedwrite infrastructure David Hildenbrand
2022-11-02 19:12 ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 1/6] mm/mprotect: allow clean exclusive anon pages to be writable David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 2/6] mm/mprotect: minor can_change_pte_writable() cleanups David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 3/6] mm/huge_memory: try avoiding write faults when changing PMD protection David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
2022-11-02 21:22   ` Nadav Amit
2022-11-02 21:22     ` Nadav Amit
2022-11-03 10:45     ` David Hildenbrand
2022-11-03 10:45       ` David Hildenbrand
2022-11-03 10:51       ` David Hildenbrand
2022-11-03 10:51         ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 5/6] mm: remove unused savedwrite infrastructure David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
2022-11-02 19:12 ` [PATCH v1 6/6] selftests/vm: anon_cow: add mprotect() optimization tests David Hildenbrand
2022-11-02 19:12   ` David Hildenbrand
