* [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi.
@ 2022-07-18 12:01 Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Following the optimizations that avoid unnecessary TLB flushes [1],
mprotect() and userfaultfd() no longer flush the TLB when the protection
is unchanged. This enabled userfaultfd to write-unprotect a page without
triggering a TLB flush (and potentially a shootdown).

After these changes, David added another feature to mprotect() [2],
allowing pages that can safely be mapped as writable to be mapped as such
directly from mprotect(), instead of going through the page-fault handler.
This saves the overhead of a page-fault when write-unprotecting private
exclusive pages, for instance.

This change, however, introduced some undesired behaviors, especially if
we adopt this new feature for userfaultfd. First, the newly mapped PTE is
not set as dirty, which on x86 can induce over 500 cycles of overhead when
the PTE is first written (if the page was not dirty before). Second, once
again we can have an expensive TLB shootdown when we write-unprotect a
page: when we relax the protection (i.e., grant more permissions), we
would do a TLB flush. If the application is multithreaded, or a
userfaultfd monitor uses write-unprotect (which is a common case), a TLB
shootdown would be needed.
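
For reference, the write-unprotect operation in question is the existing
UFFDIO_WRITEPROTECT ioctl. A minimal userspace sketch (error handling and
the userfaultfd registration omitted; "uffd", "addr" and "page_size" are
placeholders from the monitor's setup):

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = page_size },
		.mode  = 0,	/* clear UFFDIO_WRITEPROTECT_MODE_WP */
	};
	/* each such call can currently imply a TLB shootdown */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);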

This patch-set allows userfaultfd to map pages as writable directly upon
the write-(un)protect ioctl, while addressing the undesired behaviors that
occur when userfaultfd write-unprotect or mprotect is used to add
permissions. It also does some cleanup and micro-optimizations along the
way.

The main change in this patch-set - x86-specific, at the moment - is the
introduction of "relaxed" TLB flushes when permissions are added. Upon a
"relaxed" TLB flush, the mm's TLB generation is advanced and the local TLB
is flushed, but no TLB shootdown takes place. If a spurious page-fault
later occurs and the local TLB generation is found to be out of sync with
the mm's generation, a full TLB flush is performed on the faulting core to
prevent further spurious page-faults.
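
Schematically (an illustrative sketch only; apart from inc_mm_tlb_gen(),
the names below are placeholders for the actual x86 code in the patches):

	static void relaxed_tlb_flush(struct mm_struct *mm)
	{
		inc_mm_tlb_gen(mm);	/* advance the mm's TLB generation */
		local_tlb_flush();	/* flush this CPU only; no IPIs */
	}

	/* later, in the page-fault path of a CPU that faults spuriously */
	if (local_tlb_gen != atomic64_read(&mm->context.tlb_gen))
		local_tlb_flush();	/* catch up; avoid further faults */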

To a certain extent, "relaxed flushes" are similar to the changes that
were proposed some time ago for kernel mappings [3]. However, they do not
have any complicated interactions with NMI handlers.

Experiments on Haswell show the performance improvement. Running, for a
single page, a loop of (1) mprotect(PROT_READ); (2)
mprotect(PROT_READ|PROT_WRITE); and then (3) an access gives the following
results (on bare metal this time; a sketch of the loop appears after the
results):

mprotect(PROT_READ) time in cycles:

			1 Thread	2 Threads
Before (5.19rc4+)	2499		4655
+patch			2495		4363 (-6%)


mprotect(PROT_READ|PROT_WRITE) in cycles:

			1 Thread	2 Threads
Before (5.19rc4+)	2529		4675
+patch			2496		2615 (-44%)

If MADV_FREE was run or the page was not dirty, this patch-set can also
shorten the mprotect(PROT_READ) time by skipping the TLB shootdown.
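
For clarity, the measured loop is roughly the following (a sketch; the
cycle measurement, thread setup and "iterations" are placeholders and are
not part of this cover letter):

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	p[0] = 1;					/* populate the page */

	for (int i = 0; i < iterations; i++) {
		mprotect(p, 4096, PROT_READ);			/* (1) */
		mprotect(p, 4096, PROT_READ | PROT_WRITE);	/* (2) */
		p[0] = i;					/* (3) access */
	}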

[1] https://lore.kernel.org/all/20220401180821.1986781-1-namit@vmware.com/
[2] https://lore.kernel.org/all/20220614093629.76309-1-david@redhat.com/
[3] https://lore.kernel.org/all/4797D64D.1060105@goop.org/

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>

Nadav Amit (14):
  userfaultfd: set dirty and young on writeprotect
  userfaultfd: try to map write-unprotected pages
  mm/mprotect: allow exclusive anon pages to be writable
  mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
  x86/mm: check exec permissions on fault
  mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
  mm: do fix spurious page-faults for instruction faults
  x86/mm: introduce flush_tlb_fix_spurious_fault
  mm: introduce relaxed TLB flushes
  x86/mm: introduce relaxed TLB flushes
  x86/mm: use relaxed TLB flushes when protection is removed
  x86/tlb: no flush on PTE change from RW->RO when PTE is clean
  mm/mprotect: do not check flush type if a strict is needed
  mm: conditional check of pfn in pte_flush_type

 arch/x86/include/asm/pgtable.h  |   4 +-
 arch/x86/include/asm/tlb.h      |   3 +-
 arch/x86/include/asm/tlbflush.h |  90 +++++++++++++++++--------
 arch/x86/kernel/alternative.c   |   2 +-
 arch/x86/kernel/ldt.c           |   3 +-
 arch/x86/mm/fault.c             |  22 +++++-
 arch/x86/mm/tlb.c               |  21 +++++-
 include/asm-generic/tlb.h       | 116 +++++++++++++++++++-------------
 include/linux/mm.h              |   2 +
 include/linux/mm_types.h        |   6 ++
 mm/huge_memory.c                |   9 ++-
 mm/hugetlb.c                    |   2 +-
 mm/memory.c                     |   2 +-
 mm/mmu_gather.c                 |   1 +
 mm/mprotect.c                   |  31 ++++++---
 mm/rmap.c                       |  16 +++--
 mm/userfaultfd.c                |  10 ++-
 17 files changed, 237 insertions(+), 103 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
@ 2022-07-18 12:01 ` Nadav Amit
  2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:42   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When userfaultfd makes a PTE writable, it can now change the PTE
directly, in some cases, without triggering a page-fault first. Yet, doing
so might leave the PTE that was write-unprotected as old and clean. At
least on x86, this would cause a >500 cycle overhead when the PTE is first
accessed.

Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
gets a hint that the page is likely to be used. Do not mark the PTE as
young and dirty in other cases, to avoid excessive writeback and
interference with the page-reclamation logic.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/mm.h | 2 ++
 mm/mprotect.c      | 9 ++++++++-
 mm/userfaultfd.c   | 8 ++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9cc02a7e503b..4afd75ce5875 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 /* Whether this change is for write protecting */
 #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
 #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+/* Whether to try to mark entries as dirty as they are to be written */
+#define  MM_CP_WILL_NEED		   (1UL << 4)
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 996a97e213ad..34c2dfb68c42 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	bool will_need = cp_flags & MM_CP_WILL_NEED;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
+			if (will_need)
+				ptent = pte_mkyoung(ptent);
+
 			/*
 			 * In some writable, shared mappings, we might want
 			 * to catch actual write access -- see
@@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 */
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
+			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
+				if (will_need)
+					ptent = pte_mkdirty(ptent);
+			}
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 954c6980b29f..e0492f5f06a0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
+	unsigned long cp_flags;
 	struct mmu_gather tlb;
 	pgprot_t newprot;
 	int err;
@@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	else
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
+	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
+	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_WILL_NEED;
+
 	tlb_gather_mmu(&tlb, dst_mm);
-	change_protection(&tlb, dst_vma, start, start + len, newprot,
-			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
 	tlb_finish_mmu(&tlb);
 
 	err = 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-19 20:49   ` Peter Xu
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When the userfaultfd write-unprotect ioctl is used and a write hint is
given, try to change the PTE to be writable. This saves a subsequent
page-fault.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e0492f5f06a0..6013b217e9f3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -799,6 +799,8 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
 	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
 		cp_flags |= MM_CP_WILL_NEED;
+	if (!enable_wp && (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_gather_mmu(&tlb, dst_mm);
 	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-20 15:19   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect() from making them writable if they are exclusive.
Therefore, skip the dirty-bit test in this case.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 34c2dfb68c42..da5b9bf8204f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 
 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
 
-	if (pte_protnone(pte) || !pte_dirty(pte))
+	if (pte_protnone(pte))
 		return false;
 
 	/* Do we need write faults for softdirty tracking? */
@@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pte);
 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
 			return false;
-	}
+	} else if (!pte_dirty(pte))
+		return false;
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (2 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When MM_CP_TRY_CHANGE_WRITABLE is used, change_pte_range() tries to set
PTEs as writable.

Yet, writable PTEs might still become read-only, due to various
limitations of the logic that determines whether a PTE can become
writable (see can_change_pte_writable()). Anyhow, it is much easier to
keep the writable bit set when MM_CP_TRY_CHANGE_WRITABLE is used than to
first clear it and then figure out whether it can be set again.

Preserve the write-bit when MM_CP_TRY_CHANGE_WRITABLE is used, similarly
to the way it is done with NUMA.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index da5b9bf8204f..92bfb17dcb8a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 	bool will_need = cp_flags & MM_CP_WILL_NEED;
+	bool try_change_writable = cp_flags & MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -114,7 +115,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || try_change_writable) &&
+					       pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -190,8 +192,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
+			if (try_change_writable && !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
 				if (will_need)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 05/14] x86/mm: check exec permissions on fault
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (3 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.

This has not appeared to be an issue so far, but the next patches avoid
TLB flushes on permission promotion, which can lead to this scenario.
nodejs, for instance, crashes when the TLB flush is avoided on permission
promotion.

Add a check so that access_error() does not mistakenly report spurious
page-faults due to instruction fetches as access errors.

The change assumes that the "instruction fetch" and "write" bits of the
hardware error code are mutually exclusive. However, to be on the safe
side, especially if hypervisors misbehave, check that this is indeed the
case and warn otherwise.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fe10c6d76bac..00013c1fac3f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
-	if (error_code & X86_PF_WRITE) {
+	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
+		/*
+		 * CPUs are not expected to set the two error code bits
+		 * together, but to ensure that hypervisors do not misbehave,
+		 * run an additional sanity check.
+		 */
+		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
+					(X86_PF_WRITE|X86_PF_INSTR)) {
+			WARN_ON_ONCE(1);
+			return 1;
+		}
+
 		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+		if ((error_code & X86_PF_WRITE) &&
+		    unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+
+		/* exec, present and exec, not present: */
+		if ((error_code & X86_PF_INSTR) &&
+		    unlikely(!(vma->vm_flags & VM_EXEC)))
 			return 1;
+
 		return 0;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (4 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

x86 is capable of avoiding a TLB flush when a clean writable entry is
write-protected. page_vma_mkclean_one() does not take advantage of this
behavior. Adapt it to do so.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/rmap.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 83172ee0ea35..23997c387858 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -961,17 +961,25 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 
 		address = pvmw->address;
 		if (pvmw->pte) {
-			pte_t entry;
+			pte_t entry, oldpte;
 			pte_t *pte = pvmw->pte;
 
 			if (!pte_dirty(*pte) && !pte_write(*pte))
 				continue;
 
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			entry = ptep_clear_flush(vma, address, pte);
-			entry = pte_wrprotect(entry);
+			oldpte = ptep_modify_prot_start(pvmw->vma, address,
+							pte);
+
+			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
-			set_pte_at(vma->vm_mm, address, pte, entry);
+
+			if (pte_needs_flush(oldpte, entry) ||
+			    mm_tlb_flush_pending(vma->vm_mm))
+				flush_tlb_page(vma, address);
+
+			ptep_modify_prot_commit(vma, address, pte, oldpte,
+						entry);
 			ret = 1;
 		} else {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (5 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches might cause spurious instruction faults on x86. To
prevent them from occurring repeatedly, call
flush_tlb_fix_spurious_fault() for page-faults on instruction fetches as
well. The callee is expected to do a full flush, or whatever is necessary
to avoid further spurious page-faults.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 31ec3f0071a2..152a47876c36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4924,7 +4924,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * This still avoids useless tlb flushes for .text page faults
 		 * with threads.
 		 */
-		if (vmf->flags & FAULT_FLAG_WRITE)
+		if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_INSTRUCTION))
 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
 	}
 unlock:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (6 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches introduce relaxed TLB flushes for x86, which require a
full TLB flush upon a spurious page-fault. If a spurious page-fault occurs
on x86, check whether the local TLB generation is out of sync with the
mm's generation and perform a TLB flush if needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h |  4 +++-
 arch/x86/mm/tlb.c              | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..1fbdaff1bb7a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1079,7 +1079,9 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+extern void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+					 unsigned long address);
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
 
 #define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d400b6d9d246..ff3bcc55435e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -955,6 +955,23 @@ static void put_flush_tlb_info(void)
 #endif
 }
 
+void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	u64 mm_tlb_gen = atomic64_read(&vma->vm_mm->context.tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	struct flush_tlb_info *info;
+
+	if (local_tlb_gen == mm_tlb_gen)
+		return;
+
+	preempt_disable();
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);
+	flush_tlb_func(info);
+	preempt_enable();
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 09/14] mm: introduce relaxed TLB flushes
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (7 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Introduce the concept of strict and relaxed TLB flushes. Relaxed TLB
flushes are TLB flushes that can be skipped, at the cost of possibly
degraded performance. It is up to arch code (in the next patches) to
handle relaxed flushes correctly; one such behavior is flushing the local
TLB eagerly and remote TLBs lazily.

Track whether a flush is strict in the mmu_gather struct and introduce
the required constants for tracking.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h |  41 ++++++------
 include/asm-generic/tlb.h       | 114 ++++++++++++++++++--------------
 include/linux/mm_types.h        |   6 ++
 mm/huge_memory.c                |   7 +-
 mm/hugetlb.c                    |   2 +-
 mm/mmu_gather.c                 |   1 +
 mm/mprotect.c                   |   8 ++-
 mm/rmap.c                       |   2 +-
 8 files changed, 107 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..77d4810e5a5d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,7 +259,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
-static inline bool pte_flags_need_flush(unsigned long oldflags,
+static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 					unsigned long newflags,
 					bool ignore_access)
 {
@@ -290,71 +290,72 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
 		diff &= ~_PAGE_ACCESSED;
 
 	/*
-	 * Did any of the 'flush_on_clear' flags was clleared set from between
-	 * 'oldflags' and 'newflags'?
+	 * Were any of the 'flush_on_clear' flags cleared between 'oldflags'
+	 * and 'newflags'?
 	 */
 	if (diff & oldflags & flush_on_clear)
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/* Flush on modified flags. */
 	if (diff & flush_on_change)
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
 	    (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
-		return true;
+		return PTE_FLUSH_STRICT;
 	}
 
-	return false;
+	return PTE_FLUSH_NONE;
 }
 
 /*
- * pte_needs_flush() checks whether permissions were demoted and require a
- * flush. It should only be used for userspace PTEs.
+ * pte_flush_type() checks whether permissions were demoted or promoted and
+ * whether a strict or relaxed TLB flush is needed. It should only be used on
+ * userspace PTEs.
  */
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pte_flags(oldpte) & _PAGE_PRESENT))
-		return false;
+		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
 	if (pte_pfn(oldpte) != pte_pfn(newpte))
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/*
 	 * check PTE flags; ignore access-bit; see comment in
 	 * ptep_clear_flush_young().
 	 */
-	return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+	return pte_flags_flush_type(pte_flags(oldpte), pte_flags(newpte),
 				    true);
 }
-#define pte_needs_flush pte_needs_flush
+#define pte_flush_type pte_flush_type
 
 /*
- * huge_pmd_needs_flush() checks whether permissions were demoted and require a
+ * huge_pmd_flush_type() checks whether permissions were demoted and require a
  * flush. It should only be used for userspace huge PMDs.
  */
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
-		return false;
+		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
 	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/*
 	 * check PMD flags; do not ignore access-bit; see
 	 * pmdp_clear_flush_young().
 	 */
-	return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd),
+	return pte_flags_flush_type(pmd_flags(oldpmd), pmd_flags(newpmd),
 				    false);
 }
-#define huge_pmd_needs_flush huge_pmd_needs_flush
+#define huge_pmd_flush_type huge_pmd_flush_type
 
 #endif /* !MODULE */
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ff3e82553a76..07b3eb8caf63 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -289,6 +289,11 @@ struct mmu_gather {
 	unsigned int		vma_exec : 1;
 	unsigned int		vma_huge : 1;
 
+	/*
+	 * whether we made flushing strict (added protection) or changed
+	 * mappings.
+	 */
+	unsigned int		strict : 1;
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -325,6 +330,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	tlb->cleared_pmds = 0;
 	tlb->cleared_puds = 0;
 	tlb->cleared_p4ds = 0;
+	tlb->strict = 0;
 	/*
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
@@ -518,31 +524,43 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
  * and set corresponding cleared_*.
  */
 static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_pmds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_puds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_p4ds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 #ifndef __tlb_remove_tlb_entry
@@ -556,24 +574,24 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
  * so we can later optimise away the tlb invalidate.   This helps when
  * userspace is unmapping already-unmapped pages, which happens quite a lot.
  */
-#define tlb_remove_tlb_entry(tlb, ptep, address)		\
-	do {							\
-		tlb_flush_pte_range(tlb, address, PAGE_SIZE);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+#define tlb_remove_tlb_entry(tlb, ptep, address)			\
+	do {								\
+		tlb_flush_pte_range(tlb, address, PAGE_SIZE, true);	\
+		__tlb_remove_tlb_entry(tlb, ptep, address);		\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
-	do {							\
-		unsigned long _sz = huge_page_size(h);		\
-		if (_sz >= P4D_SIZE)				\
-			tlb_flush_p4d_range(tlb, address, _sz);	\
-		else if (_sz >= PUD_SIZE)			\
-			tlb_flush_pud_range(tlb, address, _sz);	\
-		else if (_sz >= PMD_SIZE)			\
-			tlb_flush_pmd_range(tlb, address, _sz);	\
-		else						\
-			tlb_flush_pte_range(tlb, address, _sz);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)		\
+	do {								\
+		unsigned long _sz = huge_page_size(h);			\
+		if (_sz >= P4D_SIZE)					\
+			tlb_flush_p4d_range(tlb, address, _sz, true);	\
+		else if (_sz >= PUD_SIZE)				\
+			tlb_flush_pud_range(tlb, address, _sz, true);	\
+		else if (_sz >= PMD_SIZE)				\
+			tlb_flush_pmd_range(tlb, address, _sz, true);	\
+		else							\
+			tlb_flush_pte_range(tlb, address, _sz, true);	\
+		__tlb_remove_tlb_entry(tlb, ptep, address);		\
 	} while (0)
 
 /**
@@ -586,7 +604,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)			\
 	do {								\
-		tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE);	\
+		tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE, true);\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);		\
 	} while (0)
 
@@ -600,7 +618,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 
 #define tlb_remove_pud_tlb_entry(tlb, pudp, address)			\
 	do {								\
-		tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE);	\
+		tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE, true);\
 		__tlb_remove_pud_tlb_entry(tlb, pudp, address);		\
 	} while (0)
 
@@ -623,52 +641,52 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
  */
 
 #ifndef pte_free_tlb
-#define pte_free_tlb(tlb, ptep, address)			\
-	do {							\
-		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pte_free_tlb(tlb, ptep, address);		\
+#define pte_free_tlb(tlb, ptep, address)				\
+	do {								\
+		tlb_flush_pmd_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pte_free_tlb(tlb, ptep, address);			\
 	} while (0)
 #endif
 
 #ifndef pmd_free_tlb
-#define pmd_free_tlb(tlb, pmdp, address)			\
-	do {							\
-		tlb_flush_pud_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pmd_free_tlb(tlb, pmdp, address);		\
+#define pmd_free_tlb(tlb, pmdp, address)				\
+	do {								\
+		tlb_flush_pud_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pmd_free_tlb(tlb, pmdp, address);			\
 	} while (0)
 #endif
 
 #ifndef pud_free_tlb
-#define pud_free_tlb(tlb, pudp, address)			\
-	do {							\
-		tlb_flush_p4d_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pud_free_tlb(tlb, pudp, address);		\
+#define pud_free_tlb(tlb, pudp, address)				\
+	do {								\
+		tlb_flush_p4d_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pud_free_tlb(tlb, pudp, address);			\
 	} while (0)
 #endif
 
 #ifndef p4d_free_tlb
-#define p4d_free_tlb(tlb, pudp, address)			\
-	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__p4d_free_tlb(tlb, pudp, address);		\
+#define p4d_free_tlb(tlb, pudp, address)				\
+	do {								\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		tlb->freed_tables = 1;					\
+		__p4d_free_tlb(tlb, pudp, address);			\
 	} while (0)
 #endif
 
-#ifndef pte_needs_flush
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+#ifndef pte_flush_type
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
 {
-	return true;
+	return PTE_FLUSH_STRICT;
 }
 #endif
 
-#ifndef huge_pmd_needs_flush
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+#ifndef huge_pmd_flush_type
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
 {
-	return true;
+	return PTE_FLUSH_STRICT;
 }
 #endif
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b961a29bf26..8825f1314a28 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -698,6 +698,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
+enum pte_flush_type {
+	PTE_FLUSH_NONE		= 0,	/* not necessary */
+	PTE_FLUSH_STRICT	= 1,	/* required */
+	PTE_FLUSH_RELAXED	= 2,	/* can cause spurious page-faults */
+};
+
 struct vm_fault;
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 60d742c33de3..09e6608a6431 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1713,6 +1713,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	enum pte_flush_type flush_type;
 
 	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 
@@ -1815,8 +1816,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	if (huge_pmd_needs_flush(oldpmd, entry))
-		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+	flush_type = huge_pmd_flush_type(oldpmd, entry);
+	if (flush_type != PTE_FLUSH_NONE)
+		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
+				    flush_type == PTE_FLUSH_STRICT);
 
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6621d3fe4991..9a667237a69a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5022,7 +5022,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			spin_unlock(ptl);
-			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
+			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE, true);
 			force_flush = true;
 			continue;
 		}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index a71924bd38c0..9a8bd2f23543 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -348,6 +348,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
 		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
 		tlb->freed_tables = 1;
+		tlb->strict = 1;
 	}
 
 	tlb_flush_mmu(tlb);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92bfb17dcb8a..ead20dc66d34 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,6 +117,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			pte_t ptent;
 			bool preserve_write = (prot_numa || try_change_writable) &&
 					       pte_write(oldpte);
+			enum pte_flush_type flush_type;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -200,8 +201,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			}
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+
+			flush_type = pte_flush_type(oldpte, ptent);
+			if (flush_type != PTE_FLUSH_NONE)
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
+						flush_type == PTE_FLUSH_STRICT);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 23997c387858..62f4b2a4f067 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
 
-			if (pte_needs_flush(oldpte, entry) ||
+			if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
 			    mm_tlb_flush_pending(vma->vm_mm))
 				flush_tlb_page(vma, address);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (8 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Introduce relaxed TLB flushes on x86. When protection is removed from
PTEs (i.e., PTEs become writable or executable), relaxed TLB flushes are
used. Relaxed TLB flushes do flush the local TLB, but do not flush remote
TLBs.

If later a spurious page-fault is encountered, and the local TLB
generation is found to be out of sync with the mm's TLB generation, a
full TLB flush takes place to prevent further spurious page-faults from
occurring.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlb.h      | 3 ++-
 arch/x86/include/asm/tlbflush.h | 9 +++++----
 arch/x86/kernel/alternative.c   | 2 +-
 arch/x86/kernel/ldt.c           | 3 ++-
 arch/x86/mm/tlb.c               | 4 ++--
 5 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..51c85136f9a8 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables,
+			   tlb->strict);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 77d4810e5a5d..230cd1d24fe6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,23 +220,24 @@ void flush_tlb_multi(const struct cpumask *cpumask,
 #endif
 
 #define flush_tlb_mm(mm)						\
-		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true, true)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, false)
+				: PAGE_SHIFT, false, true)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
+				bool freed_tables, bool strict);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e257f6c80372..48945a47fd76 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1099,7 +1099,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
-			   PAGE_SHIFT, false);
+			   PAGE_SHIFT, false, true);
 
 	if (func == text_poke_memcpy) {
 		/*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..7c7bc97324bc 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -372,7 +372,8 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 	}
 
 	va = (unsigned long)ldt_slot_va(ldt->slot);
-	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 #else /* !CONFIG_PAGE_TABLE_ISOLATION */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ff3bcc55435e..ec5033d28a97 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -974,7 +974,7 @@ void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, bool strict)
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
@@ -1000,7 +1000,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (strict && cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		flush_tlb_multi(mm_cpumask(mm), info);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (9 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When checking x86 PTE flags to determine whether a TLB flush is needed,
also determine whether a relaxed TLB flush is sufficient. If permissions
are added (NX cleared or W set), indicate that a relaxed TLB flush would
suffice.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 230cd1d24fe6..4f98735ab07a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,18 +271,23 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED;
+					_PAGE_ACCESSED | _PAGE_RW;
+	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_set_relaxed = _PAGE_RW;
+	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 					_PAGE_SOFTW3 | _PAGE_SOFTW4;
-	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
+	const pteval_t flush_on_change = _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3 | _PAGE_NX;
+			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3;
 	unsigned long diff = oldflags ^ newflags;
 
 	BUILD_BUG_ON(flush_on_clear & software_flags);
 	BUILD_BUG_ON(flush_on_clear & flush_on_change);
 	BUILD_BUG_ON(flush_on_change & software_flags);
+	BUILD_BUG_ON(flush_on_change & flush_on_clear_relaxed);
+	BUILD_BUG_ON(flush_on_change & flush_on_set_relaxed);
 
 	/* Ignore software flags */
 	diff &= ~software_flags;
@@ -301,9 +306,16 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
 
+	if (diff & oldflags & flush_on_clear_relaxed)
+		return PTE_FLUSH_RELAXED;
+
+	if (diff & newflags & flush_on_set_relaxed)
+		return PTE_FLUSH_RELAXED;
+
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
+	    (diff & ~(flush_on_clear | flush_on_set |
+		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (10 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

On x86 it is possible to skip a TLB flush when a RW entry becomes RO and
the PTE is clean. Add logic to detect this case and skip the flush.
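
The assumption here, spelled out as an illustration (not part of the
diff): a write through a stale writable TLB entry must first set the
dirty bit, which requires a page-table walk that will observe the now
read-only PTE and fault instead. A dirty RW->RO transition, in contrast,
still needs a strict flush:

	/* hypothetical flag values showing the resulting classification */
	pteval_t oldflags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_RW;
	pteval_t newflags = oldflags & ~_PAGE_RW;

	/* clean RW->RO: pte_flags_flush_type() returns PTE_FLUSH_NONE */

	oldflags |= _PAGE_DIRTY;
	newflags |= _PAGE_DIRTY;

	/* dirty RW->RO: pte_flags_flush_type() returns PTE_FLUSH_STRICT */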

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f98735ab07a..58c95e36b098 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,8 +271,9 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED | _PAGE_RW;
+					_PAGE_ACCESSED;
 	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_special = _PAGE_RW;
 	const pteval_t flush_on_set_relaxed = _PAGE_RW;
 	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
@@ -302,6 +303,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & oldflags & flush_on_clear)
 		return PTE_FLUSH_STRICT;
 
+	/*
+	 * Were any of the 'flush_on_set' flags set between 'oldflags' and
+	 * 'newflags'?
+	 */
+	if (diff & newflags & flush_on_set)
+		return PTE_FLUSH_STRICT;
+
+	/* On RW->RO, a flush is needed if the old entry is dirty */
+	if ((diff & oldflags & _PAGE_RW) && (oldflags & _PAGE_DIRTY))
+		return PTE_FLUSH_STRICT;
+
 	/* Flush on modified flags. */
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
@@ -314,7 +326,7 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | flush_on_set |
+	    (diff & ~(flush_on_clear | flush_on_set | flush_on_special |
 		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (11 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Once it has been determined that a strict TLB flush is needed, it is
likely that other PTEs would also need a strict TLB flush, and there is
little benefit in not extending the range that is flushed.

Skip the check of which TLB flush type is needed if it was already
determined that a strict flush is needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/huge_memory.c | 4 +++-
 mm/mprotect.c    | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09e6608a6431..b32b7da0f6f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1816,7 +1816,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	flush_type = huge_pmd_flush_type(oldpmd, entry);
+	flush_type = PTE_FLUSH_STRICT;
+	if (!tlb->strict)
+		flush_type = huge_pmd_flush_type(oldpmd, entry);
 	if (flush_type != PTE_FLUSH_NONE)
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
 				    flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ead20dc66d34..cf775f6c8c08 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -202,7 +202,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 
-			flush_type = pte_flush_type(oldpte, ptent);
+			flush_type = PTE_FLUSH_STRICT;
+			if (!tlb->strict)
+				flush_type = pte_flush_type(oldpte, ptent);
 			if (flush_type != PTE_FLUSH_NONE)
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
 						flush_type == PTE_FLUSH_STRICT);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (12 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Checking whether the PFNs in two PTEs are the same takes a surprisingly
large number of instructions. Yet, in most cases the caller of
pte_flush_type() already knows whether the PFN was changed. For instance,
mprotect() does not change the PFN, but only modifies the protection
flags.

Add an argument to pte_flush_type() to indicate whether the PFN should be
checked. Keep checking it in mm-debug builds to catch callers that wrongly
assume the PFN is unchanged.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 14 ++++++++++----
 include/asm-generic/tlb.h       |  6 ++++--
 mm/huge_memory.c                |  2 +-
 mm/mprotect.c                   |  2 +-
 mm/rmap.c                       |  2 +-
 5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 58c95e36b098..50349861fdc9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -340,14 +340,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
  * whether a strict or relaxed TLB flush is need. It should only be used on
  * userspace PTEs.
  */
-static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pte_flags(oldpte) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pte_pfn(oldpte) != pte_pfn(newpte))
+	if (!check_pfn)
+		VM_BUG_ON(pte_pfn(oldpte) != pte_pfn(newpte));
+	else if (pte_pfn(oldpte) != pte_pfn(newpte))
 		return PTE_FLUSH_STRICT;
 
 	/*
@@ -363,14 +366,17 @@ static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
  * huge_pmd_flush_type() checks whether permissions were demoted and require a
  * flush. It should only be used for userspace huge PMDs.
  */
-static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+						      bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+	if (!check_pfn)
+		VM_BUG_ON(pmd_pfn(oldpmd) != pmd_pfn(newpmd));
+	else if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
 		return PTE_FLUSH_STRICT;
 
 	/*
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 07b3eb8caf63..aee9da6cc5d5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -677,14 +677,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #endif
 
 #ifndef pte_flush_type
-static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
 #endif
 
 #ifndef huge_pmd_flush_type
-static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+				       bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b32b7da0f6f7..92a7b3ca317f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1818,7 +1818,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 	flush_type = PTE_FLUSH_STRICT;
 	if (!tlb->strict)
-		flush_type = huge_pmd_flush_type(oldpmd, entry);
+		flush_type = huge_pmd_flush_type(oldpmd, entry, false);
 	if (flush_type != PTE_FLUSH_NONE)
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
 				    flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cf775f6c8c08..78081d7f4edf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -204,7 +204,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			flush_type = PTE_FLUSH_STRICT;
 			if (!tlb->strict)
-				flush_type = pte_flush_type(oldpte, ptent);
+				flush_type = pte_flush_type(oldpte, ptent, false);
 			if (flush_type != PTE_FLUSH_NONE)
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
 						flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/rmap.c b/mm/rmap.c
index 62f4b2a4f067..63261619b607 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
 
-			if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
+			if (pte_flush_type(oldpte, entry, false) != PTE_FLUSH_NONE ||
 			    mm_tlb_flush_pending(vma->vm_mm))
 				flush_tlb_page(vma, address);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:39     ` David Hildenbrand
  2022-07-20  9:42   ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-19 20:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without going triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause a >500 cycles overhead when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);

For the uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
internal flags used with or without the new feature bit set.  It means that
even with !ACCESS_HINT we'll start to set the young bit while we used not
to?  Is that some kind of a light ABI change?

I'd suggest we only set will_need if ACCESS_HINT is set.
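
Something like this, I mean (a rough sketch; UFFD_FLAGS_ACCESS_HINT is a
placeholder name for however the new feature bit ends up being carried in
uffd_flags):

	/* UFFD_FLAGS_ACCESS_HINT: placeholder for the new feature bit */
	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
	if ((uffd_flags & UFFD_FLAGS_ACCESS_HINT) &&
	    (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY)))
		cp_flags |= MM_CP_WILL_NEED;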

> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);

Can we make this unconditional?  IOW to cover both:

  (1) When will_need is not set, or
  (2) mprotect() too

David's patch is good in that we merged the unprotect and CoW.  However
that's not complete because the dirty bit ops are missing.

Here IMHO we should have a standalone patch that just adds the dirty bit
into this logic when we grant the write bit.  IMHO it'll make the
write+dirty bits coherent again in all paths.
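
IOW, roughly something like this (untested sketch of the suggestion, on
top of this patch):

	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
	    !pte_write(ptent) &&
	    can_change_pte_writable(vma, addr, ptent)) {
		ptent = pte_mkwrite(ptent);
		/* dirty whenever the write bit is granted, not only for will_need */
		ptent = pte_mkdirty(ptent);
	}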

> +			}
>  
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  			if (pte_needs_flush(oldpte, ptent))
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 954c6980b29f..e0492f5f06a0 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
>  	struct vm_area_struct *dst_vma;
>  	unsigned long page_mask;
> +	unsigned long cp_flags;
>  	struct mmu_gather tlb;
>  	pgprot_t newprot;
>  	int err;
> @@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	else
>  		newprot = vm_get_page_prot(dst_vma->vm_flags);
>  
> +	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
> +	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
> +		cp_flags |= MM_CP_WILL_NEED;
> +
>  	tlb_gather_mmu(&tlb, dst_mm);
> -	change_protection(&tlb, dst_vma, start, start + len, newprot,
> -			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
> +	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
>  	tlb_finish_mmu(&tlb);
>  
>  	err = 0;
> -- 
> 2.25.1
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-19 20:49   ` Peter Xu
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Xu @ 2022-07-19 20:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:02:00AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When using userfaultfd write-(un)protect ioctl, try to change the PTE to
> be writable. This would save a page-fault afterwards.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:39     ` David Hildenbrand
  2022-07-20 13:10       ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:39 UTC (permalink / raw)
  To: Peter Xu, Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 19.07.22 22:47, Peter Xu wrote:
> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without going triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause a >500 cycles overhead when the
>> PTE is first accessed.
>>
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>>  include/linux/mm.h | 2 ++
>>  mm/mprotect.c      | 9 ++++++++-
>>  mm/userfaultfd.c   | 8 ++++++--
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>  /* Whether this change is for write protecting */
>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>  					    MM_CP_UFFD_WP_RESOLVE)
>>  
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>  
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  				ptent = pte_clear_uffd_wp(ptent);
>>  			}
>>  
>> +			if (will_need)
>> +				ptent = pte_mkyoung(ptent);
> 
> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> internal flags used with or without the new feature bit set.  It means even
> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> that some kind of a light abi change?
> 
> I'd suggest we only set will_need if ACCESS_HINT is set.
> 
>> +
>>  			/*
>>  			 * In some writable, shared mappings, we might want
>>  			 * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  			 */
>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>  			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>  				ptent = pte_mkwrite(ptent);
>> +				if (will_need)
>> +					ptent = pte_mkdirty(ptent);
> 
> Can we make this unconditional?  IOW to cover both:
> 
>   (1) When will_need is not set, or
>   (2) mprotect() too
> 
> David's patch is good in that we merged the unprotect and CoW.  However
> that's not complete because the dirty bit ops are missing.
> 
> Here IMHO we should have a standalone patch to just add the dirty bit into
> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> bits coherent again in all paths.

I'm not sure I follow.

We *surely* don't want to dirty random pages (especially once in the
pagecache/swapcache) simply because we change protection.

Just like we don't set all pages write+dirty in a writable VMA on a read
fault.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:42   ` David Hildenbrand
  2022-07-20 17:36     ` Nadav Amit
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:42 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:01, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without going triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause a >500 cycles overhead when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);
> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&


Why would we want to check if we can set something writable if it
already *is* writable? That doesn't make sense to me.

> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);
> +			}

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:39     ` David Hildenbrand
@ 2022-07-20 13:10       ` Peter Xu
  2022-07-20 15:10         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 13:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
> On 19.07.22 22:47, Peter Xu wrote:
> > On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> When userfaultfd makes a PTE writable, it can now change the PTE
> >> directly, in some cases, without going triggering a page-fault first.
> >> Yet, doing so might leave the PTE that was write-unprotected as old and
> >> clean. At least on x86, this would cause a >500 cycles overhead when the
> >> PTE is first accessed.
> >>
> >> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> >> gets a hint that the page is likely to be used. Avoid changing the PTE
> >> to young and dirty in other cases to avoid excessive writeback and
> >> messing with the page reclamation logic.
> >>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Yu Zhao <yuzhao@google.com>
> >> Cc: Nick Piggin <npiggin@gmail.com>
> >> ---
> >>  include/linux/mm.h | 2 ++
> >>  mm/mprotect.c      | 9 ++++++++-
> >>  mm/userfaultfd.c   | 8 ++++++--
> >>  3 files changed, 16 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 9cc02a7e503b..4afd75ce5875 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> >>  /* Whether this change is for write protecting */
> >>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
> >>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> >> +/* Whether to try to mark entries as dirty as they are to be written */
> >> +#define  MM_CP_WILL_NEED		   (1UL << 4)
> >>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> >>  					    MM_CP_UFFD_WP_RESOLVE)
> >>  
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index 996a97e213ad..34c2dfb68c42 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> >>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> >>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> >> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
> >>  
> >>  	tlb_change_page_size(tlb, PAGE_SIZE);
> >>  
> >> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  				ptent = pte_clear_uffd_wp(ptent);
> >>  			}
> >>  
> >> +			if (will_need)
> >> +				ptent = pte_mkyoung(ptent);
> > 
> > For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> > internal flags used with or without the new feature bit set.  It means even
> > with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> > that some kind of a light abi change?
> > 
> > I'd suggest we only set will_need if ACCESS_HINT is set.
> > 
> >> +
> >>  			/*
> >>  			 * In some writable, shared mappings, we might want
> >>  			 * to catch actual write access -- see
> >> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  			 */
> >>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >>  			    !pte_write(ptent) &&
> >> -			    can_change_pte_writable(vma, addr, ptent))
> >> +			    can_change_pte_writable(vma, addr, ptent)) {
> >>  				ptent = pte_mkwrite(ptent);
> >> +				if (will_need)
> >> +					ptent = pte_mkdirty(ptent);
> > 
> > Can we make this unconditional?  IOW to cover both:
> > 
> >   (1) When will_need is not set, or
> >   (2) mprotect() too
> > 
> > David's patch is good in that we merged the unprotect and CoW.  However
> > that's not complete because the dirty bit ops are missing.
> > 
> > Here IMHO we should have a standalone patch to just add the dirty bit into
> > this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> > bits coherent again in all paths.
> 
> I'm not sure I follow.
> 
> We *surely* don't want to dirty random pages (especially once in the
> pagecache/swapcache) simply because we change protection.
> 
> Just like we don't set all pages write+dirty in a writable VMA on a read
> fault.

IMO unprotect (in generic mprotect form or uffd form) is a stronger sign
of the page being written than read faults are, as many of them happen
because the page is being written and a message was generated.

But yeah you have a point too, maybe we shouldn't assume such a condition.
Especially as long as we won't set write=1 without soft-dirty tracking
enabled, I think it should be safe.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 13:10       ` Peter Xu
@ 2022-07-20 15:10         ` David Hildenbrand
  2022-07-20 19:15           ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 15:10, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
>> On 19.07.22 22:47, Peter Xu wrote:
>>> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>>
>>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>>> directly, in some cases, without going triggering a page-fault first.
>>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>>> PTE is first accessed.
>>>>
>>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>>> to young and dirty in other cases to avoid excessive writeback and
>>>> messing with the page reclamation logic.
>>>>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Will Deacon <will@kernel.org>
>>>> Cc: Yu Zhao <yuzhao@google.com>
>>>> Cc: Nick Piggin <npiggin@gmail.com>
>>>> ---
>>>>  include/linux/mm.h | 2 ++
>>>>  mm/mprotect.c      | 9 ++++++++-
>>>>  mm/userfaultfd.c   | 8 ++++++--
>>>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 9cc02a7e503b..4afd75ce5875 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>  /* Whether this change is for write protecting */
>>>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>>  					    MM_CP_UFFD_WP_RESOLVE)
>>>>  
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 996a97e213ad..34c2dfb68c42 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>>  
>>>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>>>  
>>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  				ptent = pte_clear_uffd_wp(ptent);
>>>>  			}
>>>>  
>>>> +			if (will_need)
>>>> +				ptent = pte_mkyoung(ptent);
>>>
>>> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
>>> internal flags used with or without the new feature bit set.  It means even
>>> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
>>> that some kind of a light abi change?
>>>
>>> I'd suggest we only set will_need if ACCESS_HINT is set.
>>>
>>>> +
>>>>  			/*
>>>>  			 * In some writable, shared mappings, we might want
>>>>  			 * to catch actual write access -- see
>>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  			 */
>>>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>>  			    !pte_write(ptent) &&
>>>> -			    can_change_pte_writable(vma, addr, ptent))
>>>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>>>  				ptent = pte_mkwrite(ptent);
>>>> +				if (will_need)
>>>> +					ptent = pte_mkdirty(ptent);
>>>
>>> Can we make this unconditional?  IOW to cover both:
>>>
>>>   (1) When will_need is not set, or
>>>   (2) mprotect() too
>>>
>>> David's patch is good in that we merged the unprotect and CoW.  However
>>> that's not complete because the dirty bit ops are missing.
>>>
>>> Here IMHO we should have a standalone patch to just add the dirty bit into
>>> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
>>> bits coherent again in all paths.
>>
>> I'm not sure I follow.
>>
>> We *surely* don't want to dirty random pages (especially once in the
>> pagecache/swapcache) simply because we change protection.
>>
>> Just like we don't set all pages write+dirty in a writable VMA on a read
>> fault.
> 
> IMO unmprotect (in generic mprotect form or uffd form) has a stronger sign
> of page being written, unlike read faults, as many of them happen because
> page being written and message generated.

I'm sorry, but I am very skeptical about such statements. I don't buy it.

> 
> But yeah you have a point too, maybe we shouldn't assume such a condition.
> Especially as long as we won't set write=1 without soft-dirty tracking
> enabled, I think it should be safe.

For pagecache pages it may as well be *plain wrong* to bypass the write
fault handler and simply mark pages dirty+map them writable.

Please, let's keep this protection change handler here as simple as
possible and *not* try to replicate the whole write fault handler logic
in here by relying on statements as above.

If we try optimizing for corner cases by making the implementation here
overly complicated, then we are clearly doing something wrong.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-20 15:19   ` David Hildenbrand
  2022-07-20 17:25     ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:19 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:02, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Anonymous pages might have the dirty bit clear, but this should not
> prevent mprotect from making them writable if they are exclusive.
> Therefore, skip the test whether the page is dirty in this case.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  mm/mprotect.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 34c2dfb68c42..da5b9bf8204f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  
>  	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>  
> -	if (pte_protnone(pte) || !pte_dirty(pte))
> +	if (pte_protnone(pte))
>  		return false;
>  
>  	/* Do we need write faults for softdirty tracking? */
> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  		page = vm_normal_page(vma, addr, pte);
>  		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>  			return false;
> -	}
> +	} else if (!pte_dirty(pte))
> +		return false;
>  
>  	return true;
>  }

When I wrote that code, I was wondering how often that would actually
happen in practice -- and if we care about optimizing that. Do you have
a gut feeling in which scenarios this would happen and if we care?

If the page is in the swapcache and was swapped out, you'd be requiring
a writeback even though nobody modified the page and possibly nobody is
going to do so in the near future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 15:19   ` David Hildenbrand
@ 2022-07-20 17:25     ` Nadav Amit
  2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 17:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:02, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Anonymous pages might have the dirty bit clear, but this should not
>> prevent mprotect from making them writable if they are exclusive.
>> Therefore, skip the test whether the page is dirty in this case.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> mm/mprotect.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 34c2dfb68c42..da5b9bf8204f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 
>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>> 
>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>> +	if (pte_protnone(pte))
>> 		return false;
>> 
>> 	/* Do we need write faults for softdirty tracking? */
>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 		page = vm_normal_page(vma, addr, pte);
>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> 			return false;
>> -	}
>> +	} else if (!pte_dirty(pte))
>> +		return false;
>> 
>> 	return true;
>> }
> 
> When I wrote that code, I was wondering how often that would actually
> happen in practice -- and if we care about optimizing that. Do you have
> a gut feeling in which scenarios this would happen and if we care?
> 
> If the page is in the swapcache and was swapped out, you'd be requiring
> a writeback even though nobody modified the page and possibly isn't
> going to do so in the near future.

So here is my due diligence: I did not really encounter a scenario in which
it showed up. When I looked at your code, I assumed this was an oversight
and not a thoughtful decision. For me the issue is more of the discrepancy
between how a certain page is handled before and after it was paged out.

The way that I see it, there is a tradeoff in the way the dirty-bit should
be handled:
(1) Writable-clean PTEs introduce some non-negligible overhead.
(2) Marking a PTE dirty speculatively would require a write back.

… But this tradeoff should not affect whether a PTE is writable, i.e.,
mapping the PTE as writable-clean should not cause a writeback. In other
words, if you are concerned about unnecessary writebacks, which I think is a
fair concern, then do not set the dirty-bit. In a later patch I try to avoid
TLB flushes on clean-writable entries that are write-protected.

So I do not think that the writeback you mentioned should be a real issue.
Yet if you think that using the fact that the page is not dirty is a good
heuristic to avoid future TLB flushes (for P->NP; as I said there is a
solution for RW->RO), or if you are concerned about the cost of
vm_normal_page(), perhaps those are valid concerns (although I do not think
so).

--

[ Regarding (1): After some discussions with Peter and reading more code, I
thought at some point that perhaps avoiding writable-clean PTEs as much as
possible makes sense [*], since setting the dirty-bit costs ~550 cycles and
a page fault is not a lot more than 1000. But with all the mitigations (and
after adding IBRS for retbleed) page-fault entry is kind of expensive.

[*] At least on x86 ]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:42   ` David Hildenbrand
@ 2022-07-20 17:36     ` Nadav Amit
  2022-07-20 18:00       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 17:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:01, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without going triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause a >500 cycles overhead when the
>> PTE is first accessed.
>> 
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>> include/linux/mm.h | 2 ++
>> mm/mprotect.c | 9 ++++++++-
>> mm/userfaultfd.c | 8 ++++++--
>> 3 files changed, 16 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>> /* Whether this change is for write protecting */
>> #define MM_CP_UFFD_WP (1UL << 2) /* do wp */
>> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define MM_CP_WILL_NEED (1UL << 4)
>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>> MM_CP_UFFD_WP_RESOLVE)
>> 
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> + bool will_need = cp_flags & MM_CP_WILL_NEED;
>> 
>> tlb_change_page_size(tlb, PAGE_SIZE);
>> 
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> ptent = pte_clear_uffd_wp(ptent);
>> }
>> 
>> + if (will_need)
>> + ptent = pte_mkyoung(ptent);
>> +
>> /*
>> * In some writable, shared mappings, we might want
>> * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> */
>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> !pte_write(ptent) &&
> 
> 
> Why would we want to check if we can set something writable if it
> already *is* writable? That doesn't make sense to me.

We check !pte_write(). What am I missing in your question?

Having said that, I do notice now that pte_mkdirty() should not be done
only when this condition is fulfilled. Instead we should just have
something like:

                       if (will_need) {
                               ptent = pte_mkyoung(ptent);
                               if (pte_write(ptent))
                                       ptent = pte_mkdirty(ptent);
                       }

But I do not think this answers your question, which I did not understand.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 17:36     ` Nadav Amit
@ 2022-07-20 18:00       ` David Hildenbrand
  2022-07-20 18:09         ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:36, Nadav Amit wrote:
> On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:01, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>> directly, in some cases, without going triggering a page-fault first.
>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>> PTE is first accessed.
>>>
>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>> to young and dirty in other cases to avoid excessive writeback and
>>> messing with the page reclamation logic.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> ---
>>> include/linux/mm.h | 2 ++
>>> mm/mprotect.c | 9 ++++++++-
>>> mm/userfaultfd.c | 8 ++++++--
>>> 3 files changed, 16 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 9cc02a7e503b..4afd75ce5875 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>> /* Whether this change is for write protecting */
>>> #define MM_CP_UFFD_WP (1UL << 2) /* do wp */
>>> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>> +#define MM_CP_WILL_NEED (1UL << 4)
>>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>>> MM_CP_UFFD_WP_RESOLVE)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 996a97e213ad..34c2dfb68c42 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> + bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>
>>> tlb_change_page_size(tlb, PAGE_SIZE);
>>>
>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> ptent = pte_clear_uffd_wp(ptent);
>>> }
>>>
>>> + if (will_need)
>>> + ptent = pte_mkyoung(ptent);
>>> +
>>> /*
>>> * In some writable, shared mappings, we might want
>>> * to catch actual write access -- see
>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> */
>>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> !pte_write(ptent) &&
>>
>>
>> Why would we want to check if we can set something writable if it
>> already *is* writable? That doesn't make sense to me.
> 
> We check !pte_write(). What am I missing in your question?

My patch review skills have seen better days. I thought you'd be
removing the pte_write() check ... :( Tired eyes ...

> 
> Having said that, I do notice now that pte_mkdirty() should not be done
> only this condition is fulfilled. Instead we should just have
> something like:
> 
>                        if (will_need) {
>                                ptent = pte_mkyoung(ptent);
>                                if (pte_write(ptent))
>                                        ptent = pte_mkdirty(ptent);
>                        }

As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
page in current code ... so I assume you would have that code before the
actual pte_mkwrite() logic, correct?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:00       ` David Hildenbrand
@ 2022-07-20 18:09         ` Nadav Amit
  2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:

> My patch review skills have seen better days. I thought you'd be
> removing the pte_write() check ... :( Tired eyes ...
> 
>> Having said that, I do notice now that pte_mkdirty() should not be done
>> only when this condition is fulfilled. Instead we should just have
>> something like:
>> 
>> if (will_need) {
>> ptent = pte_mkyoung(ptent);
>> if (pte_write(ptent))
>> ptent = pte_mkdirty(ptent);
>> }
> 
> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
> page in current code ... so I assume you would have that code before the
> actual pte_mkwrite() logic, correct?

No, I thought this should go after for 2 reasons:

1. You want to allow the PTE to be made writable following the
can_change_pte_writable().

2. You do not want to set a non-writable PTE as dirty, especially since it
might then be used to determine that the PTE can become writable. Doing so
would circumvent cases in which set_page_dirty() needs to be called and
break things.
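
To make the ordering concrete, here is a rough sketch of what I have in
mind for change_pte_range() (names are the ones from the patches above;
the exact placement is hypothetical, not the final code):

	/* decide on the write-bit first ... */
	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
	    !pte_write(ptent) &&
	    can_change_pte_writable(vma, addr, ptent))
		ptent = pte_mkwrite(ptent);

	/* ... and only then mark young/dirty, dirty only if writable */
	if (will_need) {
		ptent = pte_mkyoung(ptent);
		if (pte_write(ptent))
			ptent = pte_mkdirty(ptent);
	}

This way a PTE that stays read-only is never marked dirty behind the
back of whoever relies on the write-fault path.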


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:09         ` Nadav Amit
@ 2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 20:09, Nadav Amit wrote:
> On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> My patch review skills have seen better days. I thought you'd be
>> removing the pte_write() check ... :( Tired eyes ...
>>
>>> Having said that, I do notice now that pte_mkdirty() should not be done
>>> only when this condition is fulfilled. Instead we should just have
>>> something like:
>>>
>>> if (will_need) {
>>> ptent = pte_mkyoung(ptent);
>>> if (pte_write(ptent))
>>> ptent = pte_mkdirty(ptent);
>>> }
>>
>> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
>> page in current code ... so I assume you would have that code before the
>> actual pte_mkwrite() logic, correct?
> 
> No, I thought this should go after for 2 reasons:
> 
> 1. You want to allow the PTE to be made writable following the
> can_change_pte_writable().
> 
> 2. You do not want to set a non-writable PTE as dirty, especially since it
> might then be used to determine that the PTE can become writable. Doing so
> would circumvent cases in which set_page_dirty() needs to be called and
> break things down.

Then I'm confused how can_change_pte_writable() would ever allow for
that. Best to show me the code :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 15:10         ` David Hildenbrand
@ 2022-07-20 19:15           ` Peter Xu
  2022-07-20 19:33             ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 19:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> For pagecache pages it may as well be *plain wrong* to bypass the write
> fault handler and simply mark pages dirty+map them writable.

Could you elaborate?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:15           ` Peter Xu
@ 2022-07-20 19:33             ` David Hildenbrand
  2022-07-20 19:48               ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:15, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>> For pagecache pages it may as well be *plain wrong* to bypass the write
>> fault handler and simply mark pages dirty+map them writable.
> 
> Could you elaborate?

Write-fault handling for some filesystems (that even require this
"slow path") is a bit special.

For example, do_shared_fault() might have to call page_mkwrite().

AFAIK file systems use that for lazy allocation of disk blocks.
If you simply go ahead and map a !dirty pagecache page writable
and mark it dirty, it will not trigger page_mkwrite() and you might
end up corrupting data.
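
To illustrate, this is roughly the path that such a bypass skips. I am
sketching do_shared_fault() from memory, with locking and error handling
omitted, so don't take it as the exact mm/memory.c code:

	/* do_shared_fault(), roughly: */
	if (vma->vm_ops->page_mkwrite) {
		tmp = do_page_mkwrite(vmf);	/* FS may allocate blocks here */
		if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))))
			return tmp;
	}
	ret |= finish_fault(vmf);		/* map the page writable */
	ret |= fault_dirty_shared_page(vmf);	/* set_page_dirty() + dirty throttling */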

That's why the old change_pte_range() code never touched
anything if the pte wasn't already dirty. Because as long as it's not writable,
the FS might have to be informed about the write-unprotect.

And we end up in this case for VM_SHARED with vma_wants_writenotify(),
where we, for example, check

/* The backer wishes to know when pages are first written to? */
if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
	return 1;


Long story short, we should be really careful with write-fault handler bypasses,
especially when deciding to set dirty bits. For pagecache pages, we have to be
especially careful.

For exclusive anon pages it's mostly ok, because wp_page_reuse()
doesn't really contain that much magic. The only thing to consider for ordinary
mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
easier to guess, especially, if we un-protect only a single page.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:33             ` David Hildenbrand
@ 2022-07-20 19:48               ` Peter Xu
  2022-07-20 19:55                 ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 19:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
> On 20.07.22 21:15, Peter Xu wrote:
> > On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> >> For pagecache pages it may as well be *plain wrong* to bypass the write
> >> fault handler and simply mark pages dirty+map them writable.
> > 
> > Could you elaborate?
> 
> Write-fault handling for some filesystems (that even require this
> "slow path") is a bit special.
> 
> For example, do_shared_fault() might have to call page_mkwrite().
> 
> AFAIK file systems use that for lazy allocation of disk blocks.
> If you simply go ahead and map a !dirty pagecache page writable
> and mark it dirty, it will not trigger page_mkwrite() and you might
> end up corrupting data.
> 
> That's why the old change_pte_range() code never touched
> anything if the pte wasn't already dirty.

I don't think that pte_dirty() check was for the pagecache code. For any fs
that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
return 1, so we'll never try to add write bit, hence we'll never even try
to check pte_dirty().

> Because as long as it's not writable,
> the FS might have to be informed about the write-unprotect.
> 
> And we end up in this case for VM_SHARED with vma_wants_writenotify(),
> where we, for example, check
> 
> /* The backer wishes to know when pages are first written to? */
> if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> 	return 1;
> 
> 
> Long story short, we should be really careful with write-fault handler bypasses,
> especially when deciding to set dirty bits. For pagecache pages, we have to be
> especially careful.

Since you mentioned page_mkwrite, IMHO it's really the write bit, not the
dirty bit, that matters here (and IMHO that's why it's called page_mkwrite()
and not page_mkdirty()).  Here Nadav's patch added pte_mkdirty() only if
pte_mkwrite() happens.  So I'm a bit confused about what your worry is, and
what you're against doing.

Say, even with my original proposal to set dirty unconditionally, it'll
still be after the pte_mkwrite().  I don't see how that could affect
page_mkwrite, not to mention it'll not even reach there.

> 
> For exclusive anon pages it's mostly ok, because wp_page_reuse()
> doesn't really contain that much magic. The only thing to consider for ordinary
> mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
> write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
> easier to guess, especially, if we un-protect only a single page.

Yeh, as mentioned I think that's a valid point - looks good to me to attach
the dirty bit only when with a hint.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:48               ` Peter Xu
@ 2022-07-20 19:55                 ` David Hildenbrand
  2022-07-20 20:22                   ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:48, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>> On 20.07.22 21:15, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>> fault handler and simply mark pages dirty+map them writable.
>>>
>>> Could you elaborate?
>>
>> Write-fault handling for some filesystems (that even require this
>> "slow path") is a bit special.
>>
>> For example, do_shared_fault() might have to call page_mkwrite().
>>
>> AFAIK file systems use that for lazy allocation of disk blocks.
>> If you simply go ahead and map a !dirty pagecache page writable
>> and mark it dirty, it will not trigger page_mkwrite() and you might
>> end up corrupting data.
>>
>> That's why the old change_pte_range() code never touched
>> anything if the pte wasn't already dirty.
> 
> I don't think that pte_dirty() check was for the pagecache code. For any fs
> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
> return 1, so we'll never try to add write bit, hence we'll never even try
> to check pte_dirty().
> 

I might be too tired, but the whole reason we had this magic before my
commit in place was only for the pagecache.

With vma_wants_writenotify()=0 you can directly map the pages writable
and don't have to do these advanced checks here. In a writable
MAP_SHARED VMA you'll already have pte_write().

We only get !pte_write() in case we have vma_wants_writenotify()=1 ...

  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);

and that's the code that checked the dirty bit after all to decide --
amongst other things -- if we can simply map it writable without going
via the write fault handler and triggering do_shared_fault() .
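
For reference, the relevant checks in vma_wants_writenotify() are roughly
the following (trimmed and quoted from memory, so the exact conditions may
differ):

	/* If it was private or non-writable, the write bit is already clear */
	if ((vm_flags & (VM_WRITE|VM_SHARED)) != (VM_WRITE|VM_SHARED))
		return 0;

	/* The backer wishes to know when pages are first written to? */
	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
		return 1;

	/* Do we need to track softdirty? */
	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
		return 1;

	/* Specialty mapping? */
	if (vm_flags & VM_PFNMAP)
		return 0;

	/* Can the mapping track the dirty pages? */
	return vma->vm_file && vma->vm_file->f_mapping &&
	       mapping_can_writeback(vma->vm_file->f_mapping);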

See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.

But yeah, it's all confusing so I might just be wrong regarding
pagecache pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:55                 ` David Hildenbrand
@ 2022-07-20 20:22                   ` Nadav Amit
  2022-07-20 20:38                     ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 20:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 21:48, Peter Xu wrote:
>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>> On 20.07.22 21:15, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>> fault handler and simply mark pages dirty+map them writable.
>>>> 
>>>> Could you elaborate?
>>> 
>>> Write-fault handling for some filesystems (that even require this
>>> "slow path") is a bit special.
>>> 
>>> For example, do_shared_fault() might have to call page_mkwrite().
>>> 
>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>> If you simply go ahead and map a !dirty pagecache page writable
>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>> end up corrupting data.
>>> 
>>> That's why the old change_pte_range() code never touched
>>> anything if the pte wasn't already dirty.
>> 
>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>> return 1, so we'll never try to add write bit, hence we'll never even try
>> to check pte_dirty().
> 
> I might be too tired, but the whole reason we had this magic before my
> commit in place was only for the pagecache.
> 
> With vma_wants_writenotify()=0 you can directly map the pages writable
> and don't have to do these advanced checks here. In a writable
> MAP_SHARED VMA you'll already have pte_write().
> 
> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
> 
>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
> 
> and that's the code that checked the dirty bit after all to decide --
> amongst other things -- if we can simply map it writable without going
> via the write fault handler and triggering do_shared_fault() .
> 
> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.

I thought you want to get rid of it at least for anonymous pages. No?

> 
> But yeah, it's all confusing so I might just be wrong regarding
> pagecache pages.

Just to note: I am not very courageous and I did not intend to change the
condition for when non-anonymous pages are set as writable. That’s the
reason I did not change the dirty bit for non-writable non-anonymous entries
(as Peter said). And that’s the reason that setting the dirty bit (at least
as I should have done it) is only performed after we made the decision on the
write-bit.

IOW, after you made your decision about the write-bit, then and only then
you may be able to set the dirty bit for writable entries. Since the entry
is already writeable (i.e., can be written without a fault later directly
from userspace), there should be no concern of correctness when you set it.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:22                   ` Nadav Amit
@ 2022-07-20 20:38                     ` David Hildenbrand
  2022-07-20 20:56                       ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 20:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 20.07.22 22:22, Nadav Amit wrote:
> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 20.07.22 21:48, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>
>>>>> Could you elaborate?
>>>>
>>>> Write-fault handling for some filesystems (that even require this
>>>> "slow path") is a bit special.
>>>>
>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>
>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>> end up corrupting data.
>>>>
>>>> That's why the old change_pte_range() code never touched
>>>> anything if the pte wasn't already dirty.
>>>
>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>> to check pte_dirty().
>>
>> I might be too tired, but the whole reason we had this magic before my
>> commit in place was only for the pagecache.
>>
>> With vma_wants_writenotify()=0 you can directly map the pages writable
>> and don't have to do these advanced checks here. In a writable
>> MAP_SHARED VMA you'll already have pte_write().
>>
>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>
>>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>
>> and that's the code that checked the dirty bit after all to decide --
>> amongst other things -- if we can simply map it writable without going
>> via the write fault handler and triggering do_shared_fault() .
>>
>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
> 
> I thought you want to get rid of it at least for anonymous pages. No?

Yes. Especially for any MAP_PRIVATE mappings.

If you want to write to something that's not mapped writable in a
MAP_PRIVATE mapping, it
a) Has to be an exclusive anonymous page
b) The pte has to be dirty

In any other case, you clearly missed doing the COW, or the modifications
might get lost if the PTE is not dirty.

MAP_SHARED is a bit more involved. But whether the pte is dirty might be
good enough ... but this needs a lot more care.
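
Just to spell out the MAP_PRIVATE rule as code -- a hypothetical helper
for illustration only, not an existing function:

	/* Can we force-write without going through the write-fault handler? */
	static bool can_reuse_pte_for_write(struct vm_area_struct *vma,
					    pte_t pte, struct page *page)
	{
		if (vma->vm_flags & VM_SHARED)
			return false;	/* MAP_SHARED needs a lot more care */
		/* a) exclusive anonymous page and b) pte already dirty */
		return page && PageAnon(page) && PageAnonExclusive(page) &&
		       pte_dirty(pte);
	}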

> 
>>
>> But yeah, it's all confusing so I might just be wrong regarding
>> pagecache pages.
> 
> Just to note: I am not very courageous and I did not intend to change
> condition for when non-anonymous pages are set as writable. That’s the
> reason I did not change the dirty for non-writable non-anonymous entries (as
> Peter said). And that’s the reason that setting the dirty bit (at least as I
> should have done it) is only performed after we made the decision on the
> write-bit.

Good. As long as we stick to anonymous pages I roughly know what we
can and cannot do at this point :)


The problem I see is that detecting whether we can write requires the
dirty bit ... and whether to set the dirty bit requires knowing
whether we can write.

Again, for anonymous pages we should be able to relax the "dirty"
requirement when detecting whether we can write.

> 
> IOW, after you made your decision about the write-bit, then and only then
> you may be able to set the dirty bit for writable entries. Since the entry
> is already writeable (i.e., can be written without a fault later directly
> from userspace), there should be no concern of correctness when you set it.

That makes sense to me. What keeps confusing me are architectures with
and without a hw-managed dirty bit ... :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:38                     ` David Hildenbrand
@ 2022-07-20 20:56                       ` Nadav Amit
  2022-07-21  7:52                         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 20:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 1:38 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 22:22, Nadav Amit wrote:
>> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 20.07.22 21:48, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>> 
>>>>>> Could you elaborate?
>>>>> 
>>>>> Write-fault handling for some filesystems (that even require this
>>>>> "slow path") is a bit special.
>>>>> 
>>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>> 
>>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>>> end up corrupting data.
>>>>> 
>>>> That's why the old change_pte_range() code never touched
>>>> anything if the pte wasn't already dirty.
>>>> 
>>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>>> to check pte_dirty().
>>> 
>>> I might be too tired, but the whole reason we had this magic before my
>>> commit in place was only for the pagecache.
>>> 
>>> With vma_wants_writenotify()=0 you can directly map the pages writable
>>> and don't have to do these advanced checks here. In a writable
>>> MAP_SHARED VMA you'll already have pte_write().
>>> 
>>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>> 
>>> try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> 
>>> and that's the code that checked the dirty bit after all to decide --
>>> amongst other things -- if we can simply map it writable without going
>>> via the write fault handler and triggering do_shared_fault() .
>>> 
>>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
>> 
>> I thought you want to get rid of it at least for anonymous pages. No?
> 
> Yes. Especially for any MAP_PRIVATE mappings.
> 
> If you want to write to something that's not mapped writable in a
> MAP_PRIVATE mapping it
> a) Has to be an exclusive anonymous page
> b) The pte has to be dirty

Do you need both conditions to be true? I thought (a) is sufficient (if
the soft-dirty and related checks succeed).

> 
> In any other case, you clearly missed to COW or the modifications might
> get lost if the PTE is not dirty.
> 
> MAP_SHARED is a bit more involved. But whether the pte is dirty might be
> good enough ... but this needs a lot more care.
> 
>>> But yeah, it's all confusing so I might just be wrong regarding
>>> pagecache pages.
>> 
>> Just to note: I am not very courageous and I did not intend to change
>> condition for when non-anonymous pages are set as writable. That’s the
>> reason I did not change the dirty for non-writable non-anonymous entries (as
>> Peter said). And that’s the reason that setting the dirty bit (at least as I
>> should have done it) is only performed after we made the decision on the
>> write-bit.
> 
> Good. As long as we stick to anonymous pages I roughly know what we
> can and cannot do at this point :)
> 
> 
> The problem I see is that detection whether we can write requires the
> dirty bit ... and whether to set the dirty bit requires the information
> whether we can write.
> 
> Again, for anonymous pages we should be able to relax the "dirty"
> requirement when detecting whether we can write.

That’s all I wanted to do there.

> 
>> IOW, after you made your decision about the write-bit, then and only then
>> you may be able to set the dirty bit for writable entries. Since the entry
>> is already writeable (i.e., can be written without a fault later directly
>> from userspace), there should be no concern of correctness when you set it.
> 
> That makes sense to me. What keeps confusing me are architectures with
> and without a hw-managed dirty bit ... :)

I don’t know which arch you have in your mind. But the moment a PTE is
writable, then marking it logically/architecturally as dirty should be
fine.

But… if the Exclusive check is not good enough for private+anon without
the “logical” dirty bit, then there would be a problem. 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 17:25     ` Nadav Amit
@ 2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:25, Nadav Amit wrote:
> On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:02, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> Anonymous pages might have the dirty bit clear, but this should not
>>> prevent mprotect from making them writable if they are exclusive.
>>> Therefore, skip the test whether the page is dirty in this case.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>> mm/mprotect.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 34c2dfb68c42..da5b9bf8204f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>>
>>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>>
>>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>>> +	if (pte_protnone(pte))
>>> 		return false;
>>>
>>> 	/* Do we need write faults for softdirty tracking? */
>>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> 		page = vm_normal_page(vma, addr, pte);
>>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> 			return false;
>>> -	}
>>> +	} else if (!pte_dirty(pte))
>>> +		return false;
>>>
>>> 	return true;
>>> }
>>
>> When I wrote that code, I was wondering how often that would actually
>> happen in practice -- and if we care about optimizing that. Do you have
>> a gut feeling in which scenarios this would happen and if we care?
>>
>> If the page is in the swapcache and was swapped out, you'd be requiring
>> a writeback even though nobody modified the page and possibly nobody is
>> going to do so in the near future.
> 
> So here is my due diligence: I did not really encounter a scenario in which
> it showed up. When I looked at your code, I assumed this was an oversight
> and not a thoughtful decision. For me the issue is more of the discrepancy
> between how a certain page is handled before and after it was pages out.
> 
> The way that I see it, there is a tradeoff in the way dirty-bit should
> be handled:
> (1) Writable-clean PTEs introduce some non-negligible overhead.
> (2) Marking a PTE dirty speculatively would require a write back.
> 
> … But this tradeoff should not affect whether a PTE is writable, i.e.,
> mapping the PTE as writable-clean should not cause a writeback. In other
> words, if you are concerned about unnecessary writebacks, which I think is a
> fair concern, then do not set the dirty-bit. In a later patch I try to avoid
> TLB flushes on clean-writable entries that are write-protected.
> 
> So I do not think that the writeback you mentioned should be a real issue.
> Yet if you think that using the fact that the page is not-dirty is a good
> heuristic to avoid future TLB flushes (for P->NP; as I said there is a
> solution for RW->RO), or if you are concerned about the cost of
> vm_normal_page(), perhaps those are valid concerns (although I do not think
> so).

I think I now understand what you mean. I somehow remembered that some
architectures set a PTE dirty when marking it writable, but I guess this
is not true -- and setting it writable will keep it !dirty until really
accessed (either by the HW or by a fault). [I'll do some more digging
just to confirm]

With that in mind, your patch makes sense, and I guess you'd want that
as patch #1, because otherwise I fail to see how current patch #1 would
even succeed in reaching the "pte_mkdirty" call -- if !pte_dirty()
protects the code from running.

> 
> --
> 
> [ Regarding (1): After some discussions with Peter and reading more code, I
> thought at some point that perhaps avoiding writable-clean PTEs as
> much as possible makes sense [*], since setting the dirty-bit costs ~550
> cycles and a page fault is not a lot more than 1000. But with all the
> mitigations (and after adding IBRS for retbleed) page-fault entry is kind of
> expensive. 

I understand the reasoning for anonymous memory, but not for page cache
pages. And for anonymous memory I think there are still cases where we
don't want to do that (swapcache and MADV_FREE come to mind) -- and
IMHO, in some of the scenarios where we have clean anonymous memory at all,
we just don't care about optimizing for setting the dirty bit faster on
x86 ;).

So my gut feeling is to keep it as simple as possible. To me, it
translates to setting the dirty bit only if we have clear indication
that write access is likely next.
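
On the caller side I'd assume that translates to only userfaultfd's
write-unprotect passing the hint -- presumably something along these
lines in mwriteprotect_range() (guessing at the patch, not quoting it):

	if (!enable_wp)
		cp_flags |= MM_CP_WILL_NEED;	/* caller expects a write soon */

with plain mprotect() never setting MM_CP_WILL_NEED.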

Thanks Nadav for refreshing my memory :) !

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:56                       ` Nadav Amit
@ 2022-07-21  7:52                         ` David Hildenbrand
  2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

>> Yes. Especially for any MAP_PRIVATE mappings.
>>
>> If you want to write to something that's not mapped writable in a
>> MAP_PRIVATE mapping it
>> a) Has to be an exclusive anonymous page
>> b) The pte has to be dirty
> 
> Do you need both conditions to be true? I thought (a) is sufficient (if
> the soft-dirty and related checks succeed).

If we force-write to a page, we need it to be dirty to tell reclaim code
that the content is stale. We can either mark the pte dirty manually, or
just let the write fault handler deal with it to simplify GUP code. This
needs some more thought, but that's my understanding.

> 
>>
>> In any other case, you clearly missed to COW or the modifications might
>> get lost if the PTE is not dirty.
>>
>> MAP_SHARED is a bit more involved. But whether the pte is dirty might be
>> good enough ... but this needs a lot more care.
>>
>>>> But yeah, it's all confusing so I might just be wrong regarding
>>>> pagecache pages.
>>>
>>> Just to note: I am not very courageous and I did not intend to change
>>> condition for when non-anonymous pages are set as writable. That’s the
>>> reason I did not change the dirty for non-writable non-anonymous entries (as
>>> Peter said). And that’s the reason that setting the dirty bit (at least as I
>>> should have done it) is only performed after we made the decision on the
>>> write-bit.
>>
>> Good. As long as we stick to anonymous pages I roughly know what we
>> can and cannot do at this point :)
>>
>>
>> The problem I see is that detection whether we can write requires the
>> dirty bit ... and whether to set the dirty bit requires the information
>> whether we can write.
>>
>> Again, for anonymous pages we should be able to relax the "dirty"
>> requirement when detecting whether we can write.
> 
> That’s all I wanted to do there.
> 
>>
>>> IOW, after you made your decision about the write-bit, then and only then
>>> you may be able to set the dirty bit for writable entries. Since the entry
>>> is already writeable (i.e., can be written without a fault later directly
>>> from userspace), there should be no concern of correctness when you set it.
>>
>> That makes sense to me. What keeps confusing me are architectures with
>> and without a hw-managed dirty bit ... :)
> 
> I don’t know which arch you have in your mind. But the moment a PTE is
> writable, then marking it logically/architecturally as dirty should be
> fine.
> 
> But… if the Exclusive check is not good enough for private+anon without
> the “logical” dirty bit, then there would be a problem. 

I think we are good for anonymous pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-21  7:52                         ` David Hildenbrand
@ 2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21 14:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 21.07.22 09:52, David Hildenbrand wrote:
>>> Yes. Especially for any MAP_PRIVATE mappings.
>>>
>>> If you want to write to something that's not mapped writable in a
>>> MAP_PRIVATE mapping it
>>> a) Has to be an exclusive anonymous page
>>> b) The pte has to be dirty
>>
>> Do you need both conditions to be true? I thought (a) is sufficient (if
>> the soft-dirty and related checks succeed).
> 
> If we force-write to a page, we need it to be dirty to tell reclaim code
> that the content is stale. We can either mark the pte dirty manually, or
> just let the write fault handler deal with it to simplify GUP code. This
> needs some more thought, but that's my understanding.

Extending on my previous answer after staring at the code

a) I have to dig if the FOLL_FORCE special-retry-handling is required
for MAP_SHARED at all.

check_vma_flags() allows FOLL_FORCE only on MAP_PRIVATE VMAs that lack
VM_WRITE.
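
From memory, the relevant part of check_vma_flags() in mm/gup.c looks
roughly like this (trimmed):

	if (write) {
		if (!(vm_flags & VM_WRITE)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/* FOLL_FORCE writes only make sense on COW mappings */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}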

Consequently, I would have assumed that the first write fault should be
sufficient on a MAP_SHARED VMA to actually map the PTE writable and not
require any of that special retry magic.


b) I wonder if we have to take care of uffd-wp and softdirty (just like
in mprotect code here) as well in case we stumble over an exclusive
anonymous page. Yes, the VMA is currently not writable, but I'd have
expected at least softdirty tracking to apply.

... I'll dig into the details.
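
For reference, the checks I have in mind are the ones from the mprotect
path here, roughly (from memory, modulo config gates and the exact
helpers):

	/* softdirty tracking: need a write fault unless already tracked */
	if (!(vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
		return false;

	/* uffd-wp: don't map a wp-marked pte writable behind uffd's back */
	if (userfaultfd_pte_wp(vma, pte))
		return false;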

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-07-21 14:11 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
2022-07-19 20:47   ` Peter Xu
2022-07-20  9:39     ` David Hildenbrand
2022-07-20 13:10       ` Peter Xu
2022-07-20 15:10         ` David Hildenbrand
2022-07-20 19:15           ` Peter Xu
2022-07-20 19:33             ` David Hildenbrand
2022-07-20 19:48               ` Peter Xu
2022-07-20 19:55                 ` David Hildenbrand
2022-07-20 20:22                   ` Nadav Amit
2022-07-20 20:38                     ` David Hildenbrand
2022-07-20 20:56                       ` Nadav Amit
2022-07-21  7:52                         ` David Hildenbrand
2022-07-21 14:10                           ` David Hildenbrand
2022-07-20  9:42   ` David Hildenbrand
2022-07-20 17:36     ` Nadav Amit
2022-07-20 18:00       ` David Hildenbrand
2022-07-20 18:09         ` Nadav Amit
2022-07-20 18:11           ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
2022-07-19 20:49   ` Peter Xu
2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
2022-07-20 15:19   ` David Hildenbrand
2022-07-20 17:25     ` Nadav Amit
2022-07-21  7:45       ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
