* [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes
@ 2022-03-11 19:07 Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 1/5] x86: Detection of Knights Landing A/D leak Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Andrew Morton, Nadav Amit

From: Nadav Amit <namit@vmware.com>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set makes it through, similar
and further optimizations for MADV_COLD and userfaultfd would be
possible.

Sorry for the time it took me to get to v3.

Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and
   do better/fewer flushes. This would also be handy for later
   userfaultfd enhancements.
2. Avoid TLB flushes on permission promotion. This optimization is
   the one that provides most of the performance benefits. Note that
   the previous batching infrastructure changes are needed for that to
   happen.
3. Avoid TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.

Andrew asked for some benchmark numbers. I do not have a deterministic
macrobenchmark that easily shows the benefit. I therefore ran a
microbenchmark: a loop that does the following on anonymous memory,
just as a sanity check to see that time is saved by avoiding TLB
flushes. The loop goes:

	mprotect(p, PAGE_SIZE, PROT_READ)
	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
	*p = 0; // make the page writable

The test was run in a KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (in cycles) of each operation:

		1 thread		2 threads
		mmots	+patch		mmots	+patch
PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are 2 interesting results though. 

(1) mprotect(PROT_READ) is cheaper, although one might expect it not to
be affected. This is presumably due to the TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result, both operations on the patched kernel take roughly 1500
cycles (with either 1 or 2 threads), whereas on mmots their cost is as
high as presented in the table.
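
For reference, a minimal sketch of such a measurement loop (illustrative
only; the actual harness, the pinning, error handling and the optional
second busy-looping thread are omitted):

	#include <stdio.h>
	#include <sys/mman.h>
	#include <x86intrin.h>

	#ifndef PAGE_SIZE
	#define PAGE_SIZE	4096
	#endif
	#define ITERS		100000

	int main(void)
	{
		volatile char *p = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		unsigned long long ro_cycles = 0, rw_cycles = 0;
		unsigned int aux;

		for (int i = 0; i < ITERS; i++) {
			unsigned long long t0 = __rdtscp(&aux);
			mprotect((void *)p, PAGE_SIZE, PROT_READ);
			unsigned long long t1 = __rdtscp(&aux);
			mprotect((void *)p, PAGE_SIZE, PROT_READ | PROT_WRITE);
			unsigned long long t2 = __rdtscp(&aux);
			*p = 0;		/* write, so the next iteration demotes */

			ro_cycles += t1 - t0;
			rw_cycles += t2 - t1;
		}

		printf("PROT_READ: %llu\nPROT_READ|PROT_WRITE: %llu\n",
		       ro_cycles / ITERS, rw_cycles / ITERS);
		return 0;
	}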

--

v2 -> v3:
* Fix order of patches (the previous order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes, to follow the way
  TLB flushing is done on x86 in the kernel today.

v1 -> v2:
* Fix wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of mmu_gather



Nadav Amit (5):
  x86: Detection of Knights Landing A/D leak
  x86/mm: check exec permissions on fault
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm: avoid unnecessary flush on change_huge_pmd()

 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/pgtable.h       |  5 ++
 arch/x86/include/asm/pgtable_types.h |  2 +
 arch/x86/include/asm/tlbflush.h      | 82 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel.c          |  5 ++
 arch/x86/mm/fault.c                  | 22 ++++++-
 arch/x86/mm/pgtable.c                | 10 +++
 fs/exec.c                            |  6 +-
 include/asm-generic/tlb.h            | 14 +++++
 include/linux/huge_mm.h              |  5 +-
 include/linux/mm.h                   |  5 +-
 include/linux/pgtable.h              | 20 ++++++
 mm/huge_memory.c                     | 19 ++++--
 mm/mprotect.c                        | 94 +++++++++++++++-------------
 mm/pgtable-generic.c                 |  8 +++
 mm/userfaultfd.c                     |  6 +-
 16 files changed, 248 insertions(+), 56 deletions(-)

-- 
2.25.1



* [RESEND PATCH v3 1/5] x86: Detection of Knights Landing A/D leak
  2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
@ 2022-03-11 19:07 ` Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andi Kleen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Knights Landing has an issue: a thread setting the A or D bits may not
do so atomically with respect to checking the present bit.  A thread
which is about to page fault may still set those bits, even though the
present bit was already atomically cleared.

This implies that after the kernel clears the present bit atomically,
the supposedly zero entry could later be corrupted with stray A or D
bits.

Since the PTE could already be used for storing a swap index or a NUMA
migration index, this cannot be tolerated. Most of the time the kernel
detects the problem, but in some rare cases it may not.

Add an interface to detect the bug, which will be used in a following
patch.
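
Schematically, the problematic interleaving is along these lines
(illustrative only, not the actual hardware behavior):

	CPU 0 (kernel)                      CPU 1 (KNL A/D assist)
	--------------                      ----------------------
	                                    starts setting A/D for *ptep;
	                                    the present check is not atomic
	xchg(ptep, 0)   /* clear present */
	store a swap/migration entry
	                                    *ptep |= _PAGE_DIRTY   /* stray bit
	                                       corrupts the new entry */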

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Link: https://lore.kernel.org/lkml/1465919919-2093-1-git-send-email-lukasz.anaczkowski@intel.com/
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/intel.c        | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 67ef0e81c7dc..184b299dbf12 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -442,5 +442,6 @@
 #define X86_BUG_TAA			X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */
 #define X86_BUG_ITLB_MULTIHIT		X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */
 #define X86_BUG_SRBDS			X86_BUG(24) /* CPU may leak RNG bits if not mitigated */
+#define X86_BUG_PTE_LEAK		X86_BUG(25) /* PTE may leak A/D bits after clear */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..74780fef3f12 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		}
 	}
 
+	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
+		pr_info_once("Enabling PTE leaking workaround\n");
+		set_cpu_bug(c, X86_BUG_PTE_LEAK);
+	}
+
 	/*
 	 * Intel Quark Core DevMan_001.pdf section 6.4.11
 	 * "The operating system also is required to invalidate (i.e., flush)
-- 
2.25.1



* [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 1/5] x86: Detection of Knights Landing A/D leak Nadav Amit
@ 2022-03-11 19:07 ` Nadav Amit
  2022-03-11 19:41   ` Dave Hansen
  2022-03-11 19:07 ` [RESEND PATCH v3 3/5] mm/mprotect: use mmu_gather Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andrew Cooper, Andy Lutomirski, Dave Hansen, Peter Xu,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.

It appears not to be an issue so far, but the next patches avoid TLB
flushes on permission promotion, which can lead to this scenario. nodejs
for instance crashes when TLB flush is avoided on permission promotion.

Add a check to prevent access_error() from mistakenly treating spurious
page-faults due to instruction fetch as an access error.

The "instruction fetch" and "write" bits in the hardware error code are
assumed to be mutually exclusive, and the change relies on that.
However, to be on the safe side, especially if hypervisors misbehave,
assert this is the case and warn otherwise.
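
For illustration, once the TLB flush on permission promotion is avoided,
a JIT-like sequence such as the following (simplified, error handling
omitted) can leave a stale no-exec TLB entry behind and reach the code
path that this patch fixes:

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memcpy(p, code, len);			 /* page is mapped with NX set */
	mprotect(p, len, PROT_READ | PROT_EXEC); /* promotion: no TLB flush */
	((void (*)(void))p)();			 /* stale NX TLB entry triggers a
						    spurious #PF with X86_PF_INSTR
						    on a VMA that has VM_EXEC */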

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d0074c6ed31a..ad0ef0a6087a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
-	if (error_code & X86_PF_WRITE) {
+	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
+		/*
+		 * CPUs are not expected to set the two error code bits
+		 * together, but to ensure that hypervisors do not misbehave,
+		 * run an additional sanity check.
+		 */
+		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
+					(X86_PF_WRITE|X86_PF_INSTR)) {
+			WARN_ON_ONCE(1);
+			return 1;
+		}
+
 		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+		if ((error_code & X86_PF_WRITE) &&
+		    unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+
+		/* exec, present and exec, not present: */
+		if ((error_code & X86_PF_INSTR) &&
+		    unlikely(!(vma->vm_flags & VM_EXEC)))
 			return 1;
+
 		return 0;
 	}
 
-- 
2.25.1



* [RESEND PATCH v3 3/5] mm/mprotect: use mmu_gather
  2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 1/5] x86: Detection of Knights Landing A/D leak Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault Nadav Amit
@ 2022-03-11 19:07 ` Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd() Nadav Amit
  4 siblings, 0 replies; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andrew Cooper, Andy Lutomirski, Dave Hansen, Peter Xu,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them at finer
granularity.

The use of mmu_gather for modified PTEs has benefits in various
scenarios even if pages are not released. For instance, if only a single
page needs to be flushed out of a range of many pages, only that page
would be flushed. If a THP page is flushed, on x86 a single TLB invlpg
instruction can be used instead of 512 instructions (or a full TLB
flush, which Linux would actually use by default). mprotect() over
multiple VMAs requires only a single flush.

Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().

Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
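
The resulting caller-side pattern is roughly (sketch; see the
do_mprotect_pkey() and mwriteprotect_range() hunks below):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	/* change_protection() may be called for several VMAs in a row */
	change_protection(&tlb, vma, start, end, newprot, cp_flags);
	tlb_finish_mmu(&tlb);	/* one batched, ranged flush for everything */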

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/exec.c               |  6 ++-
 include/linux/huge_mm.h |  5 ++-
 include/linux/mm.h      |  5 ++-
 mm/huge_memory.c        | 10 ++++-
 mm/mprotect.c           | 93 ++++++++++++++++++++++-------------------
 mm/userfaultfd.c        |  6 ++-
 6 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a39108c1190a..1c46b7d7c32c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -759,6 +759,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	unsigned long stack_size;
 	unsigned long stack_expand;
 	unsigned long rlim_stack;
+	struct mmu_gather tlb;
 
 #ifdef CONFIG_STACK_GROWSUP
 	/* Limit stack size */
@@ -813,8 +814,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	vm_flags |= mm->def_flags;
 	vm_flags |= VM_STACK_INCOMPLETE_SETUP;
 
-	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
+	tlb_gather_mmu(&tlb, mm);
+	ret = mprotect_fixup(&tlb, vma, &prev, vma->vm_start, vma->vm_end,
 			vm_flags);
+	tlb_finish_mmu(&tlb);
+
 	if (ret)
 		goto out_unlock;
 	BUG_ON(prev != vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..9a26bd10e083 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,8 +36,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
 		 unsigned long addr);
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		    pgprot_t newprot, unsigned long cp_flags);
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
 				   pgprot_t pgprot, bool write);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc69d2a69912..b2574bbdb76d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1969,10 +1969,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
-extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct mmu_gather *tlb,
+			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
 			      unsigned long cp_flags);
-extern int mprotect_fixup(struct vm_area_struct *vma,
+extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3557aabe86fe..d58a5b498011 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1692,8 +1692,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  *      or if prot_numa but THP migration is not supported
  *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
  */
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1704,6 +1705,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
+	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
 	if (prot_numa && !thp_migration_supported())
 		return 1;
 
@@ -1799,6 +1802,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
+
+	tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
 	spin_unlock(ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..f9730bac2d78 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,12 +33,13 @@
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #include "internal.h"
 
-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end, pgprot_t newprot,
-		unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -49,6 +50,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
+	tlb_change_page_size(tlb, PAGE_SIZE);
+
 	/*
 	 * Can be called with only the mmap_lock for reading by
 	 * prot_numa so we must check the pmd isn't constantly
@@ -149,6 +152,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -230,9 +234,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
-		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -272,8 +276,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
-				int nr_ptes = change_huge_pmd(vma, pmd, addr,
-							      newprot, cp_flags);
+				/*
+				 * change_huge_pmd() does not defer TLB flushes,
+				 * so no need to propagate the tlb argument.
+				 */
+				int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+						addr, newprot, cp_flags);
 
 				if (nr_ptes) {
 					if (nr_ptes == HPAGE_PMD_NR) {
@@ -287,8 +295,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-					      cp_flags);
+		this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+					      newprot, cp_flags);
 		pages += this_pages;
 next:
 		cond_resched();
@@ -302,9 +310,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	return pages;
 }
 
-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
-		p4d_t *p4d, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -315,16 +323,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		pages += change_pmd_range(vma, pud, addr, next, newprot,
+		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
 }
 
-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
-		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -335,44 +343,40 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		pages += change_pud_range(vma, p4d, addr, next, newprot,
+		pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
 					  cp_flags);
 	} while (p4d++, addr = next, addr != end);
 
 	return pages;
 }
 
-static unsigned long change_protection_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end, pgprot_t newprot,
-		unsigned long cp_flags)
+static unsigned long change_protection_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start = addr;
 	unsigned long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
-	flush_cache_range(vma, addr, end);
-	inc_tlb_flush_pending(mm);
+	tlb_start_vma(tlb, vma);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		pages += change_p4d_range(vma, pgd, addr, next, newprot,
+		pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
 					  cp_flags);
 	} while (pgd++, addr = next, addr != end);
 
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
-		flush_tlb_range(vma, start, end);
-	dec_tlb_flush_pending(mm);
+	tlb_end_vma(tlb, vma);
 
 	return pages;
 }
 
-unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct mmu_gather *tlb,
+		       struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
 		       unsigned long cp_flags)
 {
@@ -383,7 +387,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot,
+		pages = change_protection_range(tlb, vma, start, end, newprot,
 						cp_flags);
 
 	return pages;
@@ -417,8 +421,9 @@ static const struct mm_walk_ops prot_none_walk_ops = {
 };
 
 int
-mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
-	unsigned long start, unsigned long end, unsigned long newflags)
+mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
+	       struct vm_area_struct **pprev, unsigned long start,
+	       unsigned long end, unsigned long newflags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long oldflags = vma->vm_flags;
@@ -505,7 +510,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
 	vma_set_page_prot(vma);
 
-	change_protection(vma, start, end, vma->vm_page_prot,
+	change_protection(tlb, vma, start, end, vma->vm_page_prot,
 			  dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
 
 	/*
@@ -539,6 +544,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
 	const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
 				(prot & PROT_READ);
+	struct mmu_gather tlb;
 
 	start = untagged_addr(start);
 
@@ -598,6 +604,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	else
 		prev = vma->vm_prev;
 
+	tlb_gather_mmu(&tlb, current->mm);
 	for (nstart = start ; ; ) {
 		unsigned long mask_off_old_flags;
 		unsigned long newflags;
@@ -624,18 +631,18 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
 		if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) {
 			error = -EACCES;
-			goto out;
+			goto out_tlb;
 		}
 
 		/* Allow architectures to sanity-check the new flags */
 		if (!arch_validate_flags(newflags)) {
 			error = -EINVAL;
-			goto out;
+			goto out_tlb;
 		}
 
 		error = security_file_mprotect(vma, reqprot, prot);
 		if (error)
-			goto out;
+			goto out_tlb;
 
 		tmp = vma->vm_end;
 		if (tmp > end)
@@ -644,27 +651,29 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		if (vma->vm_ops && vma->vm_ops->mprotect) {
 			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
 			if (error)
-				goto out;
+				goto out_tlb;
 		}
 
-		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+		error = mprotect_fixup(&tlb, vma, &prev, nstart, tmp, newflags);
 		if (error)
-			goto out;
+			goto out_tlb;
 
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)
 			nstart = prev->vm_end;
 		if (nstart >= end)
-			goto out;
+			goto out_tlb;
 
 		vma = prev->vm_next;
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
-			goto out;
+			goto out_tlb;
 		}
 		prot = reqprot;
 	}
+out_tlb:
+	tlb_finish_mmu(&tlb);
 out:
 	mmap_write_unlock(current->mm);
 	return error;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e9bb6db002aa..0fd7fc8010d7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 #include "internal.h"
 
 static __always_inline
@@ -687,6 +688,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			atomic_t *mmap_changing)
 {
 	struct vm_area_struct *dst_vma;
+	struct mmu_gather tlb;
 	pgprot_t newprot;
 	int err;
 
@@ -728,8 +730,10 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	else
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
-	change_protection(dst_vma, start, start + len, newprot,
+	tlb_gather_mmu(&tlb, dst_mm);
+	change_protection(&tlb, dst_vma, start, start + len, newprot,
 			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+	tlb_finish_mmu(&tlb);
 
 	err = 0;
 out_unlock:
-- 
2.25.1



* [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion
  2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
                   ` (2 preceding siblings ...)
  2022-03-11 19:07 ` [RESEND PATCH v3 3/5] mm/mprotect: use mmu_gather Nadav Amit
@ 2022-03-11 19:07 ` Nadav Amit
  2022-03-11 22:45   ` Nadav Amit
  2022-03-11 19:07 ` [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd() Nadav Amit
  4 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Currently, using mprotect() or userfaultfd to unprotect a memory region
causes a TLB flush. At least on x86, no TLB flush is needed when
protection is promoted.

Add an arch-specific pte_needs_flush() which tells whether a TLB flush
is needed based on the old PTE and the new one. Implement an x86
pte_needs_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit. If the dirty-bit is clear and write-protect is
set, no TLB flush is needed, as x86 updates the dirty-bit atomically
on write, and if the bit is clear, the PTE is reread.
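
For illustration, hand-evaluating the helper added below on a few
simplified flag combinations (the access bit is ignored, as is done for
PTEs):

	/* RO -> RW promotion of a clean, present PTE: no flush needed */
	pte_flags_need_flush(_PAGE_PRESENT | _PAGE_ACCESSED,
			     _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_RW,
			     _PAGE_ACCESSED);		/* -> false */

	/* RW -> RO demotion (write-protection): flush needed */
	pte_flags_need_flush(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_RW,
			     _PAGE_PRESENT | _PAGE_ACCESSED,
			     _PAGE_ACCESSED);		/* -> true */

	/* Setting NX (removing exec): flush needed */
	pte_flags_need_flush(_PAGE_PRESENT | _PAGE_ACCESSED,
			     _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_NX,
			     _PAGE_ACCESSED);		/* -> true */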

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable_types.h |  2 +
 arch/x86/include/asm/tlbflush.h      | 82 ++++++++++++++++++++++++++++
 include/asm-generic/tlb.h            | 14 +++++
 mm/huge_memory.c                     |  9 +--
 mm/mprotect.c                        |  3 +-
 5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..8668bc661026 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -110,9 +110,11 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_SOFTW4	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
+#define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 98fa0a114074..ec01e6cff137 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,88 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
+/*
+ * The enable_mask tells which bits require a flush when they were present
+ * (set in oldflags) and get cleared.
+ *
+ * The disable_mask tells which bits require a flush when they were missing
+ * (clear in oldflags) and get set.
+ *
+ * All the other bits, except the ignored bits, require a flush no matter if
+ * they get set or cleared.
+ *
+ * This function ignores the global bit, as it is used for protnone. It
+ * should therefore not be used if the global bit might really be cleared.
+ *
+ * The function allows the access bit to be ignored (pass _PAGE_ACCESSED as
+ * the ignore_access argument to do so). The expected use is that the access
+ * bit would be ignored for PTEs but not for PMDs. That is the way the
+ * kernel performs TLB flushes after updates of the access-bit in other
+ * situations.
+ */
+static inline bool pte_flags_need_flush(unsigned long oldflags,
+					unsigned long newflags,
+					pteval_t ignore_access)
+{
+	const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+		_PAGE_SOFTW3 | _PAGE_SOFTW4 | _PAGE_GLOBAL | ignore_access;
+	const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT |
+				     (_PAGE_ACCESSED & ~ignore_access);
+	const pteval_t disable_mask = _PAGE_NX;
+	unsigned long diff = oldflags ^ newflags;
+
+	VM_BUG_ON(ignore_access != 0 && ignore_access != _PAGE_ACCESSED);
+
+	return diff & ((oldflags & enable_mask) |
+		       (newflags & disable_mask) |
+		       ~(enable_mask | disable_mask | ignore_mask));
+}
+
+/*
+ * pte_needs_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace PTEs.
+ */
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+	/* !PRESENT -> * ; no need for flush */
+	if (!pte_present(oldpte))
+		return false;
+
+	/* PRESENT -> !PRESENT ; needs flush */
+	if (!pte_present(newpte))
+		return true;
+
+	/* PFN changed ; needs flush */
+	if (pte_pfn(oldpte) != pte_pfn(newpte))
+		return true;
+
+	return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+				    _PAGE_ACCESSED);
+}
+#define pte_needs_flush pte_needs_flush
+
+/*
+ * huge_pmd_needs_flush() checks whether permissions were demoted and
+ * require a flush. It should only be used for userspace huge PMDs.
+ */
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+	/* !PRESENT -> * ; no need for flush */
+	if (!pmd_present(oldpmd))
+		return false;
+
+	/* PRESENT -> !PRESENT ; needs flush */
+	if (!pmd_present(newpmd))
+		return true;
+
+	/* PFN changed ; needs flush */
+	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+		return true;
+
+	return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd), 0);
+}
+#define huge_pmd_needs_flush huge_pmd_needs_flush
+
 #endif /* !MODULE */
 
 static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index fd7feb5c7894..3a30e23fa35d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,20 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 	} while (0)
 #endif
 
+#ifndef pte_needs_flush
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+	return true;
+}
+#endif
+
+#ifndef huge_pmd_needs_flush
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+	return true;
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d58a5b498011..51b0f3cb1ba0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1698,7 +1698,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
-	pmd_t entry;
+	pmd_t oldpmd, entry;
 	bool preserve_write;
 	int ret;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -1784,9 +1784,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pmdp_invalidate() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	entry = pmdp_invalidate(vma, addr, pmd);
+	oldpmd = pmdp_invalidate(vma, addr, pmd);
 
-	entry = pmd_modify(entry, newprot);
+	entry = pmd_modify(oldpmd, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
 	if (uffd_wp) {
@@ -1803,7 +1803,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+	if (huge_pmd_needs_flush(oldpmd, entry))
+		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
 
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f9730bac2d78..97967d589ddb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_mkwrite(ptent);
 			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+			if (pte_needs_flush(oldpte, ptent))
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
-- 
2.25.1



* [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd()
  2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
                   ` (3 preceding siblings ...)
  2022-03-11 19:07 ` [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion Nadav Amit
@ 2022-03-11 19:07 ` Nadav Amit
  2022-03-11 20:41   ` Dave Hansen
  4 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 19:07 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andrew Cooper, Andy Lutomirski, Dave Hansen, Peter Xu,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.

The first TLB flush is only necessary to prevent the dirty bit (and with
a lesser importance the access bit) from changing while the PTE is
modified. However, this is not necessary as the x86 CPUs set the
dirty-bit atomically with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing that has a bug and does
not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that invalidates the PMD only against further access and dirty bit
updates.
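
On x86 the difference between the two boils down to the following
(sketch; the generic pmdp_invalidate() lives in mm/pgtable-generic.c and
the new helper is added below):

	/* pmdp_invalidate(): invalidate the PMD and flush unconditionally */
	old = pmdp_establish(vma, addr, pmdp, pmd_mkinvalid(*pmdp));
	flush_pmd_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);

	/* pmdp_invalidate_ad(): flush only where A/D updates are not atomic */
	old = pmdp_establish(vma, addr, pmdp, pmd_mkinvalid(*pmdp));
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);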

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h |  5 +++++
 arch/x86/mm/pgtable.c          | 10 ++++++++++
 include/linux/pgtable.h        | 20 ++++++++++++++++++++
 mm/huge_memory.c               |  4 ++--
 mm/pgtable-generic.c           |  8 ++++++++
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..23ad34edcc4b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1173,6 +1173,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	}
 }
 #endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..b2fcb2c749ce 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -608,6 +608,16 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
+}
 #endif
 
 /**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..5826e8e52619 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -570,6 +570,26 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+
+/*
+ * pmdp_invalidate_ad() invalidates the PMD while changing a transparent
+ * hugepage mapping in the page tables. This function is similar to
+ * pmdp_invalidate(), but should only be used if the access and dirty bits
+ * would not be cleared by the software in the new PMD value. The function
+ * ensures that hardware updates of the access and dirty bits are not lost.
+ *
+ * Doing so allows certain architectures to avoid a TLB flush in most cases.
+ * Another TLB flush might still be necessary later if the PMD update itself
+ * requires one (e.g., if protection was made stricter). Yet, even when a
+ * TLB flush is needed because of the update, the caller may be able to
+ * batch these TLB flushing operations, so fewer TLB flush operations are
+ * needed.
+ */
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51b0f3cb1ba0..691d80edcfd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1781,10 +1781,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
 	 * which may break userspace.
 	 *
-	 * pmdp_invalidate() is required to make sure we don't miss
+	 * pmdp_invalidate_ad() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	oldpmd = pmdp_invalidate(vma, addr, pmd);
+	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(oldpmd, newprot);
 	if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6523fda274e5..90ab721a12a8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -201,6 +201,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
-- 
2.25.1



* Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 19:07 ` [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault Nadav Amit
@ 2022-03-11 19:41   ` Dave Hansen
  2022-03-11 20:38     ` Nadav Amit
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2022-03-11 19:41 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andrew Cooper, Andy Lutomirski, Dave Hansen, Peter Xu,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

On 3/11/22 11:07, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> access_error() currently does not check for execution permission
> violation. As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.

This is a bit muddy on the problem statement.  I get that spurious
faults can theoretically cause this, but *do* they in practice on
current kernels?

> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.

By "it appears not to be an issue", do you mean that this suboptimal
behavior can not be triggered, period?  Or, it can be triggered but
folks seem not to care that it can be triggered?

I *think* these can be triggered today.  I think it takes two threads
that do something like:

	Thread 1			Thread 2
	========			========
	ptr = malloc();
	memcpy(ptr, &code, len);
	exec_now = 1;
					while (!exec_now);
					call(ptr);		
					// fault	
	mprotect(ptr, PROT_EXEC, len);
					// fault sees VM_EXEC


But that has a bug: exec_now is set before the mprotect().  It's not
sane code.

Can any sane code trigger this?

> Add a check to prevent access_error() from mistakenly treating spurious
> page-faults due to instruction fetch as an access error.
> 
> The "instruction fetch" and "write" bits in the hardware error code are
> assumed to be mutually exclusive, and the change relies on that.
> However, to be on the safe side, especially if hypervisors misbehave,
> assert this is the case and warn otherwise.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..ad0ef0a6087a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>  				       (error_code & X86_PF_INSTR), foreign))
>  		return 1;
>  
> -	if (error_code & X86_PF_WRITE) {
> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
> +		/*
> +		 * CPUs are not expected to set the two error code bits
> +		 * together, but to ensure that hypervisors do not misbehave,
> +		 * run an additional sanity check.
> +		 */
> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
> +					(X86_PF_WRITE|X86_PF_INSTR)) {
> +			WARN_ON_ONCE(1);
> +			return 1;
> +		}

access_error() is only used on the do_user_addr_fault() side of things.
 Can we stick this check somewhere that also works for kernel address
faults?  This is a generic sanity check.  It can also be in a separate
patch.

Also, we should *probably* stop talking about CPUs here.  If there's
ever something wonky with error code bits, I'd put my money on a weird
hypervisor before any kind of CPU issue.

>  		/* write, present and write, not present: */
> -		if (unlikely(!(vma->vm_flags & VM_WRITE)))
> +		if ((error_code & X86_PF_WRITE) &&
> +		    unlikely(!(vma->vm_flags & VM_WRITE)))
> +			return 1;
> +
> +		/* exec, present and exec, not present: */
> +		if ((error_code & X86_PF_INSTR) &&
> +		    unlikely(!(vma->vm_flags & VM_EXEC)))
>  			return 1;
> +
>  		return 0;
>  	}

This is getting really ugly.  I think we've gone over this before, but
it escapes me.  Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
block of code?  Why can't we just add a simple X86_PF_INSN if() that
mirrors the current X86_PF_WRITE one?


        if (error_code & X86_PF_INSN) {
                /* present and not exec: */
                if (unlikely(!(vma->vm_flags & VM_EXEC)))
                        return 1;
                return 0;
        }




* Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 19:41   ` Dave Hansen
@ 2022-03-11 20:38     ` Nadav Amit
  2022-03-11 20:59       ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 20:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86



> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 11:07, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> access_error() currently does not check for execution permission
>> violation. As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
> 
> This is a bit muddy on the problem statement.  I get that spurious
> faults can theoretically cause this, but *do* they in practice on
> current kernels?
> 
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
> 
> By "it appears not to be an issue", do you mean that this suboptimal
> behavior can not be triggered, period?  Or, it can be triggered but
> folks seem not to care that it can be triggered?
> 
> I *think* these can be triggered today.  I think it takes two threads
> that do something like:
> 
> 	Thread 1			Thread 2
> 	========			========
> 	ptr = malloc();
> 	memcpy(ptr, &code, len);
> 	exec_now = 1;
> 					while (!exec_now);
> 					call(ptr);		
> 					// fault	
> 	mprotect(ptr, PROT_EXEC, len);
> 					// fault sees VM_EXEC
> 
> 
> But that has a bug: exec_now is set before the mprotect().  It's not
> sane code.
> 
> Can any sane code trigger this?

Well, regarding this question and the previous one: I do not think that
this scenario is possible today since mprotect() holds the mmap_lock
for write. There is no other code that I am aware of that toggles
the NX bit on a present entry.

But I will not bet my life on it. That’s the reason for the somewhat
vague phrasing that I used.

>> 
>> index d0074c6ed31a..ad0ef0a6087a 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>> 				       (error_code & X86_PF_INSTR), foreign))
>> 		return 1;
>> 
>> -	if (error_code & X86_PF_WRITE) {
>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>> +		/*
>> +		 * CPUs are not expected to set the two error code bits
>> +		 * together, but to ensure that hypervisors do not misbehave,
>> +		 * run an additional sanity check.
>> +		 */
>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>> +			WARN_ON_ONCE(1);
>> +			return 1;
>> +		}
> 
> access_error() is only used on the do_user_addr_fault() side of things.
> Can we stick this check somewhere that also works for kernel address
> faults?  This is a generic sanity check.  It can also be in a separate
> patch.

I can wrap it in a different function and also call it from
do_kern_addr_fault() or spurious_kernel_fault().

Anyhow, spurious_kernel_fault() should handle spurious faults on
executable code correctly. 

> 
> Also, we should *probably* stop talking about CPUs here.  If there's
> ever something wonky with error code bits, I'd put my money on a weird
> hypervisor before any kind of CPU issue.

I thought I managed to convey exactly that in the comment. Can you provide
a better phrasing?

> 
>> 		/* write, present and write, not present: */
>> -		if (unlikely(!(vma->vm_flags & VM_WRITE)))
>> +		if ((error_code & X86_PF_WRITE) &&
>> +		    unlikely(!(vma->vm_flags & VM_WRITE)))
>> +			return 1;
>> +
>> +		/* exec, present and exec, not present: */
>> +		if ((error_code & X86_PF_INSTR) &&
>> +		    unlikely(!(vma->vm_flags & VM_EXEC)))
>> 			return 1;
>> +
>> 		return 0;
>> 	}
> 
> This is getting really ugly.  I think we've gone over this before, but
> it escapes me.  Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
> block of code?  Why can't we just add a simple X86_PF_INSN if() that
> mirrors the current X86_PF_WRITE one?
> 
> 
>        if (error_code & X86_PF_INSN) {
>                /* present and not exec: */
>                if (unlikely(!(vma->vm_flags & VM_EXEC)))
>                        return 1;
>                return 0;
>        }

You are correct. My bad. I will fix it.



* Re: [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd()
  2022-03-11 19:07 ` [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd() Nadav Amit
@ 2022-03-11 20:41   ` Dave Hansen
  2022-03-11 20:53     ` Nadav Amit
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2022-03-11 20:41 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andrew Cooper, Andy Lutomirski, Dave Hansen, Peter Xu,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

On 3/11/22 11:07, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Calls to change_protection_range() on THP can trigger, at least on x86,
> two TLB flushes for one page: one immediately, when pmdp_invalidate() is
> called by change_huge_pmd(), and then another one later (that can be
> batched) when change_protection_range() finishes.
> 
> The first TLB flush is only necessary to prevent the dirty bit (and with
> a lesser importance the access bit) from changing while the PTE is
> modified. However, this is not necessary as the x86 CPUs set the
> dirty-bit atomically with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing that has a bug and does
> not do so.

First of all, thank you for your diligence here.  This is a super
obscure issue.  I think I put handling for it in the kernel and I'm not
sure I would have even thought about this angle.

That said, I'm not sure this is all necessary.

Yes, the Dirty bit can get set unexpectedly in some PTEs.  But, the
question is whether it is *VALUABLE* and needs to be preserved.  The
current kernel code pretty much just lets the hardware set the Dirty bit
and then ignores it.  If it were valuable, ignoring it would have been a
bad thing.  We'd be losing data on today's kernels because the hardware
told us about a write that happened but that the kernel ignored.

My mental model of what the microcode responsible for the erratum does
is something along these lines:

	if (write)
		pte |= _PAGE_DIRTY;
	if (!pte_present(pte))
		#PF

The PTE is marked dirty, but the write never actually executes.  The
thread that triggered the A/D setting *also* gets a fault.

I'll double-check with some Intel folks to make sure I'm not missing
something.  But, either way, I don't think we should be going to this
much trouble for the good ol' Xeon Phi.  I doubt there are many still
around and I *REALLY* doubt they're running new kernels.

*If* we need this (and I'm not convinced we do), my first instinct would
be to just do this instead:

	clear_cpu_cap(c, X86_FEATURE_PSE);

on KNL systems.  If anyone cares, they know where to find us.


* Re: [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd()
  2022-03-11 20:41   ` Dave Hansen
@ 2022-03-11 20:53     ` Nadav Amit
  0 siblings, 0 replies; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 20:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86



> On Mar 11, 2022, at 12:41 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 11:07, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Calls to change_protection_range() on THP can trigger, at least on x86,
>> two TLB flushes for one page: one immediately, when pmdp_invalidate() is
>> called by change_huge_pmd(), and then another one later (that can be
>> batched) when change_protection_range() finishes.
>> 
>> The first TLB flush is only necessary to prevent the dirty bit (and with
>> a lesser importance the access bit) from changing while the PTE is
>> modified. However, this is not necessary as the x86 CPUs set the
>> dirty-bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing that has a bug and does
>> not do so.
> 
> First of all, thank you for your diligence here.  This is a super
> obscure issue.  I think I put handling for it in the kernel and I'm not
> sure I would have even thought about this angle.
> 
> That said, I'm not sure this is all necessary.
> 
> Yes, the Dirty bit can get set unexpectedly in some PTEs.  But, the
> question is whether it is *VALUABLE* and needs to be preserved.  The
> current kernel code pretty much just lets the hardware set the Dirty bit
> and then ignores it.  If it were valuable, ignoring it would have been a
> bad thing.  We'd be losing data on today's kernels because the hardware
> told us about a write that happened but that the kernel ignored.
> 
> My mental model of what the microcode responsible for the erratum does
> is something along these lines:
> 
> 	if (write)
> 		pte |= _PAGE_DIRTY;
> 	if (!pte_present(pte))
> 		#PF
> 
> The PTE is marked dirty, but the write never actually executes.  The
> thread that triggered the A/D setting *also* gets a fault.
> 

This makes perfect sense. I guess I misunderstood or forgot the erratum,
but feel free to recheck. It would allow removing the KNL check, and
probably the first patch in this series. I don’t think it would allow
getting rid of pmdp_invalidate_ad(), though, since I do not feel
comfortable just using pmdp_establish() directly: I do not know other
architectures well enough to say that they have the same atomicity
guarantees when it comes to A/D bits.
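
For reference, the split that patch 5/5 keeps is roughly the following
(a sketch, not a verbatim copy of the patch):

	/* generic fallback (e.g., mm/pgtable-generic.c): keep the flush */
	pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
				 unsigned long address, pmd_t *pmdp)
	{
		return pmdp_invalidate(vma, address, pmdp);
	}

	/*
	 * x86 override: A/D bits are set atomically and only on present
	 * entries, so making the PMD non-present is enough; no flush.
	 */
	pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
				 unsigned long address, pmd_t *pmdp)
	{
		return pmdp_establish(vma, address, pmdp,
				      pmd_mkinvalid(*pmdp));
	}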

> I'll double-check with some Intel folks to make sure I'm not missing
> something.  But, either way, I don't think we should be going to this
> much trouble for the good ol' Xeon Phi.  I doubt there are many still
> around and I *REALLY* doubt they're running new kernels.
> 
> *If* we need this (and I'm not convinced we do), my first instinct would
> be to just do this instead:
> 
> 	clear_cpu_cap(c, X86_FEATURE_PSE);
> 
> on KNL systems.  If anyone cares, they know where to find us.

I think the KNL handling is indeed not necessary and your understanding
of the erratum is the right one. Let me know if you find that is not
the case.

BTW: Thanks for the quick response, and sorry for the time it took me
to send v3.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 20:38     ` Nadav Amit
@ 2022-03-11 20:59       ` Dave Hansen
  2022-03-11 21:16         ` Nadav Amit
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2022-03-11 20:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

On 3/11/22 12:38, Nadav Amit wrote:
>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
...
>> Can any sane code trigger this?
> 
> Well, regarding this question and the previous one: I do not think that
> this scenario is possible today since mprotect() holds the mmap_lock
> for write. There is no other code that I am aware of that toggles
> the NX bit on a present entry.
> 
> But I will not bet my life on it. That’s the reason for the somewhat
> vague phrasing that I used.

From the userspace perspective, mmap(MAP_FIXED) can do this too.  But,
sane userspace can't rely on the syscall to have done any work and the
TLB flushing is currently done before the syscall returns.

I'd put it this way:

	Today, it is possible for a thread to end up in access_error()
	for a PF_INSN fault and observe a VM_EXEC VMA.  If you are
	generous, this could be considered a spurious fault.

	However, the faulting thread would have had to race with the
	thread which was changing the PTE and the VMA and is currently
	*in* mprotect() (or some other syscall).  In other words, the
	faulting thread can encounter this situation, but it never had
	any assurance from the kernel that it wouldn't fault.  This is
	because the faulting thread never had a chance to observe the
	syscall return.

	There is no evidence that the existing behavior can cause any
	issues with sane userspace.

>>> index d0074c6ed31a..ad0ef0a6087a 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>>> 				       (error_code & X86_PF_INSTR), foreign))
>>> 		return 1;
>>>
>>> -	if (error_code & X86_PF_WRITE) {
>>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>>> +		/*
>>> +		 * CPUs are not expected to set the two error code bits
>>> +		 * together, but to ensure that hypervisors do not misbehave,
>>> +		 * run an additional sanity check.
>>> +		 */
>>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>>> +			WARN_ON_ONCE(1);
>>> +			return 1;
>>> +		}
>>
>> access_error() is only used on the do_user_addr_fault() side of things.
>> Can we stick this check somewhere that also works for kernel address
>> faults?  This is a generic sanity check.  It can also be in a separate
>> patch.
> 
> I can wrap it in a different function and also call it from
> do_kern_addr_fault() or spurious_kernel_fault().
> 
> Anyhow, spurious_kernel_fault() should handle spurious faults on
> executable code correctly. 

This is really about checking the sanity of the "hardware"-provided
error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
function like:

void check_error_code_sanity(unsigned long error_code)
{
	WARN_ON_ONCE(...);
}

You can leave the X86_PF_PK check in place for now.  It's probably going
away soon anyway.
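
Roughly, with the assertion filled in (the exact call site inside
handle_page_fault() is up to you):

	static void check_error_code_sanity(unsigned long error_code)
	{
		/*
		 * X86_PF_INSTR is for instruction fetches, and fetches
		 * never write, so X86_PF_WRITE should never be set with
		 * it.  If this fires, suspect a buggy hypervisor.
		 */
		WARN_ON_ONCE((error_code & (X86_PF_WRITE | X86_PF_INSTR)) ==
			     (X86_PF_WRITE | X86_PF_INSTR));
	}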

>> Also, we should *probably* stop talking about CPUs here.  If there's
>> ever something wonky with error code bits, I'd put my money on a weird
>> hypervisor before any kind of CPU issue.
> 
> I thought I managed to convey exactly that in the comment. Can you provide
> a better phrasing?

Maybe:

	/*
	 * X86_PF_INSTR for instruction _fetches_.  Fetches never write.
	 * X86_PF_WRITE should never be set with X86_PF_INSTR.
	 *
	 * This is most likely due to a buggy hypervisor.
	 */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 20:59       ` Dave Hansen
@ 2022-03-11 21:16         ` Nadav Amit
  2022-03-11 21:23           ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 21:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86



> On Mar 11, 2022, at 12:59 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 12:38, Nadav Amit wrote:
>>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> ...
>>> Can any sane code trigger this?
>> 
>> Well, regarding this question and the previous one: I do not think that
>> this scenario is possible today since mprotect() holds the mmap_lock
>> for write. There is no other code that I am aware of that toggles
>> the NX bit on a present entry.
>> 
>> But I will not bet my life on it. That’s the reason for the somewhat
>> vague phrasing that I used.
> 
> From the userspace perspective, mmap(MAP_FIXED) can do this too.  But,
> sane userspace can't rely on the syscall to have done any work and the
> TLB flushing is currently done before the syscall returns.
> 
> I'd put it this way:
> 
> 	Today, it is possible for a thread to end up in access_error()
> 	for a PF_INSN fault and observe a VM_EXEC VMA.  If you are
> 	generous, this could be considered a spurious fault.
> 
> 	However, the faulting thread would have had to race with the
> 	thread which was changing the PTE and the VMA and is currently
> 	*in* mprotect() (or some other syscall).  In other words, the
> 	faulting thread can encounter this situation, but it never had
> 	any assurance from the kernel that it wouldn't fault.  This is
> 	because the faulting thread never had a chance to observe the
> 	syscall return.
> 
> 	There is no evidence that the existing behavior can cause any
> 	issues with sane userspace.

Done. Thanks.

> 
>>>> index d0074c6ed31a..ad0ef0a6087a 100644
>>>> --- a/arch/x86/mm/fault.c
>>>> +++ b/arch/x86/mm/fault.c
>>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>>>> 				       (error_code & X86_PF_INSTR), foreign))
>>>> 		return 1;
>>>> 
>>>> -	if (error_code & X86_PF_WRITE) {
>>>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>>>> +		/*
>>>> +		 * CPUs are not expected to set the two error code bits
>>>> +		 * together, but to ensure that hypervisors do not misbehave,
>>>> +		 * run an additional sanity check.
>>>> +		 */
>>>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>>>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>>>> +			WARN_ON_ONCE(1);
>>>> +			return 1;
>>>> +		}
>>> 
>>> access_error() is only used on the do_user_addr_fault() side of things.
>>> Can we stick this check somewhere that also works for kernel address
>>> faults?  This is a generic sanity check.  It can also be in a separate
>>> patch.
>> 
>> I can wrap it in a different function and also call it from
>> do_kern_addr_fault() or spurious_kernel_fault().
>> 
>> Anyhow, spurious_kernel_fault() should handle spurious faults on
>> executable code correctly. 
> 
> This is really about checking the sanity of the "hardware"-provided
> error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
> function like:
> 
> void check_error_code_sanity(unsigned long error_code)
> {
> 	WARN_ON_ONCE(...);
> }
> 
> You can leave the X86_PF_PK check in place for now.  It's probably going
> away soon anyway.

Done. Thanks. But note that removing the check from access_error() means
that if the assertion is broken, userspace might crash inadvertently
(in contrast to the version I sent, which could have led to an infinite
stream of page-faults). I don’t know which behavior is better, so let’s
go with your version and just hope it doesn’t happen.

> 
>>> Also, we should *probably* stop talking about CPUs here.  If there's
>>> ever something wonky with error code bits, I'd put my money on a weird
>>> hypervisor before any kind of CPU issue.
>> 
>> I thought I managed to convey exactly that in the comment. Can you provide
>> a better phrasing?
> 
> Maybe:
> 
> 	/*
> 	 * X86_PF_INSTR for instruction _fetches_.  Fetches never write.
> 	 * X86_PF_WRITE should never be set with X86_PF_INSTR.
> 	 *
> 	 * This is most likely due to a buggy hypervisor.
> 	 */

Done, thank you.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault
  2022-03-11 21:16         ` Nadav Amit
@ 2022-03-11 21:23           ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2022-03-11 21:23 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin, x86

On 3/11/22 13:16, Nadav Amit wrote:
>> This is really about checking the sanity of the "hardware"-provided
>> error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
>> function like:
>>
>> void check_error_code_sanity(unsigned long error_code)
>> {
>> 	WARN_ON_ONCE(...);
>> }
>>
>> You can leave the X86_PF_PK check in place for now.  It's probably going
>> away soon anyway.
> Done. Thanks. But note that removing the check from access_error() means
> that if the assertion is broken, userspace might crash inadvertently
> (in contrast to the version I sent, which could have led to an infinite
> stream of page-faults). I don’t know which behavior is better, so let’s
> go with your version and just hope it doesn’t happen.

Actually, crashing sounds much nicer to me than infinite page faults.
It's a lot easier to debug, *especially* with a warning on dmesg.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion
  2022-03-11 19:07 ` [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion Nadav Amit
@ 2022-03-11 22:45   ` Nadav Amit
  0 siblings, 0 replies; 14+ messages in thread
From: Nadav Amit @ 2022-03-11 22:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Linux Kernel Mailing List, Andrew Morton, Andrea Arcangeli,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, x86


> On Mar 11, 2022, at 11:07 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> From: Nadav Amit <namit@vmware.com>
> 
> Currently, using either mprotect() or uffd to unprotect a memory
> region causes a TLB flush. At least on x86, when protection is
> promoted, no TLB flush is needed.
> 
> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> flush is needed based on the old PTE and the new one. Implement an x86
> pte_may_need_flush().
> 
> For x86, besides the simple logic that PTE protection promotion or
> changes of software bits does require a flush, also add logic that
> considers the dirty-bit. If the dirty-bit is clear and write-protect is
> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
> on write, and if the bit is clear, the PTE is reread.
> 
> 

[snip]

> + */
> +static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
> +{
> +	/* !PRESENT -> * ; no need for flush */
> +	if (!pte_present(oldpte))
> +		return false;
> +
> +	/* PRESENT -> !PRESENT ; needs flush */
> +	if (!pte_present(newpte))
> +		return true;
> +
> +	/* PFN changed ; needs flush */
> +	if (pte_pfn(oldpte) != pte_pfn(newpte))
> +		return true;
> +
> +	return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
> +				    _PAGE_ACCESSED);
> +}


Looking again at this patch, I think the PRESENT -> !PRESENT check is
not needed:

1. It might trigger unnecessary flushes for PROT_NONE (or NUMA).
2. A real PRESENT -> !PRESENT transition is already caught by
   pte_flags_need_flush().

Let me know if I am missing something. Otherwise I will change it
for v4 (a sketch of the simplified version is below).
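
The simplified version would then look something like this (a sketch
only; the final form depends on the above):

	static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
	{
		/* !PRESENT -> * ; no need for flush */
		if (!pte_present(oldpte))
			return false;

		/* PFN changed ; needs flush */
		if (pte_pfn(oldpte) != pte_pfn(newpte))
			return true;

		/*
		 * PRESENT -> !PRESENT (e.g., PROT_NONE) is decided by the
		 * flags comparison rather than an unconditional flush.
		 */
		return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
					    _PAGE_ACCESSED);
	}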


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-03-11 23:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-11 19:07 [RESEND PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes Nadav Amit
2022-03-11 19:07 ` [RESEND PATCH v3 1/5] x86: Detection of Knights Landing A/D leak Nadav Amit
2022-03-11 19:07 ` [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault Nadav Amit
2022-03-11 19:41   ` Dave Hansen
2022-03-11 20:38     ` Nadav Amit
2022-03-11 20:59       ` Dave Hansen
2022-03-11 21:16         ` Nadav Amit
2022-03-11 21:23           ` Dave Hansen
2022-03-11 19:07 ` [RESEND PATCH v3 3/5] mm/mprotect: use mmu_gather Nadav Amit
2022-03-11 19:07 ` [RESEND PATCH v3 4/5] mm/mprotect: do not flush on permission promotion Nadav Amit
2022-03-11 22:45   ` Nadav Amit
2022-03-11 19:07 ` [RESEND PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd() Nadav Amit
2022-03-11 20:41   ` Dave Hansen
2022-03-11 20:53     ` Nadav Amit
