All of lore.kernel.org
* [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit
@ 2021-02-10  2:33 Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
                   ` (14 more replies)
  0 siblings, 15 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski

This series is a whole bunch of page fault cleanups, plus a couple
of OOPS diagnostic improvements.  The overall goals are to clean up
handling of the faulting CPL, the USER bit in the error_code, and
the log messages generated by #PF OOPSes.

This series can also be seen as CET preparation.  CET introduces the
WRUSS instruction, which is the very first way for CPL 0 code to
cause a #PF fault with the USER bit set.  Let's get the page fault
code into shape before we start using WRUSS :)
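To make the USER-bit-vs-CPL distinction concrete, here is a standalone C sketch (not kernel code; only the X86_PF_USER value is the architectural bit, everything else is illustrative).  Before WRUSS, the USER bit in the #PF error code always agreed with the CPL the fault came from; WRUSS breaks that assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural #PF error code bit: set when the access carried
 * user privilege, regardless of the CPL the CPU was running at. */
#define X86_PF_USER 0x4UL

/*
 * Historically the USER bit and the faulting CPL always agreed.
 * WRUSS is a CPL-0 instruction that performs a user-privilege write,
 * so it faults with X86_PF_USER set while the CPU is in kernel mode.
 */
static bool user_bit_matches_cpl(unsigned long error_code, bool from_cpl3)
{
	return !!(error_code & X86_PF_USER) == from_cpl3;
}
```

Any code path that used the USER bit as a proxy for "the fault came from user mode" silently stops being correct once WRUSS faults exist.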

Changes from v1:
 - Various changelog improvements.
 - Reorder patches (SMAP moved after SMEP)
 - Add the efi_recover_from_page_fault() patch
 - Tidy up and improve the AMD erratum detection code

Andy Lutomirski (14):
  x86/fault: Fix AMD erratum #91 errata fixup for user code
  x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs
  x86/fault: Fold mm_fault_error() into do_user_addr_fault()
  x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()
  x86/fault: Document the locking in the fault_signal_pending() path
  x86/fault: Correct a few user vs kernel checks wrt WRUSS
  x86/fault: Improve kernel-executing-user-memory handling
  x86/fault: Skip erratum #93 workaround on new CPUs
  x86/fault: Split the OOPS code out from no_context()
  x86/fault: Bypass no_context() for implicit kernel faults from
    usermode
  x86/fault: Rename no_context() to kernelmode_fixup_or_oops()
  x86/fault: Don't look for extable entries for SMEP violations
  x86/fault: Don't run fixups for SMAP violations
  x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault()

 arch/x86/include/asm/efi.h     |   2 +-
 arch/x86/mm/fault.c            | 380 +++++++++++++++++++--------------
 arch/x86/platform/efi/quirks.c |  16 +-
 3 files changed, 227 insertions(+), 171 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10  7:34   ` Christoph Hellwig
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 02/14] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs Andy Lutomirski
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, stable,
	Peter Zijlstra, Christoph Hellwig

The recent rework of probe_kernel_address() and its conversion to
get_kernel_nofault() inadvertently broke is_prefetch(). Before this change,
probe_kernel_address() was used as a sloppy "read user or kernel memory"
helper, but it doesn't do that any more.  The new get_kernel_nofault()
reads *kernel* memory only, which completely broke is_prefetch() for user
access.

Adjust the code to use the correct accessor based on the access mode.  As a
bonus, the manual address bounds check is no longer necessary: the accessor
helpers (get_user() / get_kernel_nofault()) do the right thing all by
themselves.
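The shape of the fix can be sketched in standalone C (the helpers below are stand-ins for get_user()/get_kernel_nofault(), with made-up address ranges and contents; only the dispatch-on-mode structure mirrors the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel helpers: each can only "see" its own
 * address space, which is exactly why is_prefetch() broke when the
 * kernel-only helper was used for user addresses. */
static unsigned char user_mem[16] = { 0x0f, 0x0d };	/* user code bytes */
static unsigned char kern_mem[16] = { 0x90 };		/* kernel code bytes */

static int get_user_stub(unsigned char *dst, unsigned long addr)
{
	if (addr >= 16)
		return -1;	/* "fault": not a user address */
	*dst = user_mem[addr];
	return 0;
}

static int get_kernel_nofault_stub(unsigned char *dst, unsigned long addr)
{
	if (addr < 16 || addr >= 32)
		return -1;	/* "fault": not a kernel address */
	*dst = kern_mem[addr - 16];
	return 0;
}

/* The fix: dispatch on the mode of the faulting context instead of
 * unconditionally using the kernel-only accessor. */
static int fetch_opcode(bool from_user, unsigned long addr, unsigned char *op)
{
	return from_user ? get_user_stub(op, addr)
			 : get_kernel_nofault_stub(op, addr);
}
```

With the pre-fix behavior (always the kernel accessor), every user-address fetch fails and is_prefetch() can never recognize a user-mode prefetch.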

Fixes: eab0c6089b68 ("maccess: unify the probe kernel arch hooks")
Cc: stable@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..441c3e9b8971 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -54,7 +54,7 @@ kmmio_fault(struct pt_regs *regs, unsigned long addr)
  * 32-bit mode:
  *
  *   Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
- *   Check that here and ignore it.
+ *   Check that here and ignore it.  This is AMD erratum #91.
  *
  * 64-bit mode:
  *
@@ -83,11 +83,7 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
 #ifdef CONFIG_X86_64
 	case 0x40:
 		/*
-		 * In AMD64 long mode 0x40..0x4F are valid REX prefixes
-		 * Need to figure out under what instruction mode the
-		 * instruction was issued. Could check the LDT for lm,
-		 * but for now it's good enough to assume that long
-		 * mode only uses well known segments or kernel.
+		 * In 64-bit mode 0x40..0x4F are valid REX prefixes
 		 */
 		return (!user_mode(regs) || user_64bit_mode(regs));
 #endif
@@ -127,20 +123,31 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	instr = (void *)convert_ip_to_linear(current, regs);
 	max_instr = instr + 15;
 
-	if (user_mode(regs) && instr >= (unsigned char *)TASK_SIZE_MAX)
-		return 0;
+	/*
+	 * This code has historically always bailed out if IP points to a
+	 * not-present page (e.g. due to a race).  No one has ever
+	 * complained about this.
+	 */
+	pagefault_disable();
 
 	while (instr < max_instr) {
 		unsigned char opcode;
 
-		if (get_kernel_nofault(opcode, instr))
-			break;
+		if (user_mode(regs)) {
+			if (get_user(opcode, instr))
+				break;
+		} else {
+			if (get_kernel_nofault(opcode, instr))
+				break;
+		}
 
 		instr++;
 
 		if (!check_prefetch_opcode(regs, instr, opcode, &prefetch))
 			break;
 	}
+
+	pagefault_enable();
 	return prefetch;
 }
 
-- 
2.29.2



* [PATCH v2 02/14] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 03/14] x86/fault: Fold mm_fault_error() into do_user_addr_fault() Andy Lutomirski
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski

According to the Revision Guide for AMD Athlon™ 64 and AMD Opteron™
Processors, only early revisions of family 0xF are affected.  This will
avoid unnecessarily fetching instruction bytes before sending SIGSEGV to
user programs.
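The new check can be sketched as a plain predicate (illustrative only; the real kernel version reads boot_cpu_data and is additionally guarded by CONFIG_CPU_SUP_AMD):

```c
#include <assert.h>
#include <stdbool.h>

/* Erratum #91 only affects AMD K8 (family 0xF) parts before the
 * NPT-capable revisions, which the revision guide identifies as
 * models below 0x40. */
static bool is_amd_k8_pre_npt(bool vendor_is_amd, unsigned int family,
			      unsigned int model)
{
	return vendor_is_amd && family == 0xf && model < 0x40;
}
```

On everything else, is_prefetch() can bail out immediately instead of fetching and decoding instruction bytes on every SIGSEGV.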

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 441c3e9b8971..818902b08c52 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -106,6 +106,15 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
 	}
 }
 
+static bool is_amd_k8_pre_npt(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	return unlikely(IS_ENABLED(CONFIG_CPU_SUP_AMD) &&
+			c->x86_vendor == X86_VENDOR_AMD &&
+			c->x86 == 0xf && c->x86_model < 0x40);
+}
+
 static int
 is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 {
@@ -113,6 +122,10 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	unsigned char *instr;
 	int prefetch = 0;
 
+	/* Erratum #91 affects AMD K8, pre-NPT CPUs */
+	if (!is_amd_k8_pre_npt())
+		return 0;
+
 	/*
 	 * If it was a exec (instruction fetch) fault on NX page, then
 	 * do not ignore the fault:
-- 
2.29.2



* [PATCH v2 03/14] x86/fault: Fold mm_fault_error() into do_user_addr_fault()
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 02/14] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 04/14] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault() Andy Lutomirski
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

mm_fault_error() is logically just the end of do_user_addr_fault().
Combine the functions.  This makes the code easier to read.

Most of the churn here is from renaming hw_error_code to error_code in
do_user_addr_fault().

This makes no difference at all to the generated code (objdump -dr) as
compared to changing noinline to __always_inline in the definition of
mm_fault_error().

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 97 +++++++++++++++++++++------------------------
 1 file changed, 45 insertions(+), 52 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 818902b08c52..91cf7a672c04 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -981,40 +981,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, vm_fault_t fault)
-{
-	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
-		return;
-	}
-
-	if (fault & VM_FAULT_OOM) {
-		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & X86_PF_USER)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
-			return;
-		}
-
-		/*
-		 * We ran out of memory, call the OOM killer, and return the
-		 * userspace (which will retry the fault, or kill us if we got
-		 * oom-killed):
-		 */
-		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
-	}
-}
-
 static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
 {
 	if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
@@ -1252,7 +1218,7 @@ NOKPROBE_SYMBOL(do_kern_addr_fault);
 /* Handle faults in the user portion of the address space */
 static inline
 void do_user_addr_fault(struct pt_regs *regs,
-			unsigned long hw_error_code,
+			unsigned long error_code,
 			unsigned long address)
 {
 	struct vm_area_struct *vma;
@@ -1272,8 +1238,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * Reserved bits are never expected to be set on
 	 * entries in the user portion of the page tables.
 	 */
-	if (unlikely(hw_error_code & X86_PF_RSVD))
-		pgtable_bad(regs, hw_error_code, address);
+	if (unlikely(error_code & X86_PF_RSVD))
+		pgtable_bad(regs, error_code, address);
 
 	/*
 	 * If SMAP is on, check for invalid kernel (supervisor) access to user
@@ -1283,10 +1249,10 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * enforcement appears to be consistent with the USER bit.
 	 */
 	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
-		     !(hw_error_code & X86_PF_USER) &&
+		     !(error_code & X86_PF_USER) &&
 		     !(regs->flags & X86_EFLAGS_AC)))
 	{
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
 
@@ -1295,7 +1261,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
 
@@ -1316,9 +1282,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-	if (hw_error_code & X86_PF_WRITE)
+	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
-	if (hw_error_code & X86_PF_INSTR)
+	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
 #ifdef CONFIG_X86_64
@@ -1334,7 +1300,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * to consider the PF_PK bit.
 	 */
 	if (is_vsyscall_vaddr(address)) {
-		if (emulate_vsyscall(hw_error_code, regs, address))
+		if (emulate_vsyscall(error_code, regs, address))
 			return;
 	}
 #endif
@@ -1357,7 +1323,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			 * Fault from code in kernel from
 			 * which we do not expect faults.
 			 */
-			bad_area_nosemaphore(regs, hw_error_code, address);
+			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
 retry:
@@ -1373,17 +1339,17 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 	if (likely(vma->vm_start <= address))
 		goto good_area;
 	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 
@@ -1392,8 +1358,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * we can handle it..
 	 */
 good_area:
-	if (unlikely(access_error(hw_error_code, vma))) {
-		bad_area_access_error(regs, hw_error_code, address, vma);
+	if (unlikely(access_error(error_code, vma))) {
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1415,7 +1381,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
-			no_context(regs, hw_error_code, address, SIGBUS,
+			no_context(regs, error_code, address, SIGBUS,
 				   BUS_ADRERR);
 		return;
 	}
@@ -1432,11 +1398,38 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	mmap_read_unlock(mm);
-	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, hw_error_code, address, fault);
+	if (likely(!(fault & VM_FAULT_ERROR)))
+		return;
+
+	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
+		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
 
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!(error_code & X86_PF_USER)) {
+			no_context(regs, error_code, address,
+				   SIGSEGV, SEGV_MAPERR);
+			return;
+		}
+
+		/*
+		 * We ran out of memory, call the OOM killer, and return the
+		 * userspace (which will retry the fault, or kill us if we got
+		 * oom-killed):
+		 */
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area_nosemaphore(regs, error_code, address);
+		else
+			BUG();
+	}
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
-- 
2.29.2



* [PATCH v2 04/14] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (2 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 03/14] x86/fault: Fold mm_fault_error() into do_user_addr_fault() Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 05/14] x86/fault: Document the locking in the fault_signal_pending() path Andy Lutomirski
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

bad_area() and its relatives are called from many places in fault.c, and
exactly one of them wants the F00F workaround.

__bad_area_nosemaphore() no longer contains any kernel fault code, which
prepares for further cleanups.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 91cf7a672c04..3ffed003f281 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -482,10 +482,12 @@ static int is_errata100(struct pt_regs *regs, unsigned long address)
 }
 
 /* Pentium F0 0F C7 C8 bug workaround: */
-static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
+static int is_f00f_bug(struct pt_regs *regs, unsigned long error_code,
+		       unsigned long address)
 {
 #ifdef CONFIG_X86_F00F_BUG
-	if (boot_cpu_has_bug(X86_BUG_F00F) && idt_is_f00f_address(address)) {
+	if (boot_cpu_has_bug(X86_BUG_F00F) && !(error_code & X86_PF_USER) &&
+	    idt_is_f00f_address(address)) {
 		handle_invalid_op(regs);
 		return 1;
 	}
@@ -853,9 +855,6 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
-	if (is_f00f_bug(regs, address))
-		return;
-
 	no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
@@ -1195,6 +1194,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	}
 #endif
 
+	if (is_f00f_bug(regs, hw_error_code, address))
+		return;
+
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
-- 
2.29.2



* [PATCH v2 05/14] x86/fault: Document the locking in the fault_signal_pending() path
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (3 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 04/14] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault() Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 06/14] x86/fault: Correct a few user vs kernel checks wrt WRUSS Andy Lutomirski
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

If fault_signal_pending() returns true, then the core mm has unlocked the
mm for us.  Add a comment to help future readers of this code.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3ffed003f281..013910b7b93f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1380,8 +1380,11 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	fault = handle_mm_fault(vma, address, flags, regs);
 
-	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
+		/*
+		 * Quick path to respond to signals.  The core mm code
+		 * has unlocked the mm for us if we get here.
+		 */
 		if (!user_mode(regs))
 			no_context(regs, error_code, address, SIGBUS,
 				   BUS_ADRERR);
-- 
2.29.2



* [PATCH v2 06/14] x86/fault: Correct a few user vs kernel checks wrt WRUSS
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (4 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 05/14] x86/fault: Document the locking in the fault_signal_pending() path Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 07/14] x86/fault: Improve kernel-executing-user-memory handling Andy Lutomirski
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

In general, page fault errors for WRUSS should be just like get_user(),
etc.  Fix three bugs in this area:

There is a comment that says that, if the kernel can't handle a page fault
on a user address due to OOM, the OOM-kill-and-retry logic would be
skipped.  The code checked kernel *privilege*, not kernel *mode*, so it
missed WRUSS.  This means that the kernel would malfunction if it got OOM
on a WRUSS fault -- this would be a kernel-mode, user-privilege fault, and
the OOM killer would be invoked and the handler would retry the faulting
instruction.

A failed user access from kernel while a fatal signal is pending should
fail even if the instruction in question was WRUSS.

do_sigbus() should not send SIGBUS for WRUSS -- it should handle it like
any other kernel mode failure.
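The distinction being enforced can be shown in a standalone sketch (struct fake_regs and the helper name are invented for illustration; only the X86_PF_USER value is architectural): a WRUSS fault is kernel-mode but user-privilege, so kernel-vs-user decisions must key off the faulting CPL, not the error code bit.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_USER 0x4UL

struct fake_regs {
	bool from_cpl3;		/* what user_mode(regs) would report */
};

/* Correct predicate: ask which mode the CPU was in when it faulted,
 * not which privilege the access carried.  For WRUSS these differ. */
static bool should_run_kernel_fixup(const struct fake_regs *regs,
				    unsigned long error_code)
{
	(void)error_code;	/* deliberately ignored: this was the bug */
	return !regs->from_cpl3;
}
```

The buggy version tested `!(error_code & X86_PF_USER)` instead, which is false for a WRUSS fault even though the kernel is the one that needs the fixup.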

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 013910b7b93f..b1104844260d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -945,7 +945,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	  vm_fault_t fault)
 {
 	/* Kernel mode? Handle exceptions or die: */
-	if (!(error_code & X86_PF_USER)) {
+	if (!user_mode(regs)) {
 		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
 		return;
 	}
@@ -1217,7 +1217,14 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
-/* Handle faults in the user portion of the address space */
+/*
+ * Handle faults in the user portion of the address space.  Nothing in here
+ * should check X86_PF_USER without a specific justification: for almost
+ * all purposes, we should treat a normal kernel access to user memory
+ * (e.g. get_user(), put_user(), etc.) the same as the WRUSS instruction.
+ * The one exception is AC flag handling, which is, per the x86
+ * architecture, special for WRUSS.
+ */
 static inline
 void do_user_addr_fault(struct pt_regs *regs,
 			unsigned long error_code,
@@ -1406,14 +1413,14 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
-	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
+	if (fatal_signal_pending(current) && !user_mode(regs)) {
 		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & X86_PF_USER)) {
+		if (!user_mode(regs)) {
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
 			return;
-- 
2.29.2



* [PATCH v2 07/14] x86/fault: Improve kernel-executing-user-memory handling
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (5 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 06/14] x86/fault: Correct a few user vs kernel checks wrt WRUSS Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs Andy Lutomirski
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

Right now we treat the case of the kernel trying to execute from user
memory more or less just like the kernel getting a page fault on a user
access.  In the failure path, we check for erratum #93, try to otherwise
fix up the error, and then oops.

If we manage to jump to the user address space, with or without SMEP, we
should not try to resolve the page fault.  This is an error, pure and
simple.  Rearrange the code so that we catch this case early, check for
erratum #93, and bail out.
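The early-bailout condition can be illustrated with the architectural error-code bits (the helper name is invented; the bit values are the real X86_PF_USER and X86_PF_INSTR):

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_USER	0x4UL	/* access carried user privilege */
#define X86_PF_INSTR	0x10UL	/* fault was an instruction fetch */

/* An instruction fetch whose error code lacks X86_PF_USER means
 * CPL-0 code jumped into user memory: unrecoverable, so don't
 * bother walking the VMA tree. */
static bool kernel_exec_of_user_memory(unsigned long error_code)
{
	return (error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR;
}
```

Catching this combination up front (after the erratum #93 check) keeps the rest of do_user_addr_fault() free of this corner case.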

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b1104844260d..cbb1a9754473 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -447,6 +447,9 @@ static int is_errata93(struct pt_regs *regs, unsigned long address)
 	    || boot_cpu_data.x86 != 0xf)
 		return 0;
 
+	if (user_mode(regs))
+		return 0;
+
 	if (address != regs->ip)
 		return 0;
 
@@ -744,9 +747,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	if (is_prefetch(regs, error_code, address))
 		return;
 
-	if (is_errata93(regs, address))
-		return;
-
 	/*
 	 * Buggy firmware could access regions which might page fault, try to
 	 * recover from such faults.
@@ -1239,6 +1239,21 @@ void do_user_addr_fault(struct pt_regs *regs,
 	tsk = current;
 	mm = tsk->mm;
 
+	if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR)) {
+		/*
+		 * Whoops, this is kernel mode code trying to execute from
+		 * user memory.  Unless this is AMD erratum #93, which
+		 * corrupts RIP such that it looks like a user address,
+		 * this is unrecoverable.  Don't even try to look up the
+		 * VMA.
+		 */
+		if (is_errata93(regs, address))
+			return;
+
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF)))
 		return;
-- 
2.29.2



* [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (6 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 07/14] x86/fault: Improve kernel-executing-user-memory handling Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10  6:06   ` Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 09/14] x86/fault: Split the OOPS code out from no_context() Andy Lutomirski
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski

Erratum #93 applies to the first generation of AMD K8 CPUs.  Skip the
workaround on newer CPUs.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index cbb1a9754473..3fe2f4800b69 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -442,9 +442,8 @@ static void dump_pagetable(unsigned long address)
  */
 static int is_errata93(struct pt_regs *regs, unsigned long address)
 {
-#if defined(CONFIG_X86_64) && defined(CONFIG_CPU_SUP_AMD)
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD
-	    || boot_cpu_data.x86 != 0xf)
+#if defined(CONFIG_X86_64)
+	if (!is_amd_k8_pre_npt())
 		return 0;
 
 	if (user_mode(regs))
-- 
2.29.2



* [PATCH v2 09/14] x86/fault: Split the OOPS code out from no_context()
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (7 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 10/14] x86/fault: Bypass no_context() for implicit kernel faults from usermode Andy Lutomirski
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

Not all callers of no_context() want to run exception fixups.
Separate the OOPS code out from the fixup code in no_context().

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 116 +++++++++++++++++++++++---------------------
 1 file changed, 62 insertions(+), 54 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3fe2f4800b69..8b8bd0a4f4b2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -654,53 +654,20 @@ static void set_signal_archinfo(unsigned long address,
 }
 
 static noinline void
-no_context(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int signal, int si_code)
+page_fault_oops(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address)
 {
-	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
 
 	if (user_mode(regs)) {
 		/*
-		 * This is an implicit supervisor-mode access from user
-		 * mode.  Bypass all the kernel-mode recovery code and just
-		 * OOPS.
+		 * Implicit kernel access from user mode?  Skip the stack
+		 * overflow and EFI special cases.
 		 */
 		goto oops;
 	}
 
-	/* Are we prepared to handle this kernel fault? */
-	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
-		/*
-		 * Any interrupt that takes a fault gets the fixup. This makes
-		 * the below recursive fault logic only apply to a faults from
-		 * task context.
-		 */
-		if (in_interrupt())
-			return;
-
-		/*
-		 * Per the above we're !in_interrupt(), aka. task context.
-		 *
-		 * In this case we need to make sure we're not recursively
-		 * faulting through the emulate_vsyscall() logic.
-		 */
-		if (current->thread.sig_on_uaccess_err && signal) {
-			sanitize_error_code(address, &error_code);
-
-			set_signal_archinfo(address, error_code);
-
-			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_fault(signal, si_code, (void __user *)address);
-		}
-
-		/*
-		 * Barring that, we can do the fixup and be happy.
-		 */
-		return;
-	}
-
 #ifdef CONFIG_VMAP_STACK
 	/*
 	 * Stack overflow?  During boot, we can fault near the initial
@@ -708,8 +675,8 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	 * that we're in vmalloc space to avoid this.
 	 */
 	if (is_vmalloc_addr((void *)address) &&
-	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
-	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+	    (((unsigned long)current->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)current->stack + THREAD_SIZE) < PAGE_SIZE)) {
 		unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
 		/*
 		 * We're likely to be running with very little stack space
@@ -732,20 +699,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	}
 #endif
 
-	/*
-	 * 32-bit:
-	 *
-	 *   Valid to do another page fault here, because if this fault
-	 *   had been triggered by is_prefetch fixup_exception would have
-	 *   handled it.
-	 *
-	 * 64-bit:
-	 *
-	 *   Hall of shame of CPU/BIOS bugs.
-	 */
-	if (is_prefetch(regs, error_code, address))
-		return;
-
 	/*
 	 * Buggy firmware could access regions which might page fault, try to
 	 * recover from such faults.
@@ -762,7 +715,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 
 	show_fault_oops(regs, error_code, address);
 
-	if (task_stack_end_corrupted(tsk))
+	if (task_stack_end_corrupted(current))
 		printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
 
 	sig = SIGKILL;
@@ -775,6 +728,61 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	oops_end(flags, regs, sig);
 }
 
+static noinline void
+no_context(struct pt_regs *regs, unsigned long error_code,
+	   unsigned long address, int signal, int si_code)
+{
+	if (user_mode(regs)) {
+		/*
+		 * This is an implicit supervisor-mode access from user
+		 * mode.  Bypass all the kernel-mode recovery code and just
+		 * OOPS.
+		 */
+		goto oops;
+	}
+
+	/* Are we prepared to handle this kernel fault? */
+	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
+		/*
+		 * Any interrupt that takes a fault gets the fixup. This makes
+		 * the below recursive fault logic only apply to a faults from
+		 * task context.
+		 */
+		if (in_interrupt())
+			return;
+
+		/*
+		 * Per the above we're !in_interrupt(), aka. task context.
+		 *
+		 * In this case we need to make sure we're not recursively
+		 * faulting through the emulate_vsyscall() logic.
+		 */
+		if (current->thread.sig_on_uaccess_err && signal) {
+			sanitize_error_code(address, &error_code);
+
+			set_signal_archinfo(address, error_code);
+
+			/* XXX: hwpoison faults will set the wrong code. */
+			force_sig_fault(signal, si_code, (void __user *)address);
+		}
+
+		/*
+		 * Barring that, we can do the fixup and be happy.
+		 */
+		return;
+	}
+
+	/*
+	 * AMD erratum #91 manifests as a spurious page fault on a PREFETCH
+	 * instruction.
+	 */
+	if (is_prefetch(regs, error_code, address))
+		return;
+
+oops:
+	page_fault_oops(regs, error_code, address);
+}
+
 /*
  * Print out info about fatal segfaults, if the show_unhandled_signals
  * sysctl is set:
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 10/14] x86/fault: Bypass no_context() for implicit kernel faults from usermode
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (8 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 09/14] x86/fault: Split the OOPS code out from no_context() Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 11/14] x86/fault: Rename no_context() to kernelmode_fixup_or_oops() Andy Lutomirski
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

We can drop an indentation level and remove the last
user_mode(regs) == true caller of no_context() by directly OOPSing for
implicit kernel faults from usermode.
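For reference, the hardware #PF error code bits involved here use the values from arch/x86/include/asm/trap_pf.h; the helper below is an illustrative userspace model of the "implicit supervisor access from user mode" test, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* x86 #PF hardware error code bits (Intel SDM / AMD APM). */
#define X86_PF_PROT	(1u << 0)	/* fault on a present page */
#define X86_PF_WRITE	(1u << 1)	/* write access */
#define X86_PF_USER	(1u << 2)	/* user-privilege access; implicit
					 * supervisor accesses (e.g. GDT/LDT
					 * reads) from CPL 3 leave it clear */
#define X86_PF_RSVD	(1u << 3)	/* reserved PTE bit set */
#define X86_PF_INSTR	(1u << 4)	/* instruction fetch */

/* The case this patch singles out for a direct OOPS: the CPU was in
 * user mode (user_mode(regs)), but the faulting access itself was an
 * implicit supervisor access, so X86_PF_USER is clear. */
static bool implicit_supervisor_from_user(bool user_mode_regs,
					  unsigned int error_code)
{
	return user_mode_regs && !(error_code & X86_PF_USER);
}
```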

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 59 ++++++++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8b8bd0a4f4b2..f735639455a5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -825,44 +825,49 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 {
 	struct task_struct *tsk = current;
 
-	/* User mode accesses just cause a SIGSEGV */
-	if (user_mode(regs) && (error_code & X86_PF_USER)) {
-		/*
-		 * It's possible to have interrupts off here:
-		 */
-		local_irq_enable();
+	if (!user_mode(regs)) {
+		no_context(regs, error_code, address, pkey, si_code);
+		return;
+	}
 
-		/*
-		 * Valid to do another page fault here because this one came
-		 * from user space:
-		 */
-		if (is_prefetch(regs, error_code, address))
-			return;
+	if (!(error_code & X86_PF_USER)) {
+		/* Implicit user access to kernel memory -- just oops */
+		page_fault_oops(regs, error_code, address);
+		return;
+	}
 
-		if (is_errata100(regs, address))
-			return;
+	/*
+	 * User mode accesses just cause a SIGSEGV.
+	 * It's possible to have interrupts off here:
+	 */
+	local_irq_enable();
 
-		sanitize_error_code(address, &error_code);
+	/*
+	 * Valid to do another page fault here because this one came
+	 * from user space:
+	 */
+	if (is_prefetch(regs, error_code, address))
+		return;
 
-		if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
-			return;
+	if (is_errata100(regs, address))
+		return;
 
-		if (likely(show_unhandled_signals))
-			show_signal_msg(regs, error_code, address, tsk);
+	sanitize_error_code(address, &error_code);
 
-		set_signal_archinfo(address, error_code);
+	if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
+		return;
 
-		if (si_code == SEGV_PKUERR)
-			force_sig_pkuerr((void __user *)address, pkey);
+	if (likely(show_unhandled_signals))
+		show_signal_msg(regs, error_code, address, tsk);
 
-		force_sig_fault(SIGSEGV, si_code, (void __user *)address);
+	set_signal_archinfo(address, error_code);
 
-		local_irq_disable();
+	if (si_code == SEGV_PKUERR)
+		force_sig_pkuerr((void __user *)address, pkey);
 
-		return;
-	}
+	force_sig_fault(SIGSEGV, si_code, (void __user *)address);
 
-	no_context(regs, error_code, address, SIGSEGV, si_code);
+	local_irq_disable();
 }
 
 static noinline void
-- 
2.29.2



* [PATCH v2 11/14] x86/fault: Rename no_context() to kernelmode_fixup_or_oops()
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (9 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 10/14] x86/fault: Bypass no_context() for implicit kernel faults from usermode Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 12/14] x86/fault: Don't look for extable entries for SMEP violations Andy Lutomirski
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

The name no_context() has never been very clear.  It's only called for
faults from kernel mode, so rename it and change the no-longer-useful
user_mode(regs) check to a WARN_ON_ONCE.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f735639455a5..9fb636b2a3da 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -729,17 +729,10 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 }
 
 static noinline void
-no_context(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int signal, int si_code)
+kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
+			 unsigned long address, int signal, int si_code)
 {
-	if (user_mode(regs)) {
-		/*
-		 * This is an implicit supervisor-mode access from user
-		 * mode.  Bypass all the kernel-mode recovery code and just
-		 * OOPS.
-		 */
-		goto oops;
-	}
+	WARN_ON_ONCE(user_mode(regs));
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
@@ -779,7 +772,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	if (is_prefetch(regs, error_code, address))
 		return;
 
-oops:
 	page_fault_oops(regs, error_code, address);
 }
 
@@ -826,7 +818,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk = current;
 
 	if (!user_mode(regs)) {
-		no_context(regs, error_code, address, pkey, si_code);
+		kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code);
 		return;
 	}
 
@@ -958,7 +950,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
 	/* Kernel mode? Handle exceptions or die: */
 	if (!user_mode(regs)) {
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+		kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, BUS_ADRERR);
 		return;
 	}
 
@@ -1420,8 +1412,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 		 * has unlocked the mm for us if we get here.
 		 */
 		if (!user_mode(regs))
-			no_context(regs, error_code, address, SIGBUS,
-				   BUS_ADRERR);
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR);
 		return;
 	}
 
@@ -1441,15 +1433,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 
 	if (fatal_signal_pending(current) && !user_mode(regs)) {
-		no_context(regs, error_code, address, 0, 0);
+		kernelmode_fixup_or_oops(regs, error_code, address, 0, 0);
 		return;
 	}
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!user_mode(regs)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGSEGV, SEGV_MAPERR);
 			return;
 		}
 
-- 
2.29.2



* [PATCH v2 12/14] x86/fault: Don't look for extable entries for SMEP violations
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (10 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 11/14] x86/fault: Rename no_context() to kernelmode_fixup_or_oops() Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 13/14] x86/fault: Don't run fixups for SMAP violations Andy Lutomirski
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

If we get a SMEP violation or a fault that would have been a SMEP
violation if we had SMEP, we shouldn't run fixups.  Just OOPS.
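The condition being routed straight to the OOPS path is a kernel-mode instruction fetch from a user address. A minimal userspace model of the error-code half of that test (bit values as in the kernel headers; in do_user_addr_fault() the address is already known to be a user address):

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_USER	(1u << 2)	/* user-privilege access */
#define X86_PF_INSTR	(1u << 4)	/* instruction fetch */

/* A kernel-privilege instruction fetch from a user address: either a
 * real SMEP violation, or what would have been one on SMEP hardware.
 * Illustrative model only, not the kernel code. */
static bool kernel_exec_of_user_memory(unsigned int error_code)
{
	return !(error_code & X86_PF_USER) && (error_code & X86_PF_INSTR);
}
```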

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9fb636b2a3da..466415bdf58c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1249,12 +1249,12 @@ void do_user_addr_fault(struct pt_regs *regs,
 		 * user memory.  Unless this is AMD erratum #93, which
 		 * corrupts RIP such that it looks like a user address,
 		 * this is unrecoverable.  Don't even try to look up the
-		 * VMA.
+		 * VMA or look for extable entries.
 		 */
 		if (is_errata93(regs, address))
 			return;
 
-		bad_area_nosemaphore(regs, error_code, address);
+		page_fault_oops(regs, error_code, address);
 		return;
 	}
 
-- 
2.29.2



* [PATCH v2 13/14] x86/fault: Don't run fixups for SMAP violations
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (11 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 12/14] x86/fault: Don't look for extable entries for SMEP violations Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  2021-02-10  2:33 ` [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault() Andy Lutomirski
  2021-02-10 13:57 ` [RFC][PATCH] kprobes: Remove kprobe::fault_handler Peter Zijlstra
  14 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra

A SMAP-violating kernel access is not a recoverable condition.  Imagine
kernel code that, outside of a uaccess region, dereferences a pointer to
the user range by accident.  If SMAP is on, this reliably generates a
fault, which was previously fixed up as though it were an intentional
user access.  This makes it easy for bugs to be overlooked if code is
inadequately tested both with and without SMAP.

We discovered this because BPF can generate invalid accesses to user
memory, but those warnings only got printed if SMAP was off.  With this
patch, this type of error will be discovered with SMAP on as well.
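The three-part condition the patch routes to page_fault_oops() can be modeled in userspace as follows (bit values per the kernel headers; the helper itself is illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_USER	(1u << 2)	/* user-privilege access */
#define X86_EFLAGS_AC	(1ul << 18)	/* alignment check / SMAP override */

/* SMAP violation: SMAP is enabled, the access was kernel-privilege
 * (no X86_PF_USER), and EFLAGS.AC was clear, i.e. we were not inside
 * a STAC/CLAC uaccess region.  Such an access is a kernel bug, so the
 * patch OOPSes instead of running extable fixups. */
static bool smap_violation(bool smap_enabled, unsigned int error_code,
			   unsigned long eflags)
{
	return smap_enabled &&
	       !(error_code & X86_PF_USER) &&
	       !(eflags & X86_EFLAGS_AC);
}
```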

Cc: Yonghong Song <yhs@fb.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/fault.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 466415bdf58c..eed217d4a877 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1278,9 +1278,12 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
 		     !(error_code & X86_PF_USER) &&
-		     !(regs->flags & X86_EFLAGS_AC)))
-	{
-		bad_area_nosemaphore(regs, error_code, address);
+		     !(regs->flags & X86_EFLAGS_AC))) {
+		/*
+		 * No extable entry here.  This was a kernel access to an
+		 * invalid pointer.  get_kernel_nofault() will not get here.
+		 */
+		page_fault_oops(regs, error_code, address);
 		return;
 	}
 
-- 
2.29.2



* [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault()
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (12 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 13/14] x86/fault: Don't run fixups for SMAP violations Andy Lutomirski
@ 2021-02-10  2:33 ` Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] x86/{fault,efi}: " tip-bot2 for Andy Lutomirski
  2021-02-11  8:38   ` [PATCH v2 14/14] x86/fault, x86/efi: " Ard Biesheuvel
  2021-02-10 13:57 ` [RFC][PATCH] kprobes: Remove kprobe::fault_handler Peter Zijlstra
  14 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  2:33 UTC (permalink / raw)
  To: x86
  Cc: LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Andy Lutomirski, Peter Zijlstra,
	Ard Biesheuvel, linux-efi

efi_recover_from_page_fault() doesn't recover -- it does a special EFI
mini-oops.  Rename it to make it clear that it crashes.

While renaming it, I noticed a blatant bug: a page fault oops in a
different thread happening concurrently with an EFI runtime service call
would be misinterpreted as an EFI page fault.  Fix that.

This isn't quite exact.  We could do better by using a special CS for
calls into EFI.
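The fixed attribution check can be sketched as a userspace model (all types and the EFI_NONE value of 0 are stubbed here for illustration): a fault is blamed on EFI only if an EFI runtime service is recorded as in flight *and* the faulting thread is the EFI worker itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stubbed model of the kernel's efi_rts_work.  Before the fix, only
 * rts_id was consulted, so a concurrent OOPS in an unrelated thread
 * during an EFI call was misattributed to EFI. */
struct work { int dummy; };
static struct work efi_rts_work_item;	/* stands in for efi_rts_work.work */

static bool fault_is_from_efi_runtime(int rts_id /* EFI_NONE == 0 */,
				      const struct work *current_work)
{
	return rts_id != 0 && current_work == &efi_rts_work_item;
}
```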

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/efi.h     |  2 +-
 arch/x86/mm/fault.c            | 11 ++++++-----
 arch/x86/platform/efi/quirks.c | 16 ++++++++++++----
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index c98f78330b09..4b7706ddd8b6 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -150,7 +150,7 @@ extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
 extern void efi_switch_mm(struct mm_struct *mm);
-extern void efi_recover_from_page_fault(unsigned long phys_addr);
+extern void efi_crash_gracefully_on_page_fault(unsigned long phys_addr);
 extern void efi_free_boot_services(void);
 
 /* kexec external ABI */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eed217d4a877..dfdd56d9c020 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -16,7 +16,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
-#include <linux/efi.h>			/* efi_recover_from_page_fault()*/
+#include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
@@ -25,7 +25,7 @@
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
 #include <asm/mmu_context.h>		/* vma_pkey()			*/
-#include <asm/efi.h>			/* efi_recover_from_page_fault()*/
+#include <asm/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <asm/desc.h>			/* store_idt(), ...		*/
 #include <asm/cpu_entry_area.h>		/* exception stack		*/
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
@@ -700,11 +700,12 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 #endif
 
 	/*
-	 * Buggy firmware could access regions which might page fault, try to
-	 * recover from such faults.
+	 * Buggy firmware could access regions which might page fault.  If
+	 * this happens, EFI has a special OOPS path that will try to
+	 * avoid hanging the system.
 	 */
 	if (IS_ENABLED(CONFIG_EFI))
-		efi_recover_from_page_fault(address);
+		efi_crash_gracefully_on_page_fault(address);
 
 oops:
 	/*
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 5a40fe411ebd..0463ef9cddd6 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -687,15 +687,25 @@ int efi_capsule_setup_info(struct capsule_info *cap_info, void *kbuff,
  * @return: Returns, if the page fault is not handled. This function
  * will never return if the page fault is handled successfully.
  */
-void efi_recover_from_page_fault(unsigned long phys_addr)
+void efi_crash_gracefully_on_page_fault(unsigned long phys_addr)
 {
 	if (!IS_ENABLED(CONFIG_X86_64))
 		return;
 
+	/*
+	 * If we are in an interrupt nested inside an EFI runtime service,
+	 * then this is a regular OOPS, not an EFI failure.
+	 */
+	if (in_interrupt() || in_nmi() || in_softirq())
+		return;
+
 	/*
 	 * Make sure that an efi runtime service caused the page fault.
+	 * READ_ONCE() because we might be OOPSing in a different thread,
+	 * and we don't want to trip KTSAN while trying to OOPS.
 	 */
-	if (efi_rts_work.efi_rts_id == EFI_NONE)
+	if (READ_ONCE(efi_rts_work.efi_rts_id) == EFI_NONE ||
+	    current_work() != &efi_rts_work.work)
 		return;
 
 	/*
@@ -747,6 +757,4 @@ void efi_recover_from_page_fault(unsigned long phys_addr)
 		set_current_state(TASK_IDLE);
 		schedule();
 	}
-
-	return;
 }
-- 
2.29.2



* Re: [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs
  2021-02-10  2:33 ` [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs Andy Lutomirski
@ 2021-02-10  6:06   ` Andy Lutomirski
  2021-02-10 13:29     ` Borislav Petkov
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10  6:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu

On Tue, Feb 9, 2021 at 6:33 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Erratum #93 applies to the first generation of AMD K8 CPUs.  Skip the
> workaround on newer CPUs.

Whoops, this breaks the !CPU_SUP_AMD build.  It needs a fixup like this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fault&id=06772a3b6918bf6d6d0778946149b7d56ae30d80

Boris, do you want a v3?

>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/fault.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index cbb1a9754473..3fe2f4800b69 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -442,9 +442,8 @@ static void dump_pagetable(unsigned long address)
>   */
>  static int is_errata93(struct pt_regs *regs, unsigned long address)
>  {
> -#if defined(CONFIG_X86_64) && defined(CONFIG_CPU_SUP_AMD)
> -       if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD
> -           || boot_cpu_data.x86 != 0xf)
> +#if defined(CONFIG_X86_64)
> +       if (!is_amd_k8_pre_npt())
>                 return 0;
>
>         if (user_mode(regs))
> --
> 2.29.2
>


* Re: [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code
  2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
@ 2021-02-10  7:34   ` Christoph Hellwig
  2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2021-02-10  7:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, stable, Peter Zijlstra,
	Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs
  2021-02-10  6:06   ` Andy Lutomirski
@ 2021-02-10 13:29     ` Borislav Petkov
  2021-02-10 16:09       ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2021-02-10 13:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu

On Tue, Feb 09, 2021 at 10:06:02PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 9, 2021 at 6:33 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Erratum #93 applies to the first generation of AMD K8 CPUs.  Skip the
> > workaround on newer CPUs.
> 
> Whoops, this breaks the !CPU_SUP_AMD build.  It needs a fixup like this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fault&id=06772a3b6918bf6d6d0778946149b7d56ae30d80
> 
> Boris, do you want a v3?

Do we even want to apply patch 8 at all? This has always run on K8 NPT
CPUs too and those are pretty much obsolete by now so I'm thinking,
let sleeping dogs lie...?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [RFC][PATCH] kprobes: Remove kprobe::fault_handler
  2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
                   ` (13 preceding siblings ...)
  2021-02-10  2:33 ` [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault() Andy Lutomirski
@ 2021-02-10 13:57 ` Peter Zijlstra
  2021-02-10 19:09   ` Christoph Hellwig
  2021-02-10 19:47   ` Alexei Starovoitov
  14 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-02-10 13:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu


Somewhat related.. I had this pending.

---
Subject: kprobes: Remove kprobe::fault_handler
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Feb 2 10:43:41 CET 2021

The reason for kprobe::fault_handler(), as given by their comment:

 * We come here because instructions in the pre/post
 * handler caused the page_fault, this could happen
 * if handler tries to access user space by
 * copy_from_user(), get_user() etc. Let the
 * user-specified handler try to fix it first.

Is just plain bad. Those other handlers are run from non-preemptible
context and had better use _nofault() functions. Also, there is no
upstream usage of this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/trace/kprobes.rst    |   24 +++++-------------------
 arch/arc/kernel/kprobes.c          |   10 ----------
 arch/arm/probes/kprobes/core.c     |    9 ---------
 arch/arm64/kernel/probes/kprobes.c |   10 ----------
 arch/csky/kernel/probes/kprobes.c  |   10 ----------
 arch/ia64/kernel/kprobes.c         |    9 ---------
 arch/mips/kernel/kprobes.c         |    3 ---
 arch/powerpc/kernel/kprobes.c      |   10 ----------
 arch/s390/kernel/kprobes.c         |   10 ----------
 arch/sh/kernel/kprobes.c           |   10 ----------
 arch/sparc/kernel/kprobes.c        |   10 ----------
 arch/x86/kernel/kprobes/core.c     |   10 ----------
 include/linux/kprobes.h            |    8 --------
 kernel/kprobes.c                   |   19 -------------------
 samples/kprobes/kprobe_example.c   |   15 ---------------
 15 files changed, 5 insertions(+), 162 deletions(-)

--- a/Documentation/trace/kprobes.rst
+++ b/Documentation/trace/kprobes.rst
@@ -362,14 +362,11 @@ register_kprobe
 	#include <linux/kprobes.h>
 	int register_kprobe(struct kprobe *kp);
 
-Sets a breakpoint at the address kp->addr.  When the breakpoint is
-hit, Kprobes calls kp->pre_handler.  After the probed instruction
-is single-stepped, Kprobe calls kp->post_handler.  If a fault
-occurs during execution of kp->pre_handler or kp->post_handler,
-or during single-stepping of the probed instruction, Kprobes calls
-kp->fault_handler.  Any or all handlers can be NULL. If kp->flags
-is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled,
-so, its handlers aren't hit until calling enable_kprobe(kp).
+Sets a breakpoint at the address kp->addr.  When the breakpoint is hit, Kprobes
+calls kp->pre_handler.  After the probed instruction is single-stepped, Kprobe
+calls kp->post_handler.  Any or all handlers can be NULL. If kp->flags is set
+KPROBE_FLAG_DISABLED, that kp will be registered but disabled, so, its handlers
+aren't hit until calling enable_kprobe(kp).
 
 .. note::
 
@@ -415,17 +412,6 @@ the breakpoint was hit.  Return 0 here u
 p and regs are as described for the pre_handler.  flags always seems
 to be zero.
 
-User's fault-handler (kp->fault_handler)::
-
-	#include <linux/kprobes.h>
-	#include <linux/ptrace.h>
-	int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
-
-p and regs are as described for the pre_handler.  trapnr is the
-architecture-specific trap number associated with the fault (e.g.,
-on i386, 13 for a general protection fault or 14 for a page fault).
-Returns 1 if it successfully handled the exception.
-
 register_kretprobe
 ------------------
 
--- a/arch/arc/kernel/kprobes.c
+++ b/arch/arc/kernel/kprobes.c
@@ -324,16 +324,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned zero,
 		 * try to fix up.
 		 */
--- a/arch/arm/probes/kprobes/core.c
+++ b/arch/arm/probes/kprobes/core.c
@@ -358,15 +358,6 @@ int __kprobes kprobe_fault_handler(struc
 		 */
 		kprobes_inc_nmissed_count(cur);
 
-		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, fsr))
-			return 1;
 		break;
 
 	default:
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -283,16 +283,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, fsr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/csky/kernel/probes/kprobes.c
+++ b/arch/csky/kernel/probes/kprobes.c
@@ -302,16 +302,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/ia64/kernel/kprobes.c
+++ b/arch/ia64/kernel/kprobes.c
@@ -851,15 +851,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/mips/kernel/kprobes.c
+++ b/arch/mips/kernel/kprobes.c
@@ -403,9 +403,6 @@ int kprobe_fault_handler(struct pt_regs
 	struct kprobe *cur = kprobe_running();
 	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 
-	if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-		return 1;
-
 	if (kcb->kprobe_status & KPROBE_HIT_SS) {
 		resume_execution(cur, regs, kcb);
 		regs->cp0_status |= kcb->kprobe_old_SR;
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -509,16 +509,6 @@ int kprobe_fault_handler(struct pt_regs
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -453,16 +453,6 @@ static int kprobe_trap_handler(struct pt
 		kprobes_inc_nmissed_count(p);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (p->fault_handler && p->fault_handler(p, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/sh/kernel/kprobes.c
+++ b/arch/sh/kernel/kprobes.c
@@ -390,16 +390,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/sparc/kernel/kprobes.c
+++ b/arch/sparc/kernel/kprobes.c
@@ -353,16 +353,6 @@ int __kprobes kprobe_fault_handler(struc
 		kprobes_inc_nmissed_count(cur);
 
 		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
-
-		/*
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -962,16 +962,6 @@ int kprobe_fault_handler(struct pt_regs
 		 * these specific fault cases.
 		 */
 		kprobes_inc_nmissed_count(cur);
-
-		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page_fault, this could happen
-		 * if handler tries to access user space by
-		 * copy_from_user(), get_user() etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
-		if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
-			return 1;
 	}
 
 	return 0;
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -54,8 +54,6 @@ struct kretprobe_instance;
 typedef int (*kprobe_pre_handler_t) (struct kprobe *, struct pt_regs *);
 typedef void (*kprobe_post_handler_t) (struct kprobe *, struct pt_regs *,
 				       unsigned long flags);
-typedef int (*kprobe_fault_handler_t) (struct kprobe *, struct pt_regs *,
-				       int trapnr);
 typedef int (*kretprobe_handler_t) (struct kretprobe_instance *,
 				    struct pt_regs *);
 
@@ -83,12 +81,6 @@ struct kprobe {
 	/* Called after addr is executed, unless... */
 	kprobe_post_handler_t post_handler;
 
-	/*
-	 * ... called if executing addr causes a fault (eg. page fault).
-	 * Return 1 if it handled fault, otherwise kernel will see it.
-	 */
-	kprobe_fault_handler_t fault_handler;
-
 	/* Saved opcode (which has been replaced with breakpoint) */
 	kprobe_opcode_t opcode;
 
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1183,23 +1183,6 @@ static void aggr_post_handler(struct kpr
 }
 NOKPROBE_SYMBOL(aggr_post_handler);
 
-static int aggr_fault_handler(struct kprobe *p, struct pt_regs *regs,
-			      int trapnr)
-{
-	struct kprobe *cur = __this_cpu_read(kprobe_instance);
-
-	/*
-	 * if we faulted "during" the execution of a user specified
-	 * probe handler, invoke just that probe's fault handler
-	 */
-	if (cur && cur->fault_handler) {
-		if (cur->fault_handler(cur, regs, trapnr))
-			return 1;
-	}
-	return 0;
-}
-NOKPROBE_SYMBOL(aggr_fault_handler);
-
 /* Walks the list and increments nmissed count for multiprobe case */
 void kprobes_inc_nmissed_count(struct kprobe *p)
 {
@@ -1330,7 +1313,6 @@ static void init_aggr_kprobe(struct kpro
 	ap->addr = p->addr;
 	ap->flags = p->flags & ~KPROBE_FLAG_OPTIMIZED;
 	ap->pre_handler = aggr_pre_handler;
-	ap->fault_handler = aggr_fault_handler;
 	/* We don't care the kprobe which has gone. */
 	if (p->post_handler && !kprobe_gone(p))
 		ap->post_handler = aggr_post_handler;
@@ -1991,7 +1973,6 @@ int register_kretprobe(struct kretprobe
 
 	rp->kp.pre_handler = pre_handler_kretprobe;
 	rp->kp.post_handler = NULL;
-	rp->kp.fault_handler = NULL;
 
 	/* Pre-allocate memory for max kretprobe instances */
 	if (rp->maxactive <= 0) {
--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -79,26 +79,11 @@ static void __kprobes handler_post(struc
 #endif
 }
 
-/*
- * fault_handler: this is called if an exception is generated for any
- * instruction within the pre- or post-handler, or when Kprobes
- * single-steps the probed instruction.
- */
-static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
-{
-	pr_info("fault_handler: p->addr = 0x%p, trap #%dn", p->addr, trapnr);
-	/* Return 0 because we don't handle the fault. */
-	return 0;
-}
-/* NOKPROBE_SYMBOL() is also available */
-NOKPROBE_SYMBOL(handler_fault);
-
 static int __init kprobe_init(void)
 {
 	int ret;
 	kp.pre_handler = handler_pre;
 	kp.post_handler = handler_post;
-	kp.fault_handler = handler_fault;
 
 	ret = register_kprobe(&kp);
 	if (ret < 0) {


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs
  2021-02-10 13:29     ` Borislav Petkov
@ 2021-02-10 16:09       ` Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-02-10 16:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, LKML, Dave Hansen, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Masami Hiramatsu

On Wed, Feb 10, 2021 at 5:29 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Feb 09, 2021 at 10:06:02PM -0800, Andy Lutomirski wrote:
> > On Tue, Feb 9, 2021 at 6:33 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > Erratum #93 applies to the first generation of AMD K8 CPUs.  Skip the
> > > workaround on newer CPUs.
> >
> > Whoops, this breaks the !CPU_SUP_AMD build.  It needs a fixup like this:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fault&id=06772a3b6918bf6d6d0778946149b7d56ae30d80
> >
> > Boris, do you want a v3?
>
> Do we even want to apply patch 8 at all? This has always run on K8 NPT
> CPUs too and those are pretty much obsolete by now so I'm thinking,
> let sleeping dogs lie...?

I think we do want this patch.  Without it, there's a fairly sizeable
range of bogus RIP values that will result in applying the fixup
instead of properly oopsing, even on new CPUs.


* [tip: x86/mm] x86/fault: Don't run fixups for SMAP violations
  2021-02-10  2:33 ` [PATCH v2 13/14] x86/fault: Don't run fixups for SMAP violations Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     ca247283781d754216395a41c5e8be8ec79a5f1c
Gitweb:        https://git.kernel.org/tip/ca247283781d754216395a41c5e8be8ec79a5f1c
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:45 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 16:27:57 +01:00

x86/fault: Don't run fixups for SMAP violations

A SMAP-violating kernel access is not a recoverable condition.  Imagine
kernel code that, outside of a uaccess region, dereferences a pointer to
the user range by accident.  If SMAP is on, this will reliably fault,
but the fault used to be handled as though it were an intentional user
access.  This makes it easy for bugs to be overlooked if code is
inadequately tested both with and without SMAP.

This was discovered because BPF can generate invalid accesses to user
memory, but those warnings only got printed if SMAP was off. Make it so
that this type of error will be discovered with SMAP on as well.

 [ bp: Massage commit message. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/66a02343624b1ff46f02a838c497fc05c1a871b3.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c |  9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1a0cfed..1c3054b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1279,9 +1279,12 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
 		     !(error_code & X86_PF_USER) &&
-		     !(regs->flags & X86_EFLAGS_AC)))
-	{
-		bad_area_nosemaphore(regs, error_code, address);
+		     !(regs->flags & X86_EFLAGS_AC))) {
+		/*
+		 * No extable entry here.  This was a kernel access to an
+		 * invalid pointer.  get_kernel_nofault() will not get here.
+		 */
+		page_fault_oops(regs, error_code, address);
 		return;
 	}
 


* [tip: x86/mm] x86/fault: Don't look for extable entries for SMEP violations
  2021-02-10  2:33 ` [PATCH v2 12/14] x86/fault: Don't look for extable entries for SMEP violations Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     66fcd98883816dba3b66da20b5fc86fa410638b5
Gitweb:        https://git.kernel.org/tip/66fcd98883816dba3b66da20b5fc86fa410638b5
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:44 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:45:39 +01:00

x86/fault: Don't look for extable entries for SMEP violations

If the kernel gets a SMEP violation, or a fault that would have been a
SMEP violation if the CPU had SMEP support, it shouldn't run fixups.
Just OOPS.

 [ bp: Massage commit message. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/46160d8babce2abf1d6daa052146002efa24ac56.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3566a59..1a0cfed 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1250,12 +1250,12 @@ void do_user_addr_fault(struct pt_regs *regs,
 		 * user memory.  Unless this is AMD erratum #93, which
 		 * corrupts RIP such that it looks like a user address,
 		 * this is unrecoverable.  Don't even try to look up the
-		 * VMA.
+		 * VMA or look for extable entries.
 		 */
 		if (is_errata93(regs, address))
 			return;
 
-		bad_area_nosemaphore(regs, error_code, address);
+		page_fault_oops(regs, error_code, address);
 		return;
 	}
 


* [tip: x86/mm] x86/{fault,efi}: Fix and rename efi_recover_from_page_fault()
  2021-02-10  2:33 ` [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault() Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  2021-02-11  8:38   ` [PATCH v2 14/14] x86/fault, x86/efi: " Ard Biesheuvel
  1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     c46f52231e79af025e2c89e889d69ec20a4c024f
Gitweb:        https://git.kernel.org/tip/c46f52231e79af025e2c89e889d69ec20a4c024f
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:46 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 18:39:23 +01:00

x86/{fault,efi}: Fix and rename efi_recover_from_page_fault()

efi_recover_from_page_fault() doesn't recover -- it does a special EFI
mini-oops.  Rename it to make it clear that it crashes.

While renaming it, I noticed a blatant bug: a page fault oops in a
different thread happening concurrently with an EFI runtime service call
would be misinterpreted as an EFI page fault.  Fix that.

Even with the fix, the check isn't quite exact; the situation could be
improved by using a special CS for calls into EFI.

 [ bp: Massage commit message and simplify in interrupt check. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/f43b1e80830dc78ed60ed8b0826f4f189254570c.1612924255.git.luto@kernel.org
---
 arch/x86/include/asm/efi.h     |  2 +-
 arch/x86/mm/fault.c            | 11 ++++++-----
 arch/x86/platform/efi/quirks.c | 16 ++++++++++++----
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index c98f783..4b7706d 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -150,7 +150,7 @@ extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
 extern void efi_switch_mm(struct mm_struct *mm);
-extern void efi_recover_from_page_fault(unsigned long phys_addr);
+extern void efi_crash_gracefully_on_page_fault(unsigned long phys_addr);
 extern void efi_free_boot_services(void);
 
 /* kexec external ABI */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1c3054b..7b3a125 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -16,7 +16,7 @@
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
-#include <linux/efi.h>			/* efi_recover_from_page_fault()*/
+#include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
@@ -25,7 +25,7 @@
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
 #include <asm/mmu_context.h>		/* vma_pkey()			*/
-#include <asm/efi.h>			/* efi_recover_from_page_fault()*/
+#include <asm/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <asm/desc.h>			/* store_idt(), ...		*/
 #include <asm/cpu_entry_area.h>		/* exception stack		*/
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
@@ -701,11 +701,12 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
 #endif
 
 	/*
-	 * Buggy firmware could access regions which might page fault, try to
-	 * recover from such faults.
+	 * Buggy firmware could access regions which might page fault.  If
+	 * this happens, EFI has a special OOPS path that will try to
+	 * avoid hanging the system.
 	 */
 	if (IS_ENABLED(CONFIG_EFI))
-		efi_recover_from_page_fault(address);
+		efi_crash_gracefully_on_page_fault(address);
 
 oops:
 	/*
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 5a40fe4..67d93a2 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -687,15 +687,25 @@ int efi_capsule_setup_info(struct capsule_info *cap_info, void *kbuff,
  * @return: Returns, if the page fault is not handled. This function
  * will never return if the page fault is handled successfully.
  */
-void efi_recover_from_page_fault(unsigned long phys_addr)
+void efi_crash_gracefully_on_page_fault(unsigned long phys_addr)
 {
 	if (!IS_ENABLED(CONFIG_X86_64))
 		return;
 
 	/*
+	 * If we get an interrupt/NMI while processing an EFI runtime service
+	 * then this is a regular OOPS, not an EFI failure.
+	 */
+	if (in_interrupt())
+		return;
+
+	/*
 	 * Make sure that an efi runtime service caused the page fault.
+	 * READ_ONCE() because we might be OOPSing in a different thread,
+	 * and we don't want to trip KTSAN while trying to OOPS.
 	 */
-	if (efi_rts_work.efi_rts_id == EFI_NONE)
+	if (READ_ONCE(efi_rts_work.efi_rts_id) == EFI_NONE ||
+	    current_work() != &efi_rts_work.work)
 		return;
 
 	/*
@@ -747,6 +757,4 @@ void efi_recover_from_page_fault(unsigned long phys_addr)
 		set_current_state(TASK_IDLE);
 		schedule();
 	}
-
-	return;
 }


* [tip: x86/mm] x86/fault: Rename no_context() to kernelmode_fixup_or_oops()
  2021-02-10  2:33 ` [PATCH v2 11/14] x86/fault: Rename no_context() to kernelmode_fixup_or_oops() Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     6456a2a69ee16ad402f26d272d0b67ce1d25061f
Gitweb:        https://git.kernel.org/tip/6456a2a69ee16ad402f26d272d0b67ce1d25061f
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:43 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:41:19 +01:00

x86/fault: Rename no_context() to kernelmode_fixup_or_oops()

The name no_context() has never been very clear.  It's only called for
faults from kernel mode, so rename it and change the no-longer-useful
user_mode(regs) check to a WARN_ON_ONCE.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/c21940efe676024bb4bc721f7d70c29c420e127e.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 187975b..3566a59 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -730,17 +730,10 @@ oops:
 }
 
 static noinline void
-no_context(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int signal, int si_code)
+kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
+			 unsigned long address, int signal, int si_code)
 {
-	if (user_mode(regs)) {
-		/*
-		 * This is an implicit supervisor-mode access from user
-		 * mode.  Bypass all the kernel-mode recovery code and just
-		 * OOPS.
-		 */
-		goto oops;
-	}
+	WARN_ON_ONCE(user_mode(regs));
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
@@ -780,7 +773,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	if (is_prefetch(regs, error_code, address))
 		return;
 
-oops:
 	page_fault_oops(regs, error_code, address);
 }
 
@@ -827,7 +819,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk = current;
 
 	if (!user_mode(regs)) {
-		no_context(regs, error_code, address, pkey, si_code);
+		kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code);
 		return;
 	}
 
@@ -959,7 +951,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
 	/* Kernel mode? Handle exceptions or die: */
 	if (!user_mode(regs)) {
-		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+		kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, BUS_ADRERR);
 		return;
 	}
 
@@ -1421,8 +1413,8 @@ good_area:
 		 * has unlocked the mm for us if we get here.
 		 */
 		if (!user_mode(regs))
-			no_context(regs, error_code, address, SIGBUS,
-				   BUS_ADRERR);
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGBUS, BUS_ADRERR);
 		return;
 	}
 
@@ -1442,15 +1434,15 @@ good_area:
 		return;
 
 	if (fatal_signal_pending(current) && !user_mode(regs)) {
-		no_context(regs, error_code, address, 0, 0);
+		kernelmode_fixup_or_oops(regs, error_code, address, 0, 0);
 		return;
 	}
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!user_mode(regs)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
+			kernelmode_fixup_or_oops(regs, error_code, address,
+						 SIGSEGV, SEGV_MAPERR);
 			return;
 		}
 


* [tip: x86/mm] x86/fault: Bypass no_context() for implicit kernel faults from usermode
  2021-02-10  2:33 ` [PATCH v2 10/14] x86/fault: Bypass no_context() for implicit kernel faults from usermode Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     5042d40a264c8a508d58ed71e4c07b05175b3635
Gitweb:        https://git.kernel.org/tip/5042d40a264c8a508d58ed71e4c07b05175b3635
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:42 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:39:52 +01:00

x86/fault: Bypass no_context() for implicit kernel faults from usermode

Drop an indentation level and remove the last user_mode(regs) == true
caller of no_context() by directly OOPSing for implicit kernel faults
from usermode.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/6e3d1129494a8de1e59d28012286e3a292a2296e.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 59 +++++++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dbf6a94..187975b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -826,44 +826,49 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 {
 	struct task_struct *tsk = current;
 
-	/* User mode accesses just cause a SIGSEGV */
-	if (user_mode(regs) && (error_code & X86_PF_USER)) {
-		/*
-		 * It's possible to have interrupts off here:
-		 */
-		local_irq_enable();
+	if (!user_mode(regs)) {
+		no_context(regs, error_code, address, pkey, si_code);
+		return;
+	}
 
-		/*
-		 * Valid to do another page fault here because this one came
-		 * from user space:
-		 */
-		if (is_prefetch(regs, error_code, address))
-			return;
+	if (!(error_code & X86_PF_USER)) {
+		/* Implicit user access to kernel memory -- just oops */
+		page_fault_oops(regs, error_code, address);
+		return;
+	}
 
-		if (is_errata100(regs, address))
-			return;
+	/*
+	 * User mode accesses just cause a SIGSEGV.
+	 * It's possible to have interrupts off here:
+	 */
+	local_irq_enable();
 
-		sanitize_error_code(address, &error_code);
+	/*
+	 * Valid to do another page fault here because this one came
+	 * from user space:
+	 */
+	if (is_prefetch(regs, error_code, address))
+		return;
 
-		if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
-			return;
+	if (is_errata100(regs, address))
+		return;
 
-		if (likely(show_unhandled_signals))
-			show_signal_msg(regs, error_code, address, tsk);
+	sanitize_error_code(address, &error_code);
 
-		set_signal_archinfo(address, error_code);
+	if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
+		return;
 
-		if (si_code == SEGV_PKUERR)
-			force_sig_pkuerr((void __user *)address, pkey);
+	if (likely(show_unhandled_signals))
+		show_signal_msg(regs, error_code, address, tsk);
 
-		force_sig_fault(SIGSEGV, si_code, (void __user *)address);
+	set_signal_archinfo(address, error_code);
 
-		local_irq_disable();
+	if (si_code == SEGV_PKUERR)
+		force_sig_pkuerr((void __user *)address, pkey);
 
-		return;
-	}
+	force_sig_fault(SIGSEGV, si_code, (void __user *)address);
 
-	no_context(regs, error_code, address, SIGSEGV, si_code);
+	local_irq_disable();
 }
 
 static noinline void


* [tip: x86/mm] x86/fault: Improve kernel-executing-user-memory handling
  2021-02-10  2:33 ` [PATCH v2 07/14] x86/fault: Improve kernel-executing-user-memory handling Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     03c81ea3331658f613bb2913d33764a4e0410cbd
Gitweb:        https://git.kernel.org/tip/03c81ea3331658f613bb2913d33764a4e0410cbd
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:39 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:20:54 +01:00

x86/fault: Improve kernel-executing-user-memory handling

Right now, the case of the kernel trying to execute from user memory
is treated more or less just like the kernel getting a page fault on a
user access. In the failure path, it checks for erratum #93, tries to
otherwise fix up the error, and then oopses.

If the kernel manages to jump into the user address space, with or
without SMEP, it should not try to resolve the page fault. This is an
error, pure and
simple. Rearrange the code so that this case is caught early, check for
erratum #93, and bail out.

 [ bp: Massage commit message. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/ab8719c7afb8bd501c4eee0e36493150fbbe5f6a.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b110484..cbb1a97 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -447,6 +447,9 @@ static int is_errata93(struct pt_regs *regs, unsigned long address)
 	    || boot_cpu_data.x86 != 0xf)
 		return 0;
 
+	if (user_mode(regs))
+		return 0;
+
 	if (address != regs->ip)
 		return 0;
 
@@ -744,9 +747,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	if (is_prefetch(regs, error_code, address))
 		return;
 
-	if (is_errata93(regs, address))
-		return;
-
 	/*
 	 * Buggy firmware could access regions which might page fault, try to
 	 * recover from such faults.
@@ -1239,6 +1239,21 @@ void do_user_addr_fault(struct pt_regs *regs,
 	tsk = current;
 	mm = tsk->mm;
 
+	if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR)) {
+		/*
+		 * Whoops, this is kernel mode code trying to execute from
+		 * user memory.  Unless this is AMD erratum #93, which
+		 * corrupts RIP such that it looks like a user address,
+		 * this is unrecoverable.  Don't even try to look up the
+		 * VMA.
+		 */
+		if (is_errata93(regs, address))
+			return;
+
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF)))
 		return;


* [tip: x86/mm] x86/fault: Correct a few user vs kernel checks wrt WRUSS
  2021-02-10  2:33 ` [PATCH v2 06/14] x86/fault: Correct a few user vs kernel checks wrt WRUSS Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     56e62cd28aaae2fcbec8af67b05843c47c6da170
Gitweb:        https://git.kernel.org/tip/56e62cd28aaae2fcbec8af67b05843c47c6da170
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:38 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:13:32 +01:00

x86/fault: Correct a few user vs kernel checks wrt WRUSS

In general, page fault errors for WRUSS should be just like get_user(),
etc.  Fix three bugs in this area:

There is a comment that says that, if the kernel can't handle a page fault
on a user address due to OOM, the OOM-kill-and-retry logic would be
skipped.  The code checked kernel *privilege*, not kernel mode, so it
missed WRUSS.  This means that the kernel would malfunction if it got OOM
on a WRUSS fault -- this would be a kernel-mode, user-privilege fault, and
the OOM killer would be invoked and the handler would retry the faulting
instruction.

A failed user access from kernel while a fatal signal is pending should
fail even if the instruction in question was WRUSS.

do_sigbus() should not send SIGBUS for WRUSS -- it should handle it like
any other kernel mode failure.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/a7b7bcea730bd4069e6b7e629236bb2cf526c2fb.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 013910b..b110484 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -945,7 +945,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	  vm_fault_t fault)
 {
 	/* Kernel mode? Handle exceptions or die: */
-	if (!(error_code & X86_PF_USER)) {
+	if (!user_mode(regs)) {
 		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
 		return;
 	}
@@ -1217,7 +1217,14 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
-/* Handle faults in the user portion of the address space */
+/*
+ * Handle faults in the user portion of the address space.  Nothing in here
+ * should check X86_PF_USER without a specific justification: for almost
+ * all purposes, we should treat a normal kernel access to user memory
+ * (e.g. get_user(), put_user(), etc.) the same as the WRUSS instruction.
+ * The one exception is AC flag handling, which is, per the x86
+ * architecture, special for WRUSS.
+ */
 static inline
 void do_user_addr_fault(struct pt_regs *regs,
 			unsigned long error_code,
@@ -1406,14 +1413,14 @@ good_area:
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
 
-	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
+	if (fatal_signal_pending(current) && !user_mode(regs)) {
 		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & X86_PF_USER)) {
+		if (!user_mode(regs)) {
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
 			return;


* [tip: x86/mm] x86/fault: Document the locking in the fault_signal_pending() path
  2021-02-10  2:33 ` [PATCH v2 05/14] x86/fault: Document the locking in the fault_signal_pending() path Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     ef2544fb3f6457b79fc73cea39dafd67ee0f2824
Gitweb:        https://git.kernel.org/tip/ef2544fb3f6457b79fc73cea39dafd67ee0f2824
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:37 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:12:07 +01:00

x86/fault: Document the locking in the fault_signal_pending() path

If fault_signal_pending() returns true, then the core mm has unlocked the
mm for us.  Add a comment to help future readers of this code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/c56de3d103f40e6304437b150aa7b215530d23f7.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3ffed00..013910b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1380,8 +1380,11 @@ good_area:
 	 */
 	fault = handle_mm_fault(vma, address, flags, regs);
 
-	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
+		/*
+		 * Quick path to respond to signals.  The core mm code
+		 * has unlocked the mm for us if we get here.
+		 */
 		if (!user_mode(regs))
 			no_context(regs, error_code, address, SIGBUS,
 				   BUS_ADRERR);


* [tip: x86/mm] x86/fault: Split the OOPS code out from no_context()
  2021-02-10  2:33 ` [PATCH v2 09/14] x86/fault: Split the OOPS code out from no_context() Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     2cc624b0a7e68ba8957b18600181f7d5b0f3e1b6
Gitweb:        https://git.kernel.org/tip/2cc624b0a7e68ba8957b18600181f7d5b0f3e1b6
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:41 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:33:36 +01:00

x86/fault: Split the OOPS code out from no_context()

Not all callers of no_context() want to run exception fixups.
Separate the OOPS code out from the fixup code in no_context().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/450f8d8eabafb83a5df349108c8e5ea83a2f939d.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 116 ++++++++++++++++++++++---------------------
 1 file changed, 62 insertions(+), 54 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index cbb1a97..dbf6a94 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -655,53 +655,20 @@ static void set_signal_archinfo(unsigned long address,
 }
 
 static noinline void
-no_context(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int signal, int si_code)
+page_fault_oops(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address)
 {
-	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
 
 	if (user_mode(regs)) {
 		/*
-		 * This is an implicit supervisor-mode access from user
-		 * mode.  Bypass all the kernel-mode recovery code and just
-		 * OOPS.
+		 * Implicit kernel access from user mode?  Skip the stack
+		 * overflow and EFI special cases.
 		 */
 		goto oops;
 	}
 
-	/* Are we prepared to handle this kernel fault? */
-	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
-		/*
-		 * Any interrupt that takes a fault gets the fixup. This makes
-		 * the below recursive fault logic only apply to a faults from
-		 * task context.
-		 */
-		if (in_interrupt())
-			return;
-
-		/*
-		 * Per the above we're !in_interrupt(), aka. task context.
-		 *
-		 * In this case we need to make sure we're not recursively
-		 * faulting through the emulate_vsyscall() logic.
-		 */
-		if (current->thread.sig_on_uaccess_err && signal) {
-			sanitize_error_code(address, &error_code);
-
-			set_signal_archinfo(address, error_code);
-
-			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_fault(signal, si_code, (void __user *)address);
-		}
-
-		/*
-		 * Barring that, we can do the fixup and be happy.
-		 */
-		return;
-	}
-
 #ifdef CONFIG_VMAP_STACK
 	/*
 	 * Stack overflow?  During boot, we can fault near the initial
@@ -709,8 +676,8 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	 * that we're in vmalloc space to avoid this.
 	 */
 	if (is_vmalloc_addr((void *)address) &&
-	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
-	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
+	    (((unsigned long)current->stack - 1 - address < PAGE_SIZE) ||
+	     address - ((unsigned long)current->stack + THREAD_SIZE) < PAGE_SIZE)) {
 		unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
 		/*
 		 * We're likely to be running with very little stack space
@@ -734,20 +701,6 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 #endif
 
 	/*
-	 * 32-bit:
-	 *
-	 *   Valid to do another page fault here, because if this fault
-	 *   had been triggered by is_prefetch fixup_exception would have
-	 *   handled it.
-	 *
-	 * 64-bit:
-	 *
-	 *   Hall of shame of CPU/BIOS bugs.
-	 */
-	if (is_prefetch(regs, error_code, address))
-		return;
-
-	/*
 	 * Buggy firmware could access regions which might page fault, try to
 	 * recover from such faults.
 	 */
@@ -763,7 +716,7 @@ oops:
 
 	show_fault_oops(regs, error_code, address);
 
-	if (task_stack_end_corrupted(tsk))
+	if (task_stack_end_corrupted(current))
 		printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
 
 	sig = SIGKILL;
@@ -776,6 +729,61 @@ oops:
 	oops_end(flags, regs, sig);
 }
 
+static noinline void
+no_context(struct pt_regs *regs, unsigned long error_code,
+	   unsigned long address, int signal, int si_code)
+{
+	if (user_mode(regs)) {
+		/*
+		 * This is an implicit supervisor-mode access from user
+		 * mode.  Bypass all the kernel-mode recovery code and just
+		 * OOPS.
+		 */
+		goto oops;
+	}
+
+	/* Are we prepared to handle this kernel fault? */
+	if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
+		/*
+		 * Any interrupt that takes a fault gets the fixup. This makes
+		 * the below recursive fault logic only apply to a faults from
+		 * task context.
+		 */
+		if (in_interrupt())
+			return;
+
+		/*
+		 * Per the above we're !in_interrupt(), aka. task context.
+		 *
+		 * In this case we need to make sure we're not recursively
+		 * faulting through the emulate_vsyscall() logic.
+		 */
+		if (current->thread.sig_on_uaccess_err && signal) {
+			sanitize_error_code(address, &error_code);
+
+			set_signal_archinfo(address, error_code);
+
+			/* XXX: hwpoison faults will set the wrong code. */
+			force_sig_fault(signal, si_code, (void __user *)address);
+		}
+
+		/*
+		 * Barring that, we can do the fixup and be happy.
+		 */
+		return;
+	}
+
+	/*
+	 * AMD erratum #91 manifests as a spurious page fault on a PREFETCH
+	 * instruction.
+	 */
+	if (is_prefetch(regs, error_code, address))
+		return;
+
+oops:
+	page_fault_oops(regs, error_code, address);
+}
+
 /*
  * Print out info about fatal segfaults, if the show_unhandled_signals
  * sysctl is set:


* [tip: x86/mm] x86/fault: Fold mm_fault_error() into do_user_addr_fault()
  2021-02-10  2:33 ` [PATCH v2 03/14] x86/fault: Fold mm_fault_error() into do_user_addr_fault() Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     ec352711ceba890ea3a0c182c2d49c86c1a5e30e
Gitweb:        https://git.kernel.org/tip/ec352711ceba890ea3a0c182c2d49c86c1a5e30e
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:35 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:10:07 +01:00

x86/fault: Fold mm_fault_error() into do_user_addr_fault()

mm_fault_error() is logically just the end of do_user_addr_fault().
Combine the functions.  This makes the code easier to read.

Most of the churn here is from renaming hw_error_code to error_code in
do_user_addr_fault().

This makes no difference at all to the generated code (objdump -dr) as
compared to changing noinline to __always_inline in the definition of
mm_fault_error().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/dedc4d9c9b047e51ce38b991bd23971a28af4e7b.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 97 ++++++++++++++++++++------------------------
 1 file changed, 45 insertions(+), 52 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 818902b..91cf7a6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -981,40 +981,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, vm_fault_t fault)
-{
-	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
-		no_context(regs, error_code, address, 0, 0);
-		return;
-	}
-
-	if (fault & VM_FAULT_OOM) {
-		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & X86_PF_USER)) {
-			no_context(regs, error_code, address,
-				   SIGSEGV, SEGV_MAPERR);
-			return;
-		}
-
-		/*
-		 * We ran out of memory, call the OOM killer, and return the
-		 * userspace (which will retry the fault, or kill us if we got
-		 * oom-killed):
-		 */
-		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
-	}
-}
-
 static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
 {
 	if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
@@ -1252,7 +1218,7 @@ NOKPROBE_SYMBOL(do_kern_addr_fault);
 /* Handle faults in the user portion of the address space */
 static inline
 void do_user_addr_fault(struct pt_regs *regs,
-			unsigned long hw_error_code,
+			unsigned long error_code,
 			unsigned long address)
 {
 	struct vm_area_struct *vma;
@@ -1272,8 +1238,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * Reserved bits are never expected to be set on
 	 * entries in the user portion of the page tables.
 	 */
-	if (unlikely(hw_error_code & X86_PF_RSVD))
-		pgtable_bad(regs, hw_error_code, address);
+	if (unlikely(error_code & X86_PF_RSVD))
+		pgtable_bad(regs, error_code, address);
 
 	/*
 	 * If SMAP is on, check for invalid kernel (supervisor) access to user
@@ -1283,10 +1249,10 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * enforcement appears to be consistent with the USER bit.
 	 */
 	if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
-		     !(hw_error_code & X86_PF_USER) &&
+		     !(error_code & X86_PF_USER) &&
 		     !(regs->flags & X86_EFLAGS_AC)))
 	{
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
 
@@ -1295,7 +1261,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, hw_error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
 
@@ -1316,9 +1282,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-	if (hw_error_code & X86_PF_WRITE)
+	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
-	if (hw_error_code & X86_PF_INSTR)
+	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
 #ifdef CONFIG_X86_64
@@ -1334,7 +1300,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * to consider the PF_PK bit.
 	 */
 	if (is_vsyscall_vaddr(address)) {
-		if (emulate_vsyscall(hw_error_code, regs, address))
+		if (emulate_vsyscall(error_code, regs, address))
 			return;
 	}
 #endif
@@ -1357,7 +1323,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			 * Fault from code in kernel from
 			 * which we do not expect faults.
 			 */
-			bad_area_nosemaphore(regs, hw_error_code, address);
+			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
 retry:
@@ -1373,17 +1339,17 @@ retry:
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 	if (likely(vma->vm_start <= address))
 		goto good_area;
 	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, error_code, address);
 		return;
 	}
 
@@ -1392,8 +1358,8 @@ retry:
 	 * we can handle it..
 	 */
 good_area:
-	if (unlikely(access_error(hw_error_code, vma))) {
-		bad_area_access_error(regs, hw_error_code, address, vma);
+	if (unlikely(access_error(error_code, vma))) {
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1415,7 +1381,7 @@ good_area:
 	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
-			no_context(regs, hw_error_code, address, SIGBUS,
+			no_context(regs, error_code, address, SIGBUS,
 				   BUS_ADRERR);
 		return;
 	}
@@ -1432,11 +1398,38 @@ good_area:
 	}
 
 	mmap_read_unlock(mm);
-	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, hw_error_code, address, fault);
+	if (likely(!(fault & VM_FAULT_ERROR)))
+		return;
+
+	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
+		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
 
+	if (fault & VM_FAULT_OOM) {
+		/* Kernel mode? Handle exceptions or die: */
+		if (!(error_code & X86_PF_USER)) {
+			no_context(regs, error_code, address,
+				   SIGSEGV, SEGV_MAPERR);
+			return;
+		}
+
+		/*
+		 * We ran out of memory, call the OOM killer, and return the
+		 * userspace (which will retry the fault, or kill us if we got
+		 * oom-killed):
+		 */
+		pagefault_out_of_memory();
+	} else {
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+			     VM_FAULT_HWPOISON_LARGE))
+			do_sigbus(regs, error_code, address, fault);
+		else if (fault & VM_FAULT_SIGSEGV)
+			bad_area_nosemaphore(regs, error_code, address);
+		else
+			BUG();
+	}
+
 	check_v8086_mode(regs, address, tsk);
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);


* [tip: x86/mm] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()
  2021-02-10  2:33 ` [PATCH v2 04/14] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault() Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     f42a40fd53fb5c77bae67d917d66078dbaa46bc2
Gitweb:        https://git.kernel.org/tip/f42a40fd53fb5c77bae67d917d66078dbaa46bc2
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:36 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 14:11:07 +01:00

x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()

bad_area() and its relatives are called from many places in fault.c, and
exactly one of them wants the F00F workaround.

__bad_area_nosemaphore() no longer contains any kernel fault code, which
prepares for further cleanups.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/e9668729a48ce6754022b0a4415631e8ebdd00e7.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 91cf7a6..3ffed00 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -482,10 +482,12 @@ static int is_errata100(struct pt_regs *regs, unsigned long address)
 }
 
 /* Pentium F0 0F C7 C8 bug workaround: */
-static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
+static int is_f00f_bug(struct pt_regs *regs, unsigned long error_code,
+		       unsigned long address)
 {
 #ifdef CONFIG_X86_F00F_BUG
-	if (boot_cpu_has_bug(X86_BUG_F00F) && idt_is_f00f_address(address)) {
+	if (boot_cpu_has_bug(X86_BUG_F00F) && !(error_code & X86_PF_USER) &&
+	    idt_is_f00f_address(address)) {
 		handle_invalid_op(regs);
 		return 1;
 	}
@@ -853,9 +855,6 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		return;
 	}
 
-	if (is_f00f_bug(regs, address))
-		return;
-
 	no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
@@ -1195,6 +1194,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	}
 #endif
 
+	if (is_f00f_bug(regs, hw_error_code, address))
+		return;
+
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;


* [tip: x86/mm] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs
  2021-02-10  2:33 ` [PATCH v2 02/14] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs Andy Lutomirski
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     d24df8ecf9b6f81029f520ae7158a8670a28d70b
Gitweb:        https://git.kernel.org/tip/d24df8ecf9b6f81029f520ae7158a8670a28d70b
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:34 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 13:38:12 +01:00

x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs

According to the Revision Guide for AMD Athlon™ 64 and AMD Opteron™
Processors, only early revisions of family 0xF are affected. This will
avoid unnecessarily fetching instruction bytes before sending SIGSEGV to
user programs.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/477173b7784bc28afb3e53d76ae5ef143917e8dd.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 441c3e9..818902b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -106,6 +106,15 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
 	}
 }
 
+static bool is_amd_k8_pre_npt(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	return unlikely(IS_ENABLED(CONFIG_CPU_SUP_AMD) &&
+			c->x86_vendor == X86_VENDOR_AMD &&
+			c->x86 == 0xf && c->x86_model < 0x40);
+}
+
 static int
 is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 {
@@ -113,6 +122,10 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	unsigned char *instr;
 	int prefetch = 0;
 
+	/* Erratum #91 affects AMD K8, pre-NPT CPUs */
+	if (!is_amd_k8_pre_npt())
+		return 0;
+
 	/*
 	 * If it was a exec (instruction fetch) fault on NX page, then
 	 * do not ignore the fault:


* [tip: x86/mm] x86/fault: Fix AMD erratum #91 errata fixup for user code
  2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
  2021-02-10  7:34   ` Christoph Hellwig
@ 2021-02-10 17:45   ` tip-bot2 for Andy Lutomirski
  1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Andy Lutomirski @ 2021-02-10 17:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andy Lutomirski, Borislav Petkov, Christoph Hellwig, stable, x86,
	linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     35f1c89b0cce247bf0213df243ed902989b1dcda
Gitweb:        https://git.kernel.org/tip/35f1c89b0cce247bf0213df243ed902989b1dcda
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 09 Feb 2021 18:33:33 -08:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Wed, 10 Feb 2021 13:11:41 +01:00

x86/fault: Fix AMD erratum #91 errata fixup for user code

The recent rework of probe_kernel_address() and its conversion to
get_kernel_nofault() inadvertently broke is_prefetch(). Before this
change, probe_kernel_address() was used as a sloppy "read user or
kernel memory" helper, but it doesn't do that any more. The new
get_kernel_nofault() reads *kernel* memory only, which completely broke
is_prefetch() for user access.

Adjust the code to use the correct accessor based on access mode. As a
bonus, the open-coded address bounds check is no longer necessary: the
accessor helpers (get_user() / get_kernel_nofault()) do the right thing
all by themselves.

 [ bp: Massage commit message. ]

Fixes: eab0c6089b68 ("maccess: unify the probe kernel arch hooks")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/b91f7f92f3367d2d3a88eec3b09c6aab1b2dc8ef.1612924255.git.luto@kernel.org
---
 arch/x86/mm/fault.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a..441c3e9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -54,7 +54,7 @@ kmmio_fault(struct pt_regs *regs, unsigned long addr)
  * 32-bit mode:
  *
  *   Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
- *   Check that here and ignore it.
+ *   Check that here and ignore it.  This is AMD erratum #91.
  *
  * 64-bit mode:
  *
@@ -83,11 +83,7 @@ check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr,
 #ifdef CONFIG_X86_64
 	case 0x40:
 		/*
-		 * In AMD64 long mode 0x40..0x4F are valid REX prefixes
-		 * Need to figure out under what instruction mode the
-		 * instruction was issued. Could check the LDT for lm,
-		 * but for now it's good enough to assume that long
-		 * mode only uses well known segments or kernel.
+		 * In 64-bit mode 0x40..0x4F are valid REX prefixes
 		 */
 		return (!user_mode(regs) || user_64bit_mode(regs));
 #endif
@@ -127,20 +123,31 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	instr = (void *)convert_ip_to_linear(current, regs);
 	max_instr = instr + 15;
 
-	if (user_mode(regs) && instr >= (unsigned char *)TASK_SIZE_MAX)
-		return 0;
+	/*
+	 * This code has historically always bailed out if IP points to a
+	 * not-present page (e.g. due to a race).  No one has ever
+	 * complained about this.
+	 */
+	pagefault_disable();
 
 	while (instr < max_instr) {
 		unsigned char opcode;
 
-		if (get_kernel_nofault(opcode, instr))
-			break;
+		if (user_mode(regs)) {
+			if (get_user(opcode, instr))
+				break;
+		} else {
+			if (get_kernel_nofault(opcode, instr))
+				break;
+		}
 
 		instr++;
 
 		if (!check_prefetch_opcode(regs, instr, opcode, &prefetch))
 			break;
 	}
+
+	pagefault_enable();
 	return prefetch;
 }
 


* Re: [RFC][PATCH] kprobes: Remove kprobe::fault_handler
  2021-02-10 13:57 ` [RFC][PATCH] kprobes: Remove kprobe::fault_handler Peter Zijlstra
@ 2021-02-10 19:09   ` Christoph Hellwig
  2021-02-10 19:47   ` Alexei Starovoitov
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2021-02-10 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, LKML, Dave Hansen, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Masami Hiramatsu

On Wed, Feb 10, 2021 at 02:57:03PM +0100, Peter Zijlstra wrote:
> 
> Somewhat related.. I had this pending.
> 
> ---
> Subject: kprobes: Remove kprobe::fault_handler
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Feb 2 10:43:41 CET 2021
> 
> The reason for kprobe::fault_handler(), as given by its comment:
> 
>  * We come here because instructions in the pre/post
>  * handler caused the page_fault, this could happen
>  * if handler tries to access user space by
>  * copy_from_user(), get_user() etc. Let the
>  * user-specified handler try to fix it first.
> 
> It's just plain bad. Those other handlers are run from non-preemptible
> context and had better use _nofault() functions. Also, there is no
> upstream usage of this.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [RFC][PATCH] kprobes: Remove kprobe::fault_handler
  2021-02-10 13:57 ` [RFC][PATCH] kprobes: Remove kprobe::fault_handler Peter Zijlstra
  2021-02-10 19:09   ` Christoph Hellwig
@ 2021-02-10 19:47   ` Alexei Starovoitov
  1 sibling, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2021-02-10 19:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, LKML, Dave Hansen, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Masami Hiramatsu

On Wed, Feb 10, 2021 at 5:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Somewhat related.. I had this pending.
>
> ---
> Subject: kprobes: Remove kprobe::fault_handler
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Feb 2 10:43:41 CET 2021
>
> The reason for kprobe::fault_handler(), as given by its comment:
>
>  * We come here because instructions in the pre/post
>  * handler caused the page_fault, this could happen
>  * if handler tries to access user space by
>  * copy_from_user(), get_user() etc. Let the
>  * user-specified handler try to fix it first.
>
> It's just plain bad. Those other handlers are run from non-preemptible
> context and had better use _nofault() functions. Also, there is no
> upstream usage of this.

No objections from me.

Since Masami mentioned that systemtap used this facility, you
probably want to give them a courtesy heads-up that it's going away.
Though looking at systemtap source code I couldn't find any
reference to it. So it's likely a nop for them anyway.


* Re: [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault()
  2021-02-10  2:33 ` [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault() Andy Lutomirski
  2021-02-10 17:45   ` [tip: x86/mm] x86/{fault,efi}: " tip-bot2 for Andy Lutomirski
@ 2021-02-11  8:38   ` Ard Biesheuvel
  1 sibling, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2021-02-11  8:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, LKML, Dave Hansen, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Masami Hiramatsu, Peter Zijlstra, linux-efi

On Wed, 10 Feb 2021 at 03:34, Andy Lutomirski <luto@kernel.org> wrote:
>
> efi_recover_from_page_fault() doesn't recover -- it does a special EFI
> mini-oops.  Rename it to make it clear that it crashes.
>
> While renaming it, I noticed a blatant bug: a page fault oops in a
> different thread happening concurrently with an EFI runtime service call
> would be misinterpreted as an EFI page fault.  Fix that.
>
> This isn't quite exact.  We could do better by using a special CS for
> calls into EFI.
>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/x86/include/asm/efi.h     |  2 +-
>  arch/x86/mm/fault.c            | 11 ++++++-----
>  arch/x86/platform/efi/quirks.c | 16 ++++++++++++----
>  3 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index c98f78330b09..4b7706ddd8b6 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -150,7 +150,7 @@ extern void __init efi_apply_memmap_quirks(void);
>  extern int __init efi_reuse_config(u64 tables, int nr_tables);
>  extern void efi_delete_dummy_variable(void);
>  extern void efi_switch_mm(struct mm_struct *mm);
> -extern void efi_recover_from_page_fault(unsigned long phys_addr);
> +extern void efi_crash_gracefully_on_page_fault(unsigned long phys_addr);
>  extern void efi_free_boot_services(void);
>
>  /* kexec external ABI */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index eed217d4a877..dfdd56d9c020 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -16,7 +16,7 @@
>  #include <linux/prefetch.h>            /* prefetchw                    */
>  #include <linux/context_tracking.h>    /* exception_enter(), ...       */
>  #include <linux/uaccess.h>             /* faulthandler_disabled()      */
> -#include <linux/efi.h>                 /* efi_recover_from_page_fault()*/
> +#include <linux/efi.h>                 /* efi_crash_gracefully_on_page_fault()*/
>  #include <linux/mm_types.h>
>
>  #include <asm/cpufeature.h>            /* boot_cpu_has, ...            */
> @@ -25,7 +25,7 @@
>  #include <asm/vsyscall.h>              /* emulate_vsyscall             */
>  #include <asm/vm86.h>                  /* struct vm86                  */
>  #include <asm/mmu_context.h>           /* vma_pkey()                   */
> -#include <asm/efi.h>                   /* efi_recover_from_page_fault()*/
> +#include <asm/efi.h>                   /* efi_crash_gracefully_on_page_fault()*/
>  #include <asm/desc.h>                  /* store_idt(), ...             */
>  #include <asm/cpu_entry_area.h>                /* exception stack              */
>  #include <asm/pgtable_areas.h>         /* VMALLOC_START, ...           */
> @@ -700,11 +700,12 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
>  #endif
>
>         /*
> -        * Buggy firmware could access regions which might page fault, try to
> -        * recover from such faults.
> +        * Buggy firmware could access regions which might page fault.  If
> +        * this happens, EFI has a special OOPS path that will try to
> +        * avoid hanging the system.
>          */
>         if (IS_ENABLED(CONFIG_EFI))
> -               efi_recover_from_page_fault(address);
> +               efi_crash_gracefully_on_page_fault(address);
>
>  oops:
>         /*
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 5a40fe411ebd..0463ef9cddd6 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -687,15 +687,25 @@ int efi_capsule_setup_info(struct capsule_info *cap_info, void *kbuff,
>   * @return: Returns, if the page fault is not handled. This function
>   * will never return if the page fault is handled successfully.
>   */
> -void efi_recover_from_page_fault(unsigned long phys_addr)
> +void efi_crash_gracefully_on_page_fault(unsigned long phys_addr)
>  {
>         if (!IS_ENABLED(CONFIG_X86_64))
>                 return;
>
> +       /*
> +        * If we are in an interrupt nested inside an EFI runtime service,
> +        * then this is a regular OOPS, not an EFI failure.
> +        */
> +       if (in_interrupt() || in_nmi() || in_softirq())
> +               return;
> +
>         /*
>          * Make sure that an efi runtime service caused the page fault.
> +        * READ_ONCE() because we might be OOPSing in a different thread,
> +        * and we don't want to trip KTSAN while trying to OOPS.
>          */
> -       if (efi_rts_work.efi_rts_id == EFI_NONE)
> +       if (READ_ONCE(efi_rts_work.efi_rts_id) == EFI_NONE ||
> +           current_work() != &efi_rts_work.work)
>                 return;
>
>         /*
> @@ -747,6 +757,4 @@ void efi_recover_from_page_fault(unsigned long phys_addr)
>                 set_current_state(TASK_IDLE);
>                 schedule();
>         }
> -
> -       return;
>  }
> --
> 2.29.2
>


Thread overview: 36+ messages
2021-02-10  2:33 [PATCH v2 00/14] x86/fault: #PF improvements, mostly related to USER bit Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 01/14] x86/fault: Fix AMD erratum #91 errata fixup for user code Andy Lutomirski
2021-02-10  7:34   ` Christoph Hellwig
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 02/14] x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 03/14] x86/fault: Fold mm_fault_error() into do_user_addr_fault() Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 04/14] x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault() Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 05/14] x86/fault: Document the locking in the fault_signal_pending() path Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 06/14] x86/fault: Correct a few user vs kernel checks wrt WRUSS Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 07/14] x86/fault: Improve kernel-executing-user-memory handling Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 08/14] x86/fault: Skip erratum #93 workaround on new CPUs Andy Lutomirski
2021-02-10  6:06   ` Andy Lutomirski
2021-02-10 13:29     ` Borislav Petkov
2021-02-10 16:09       ` Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 09/14] x86/fault: Split the OOPS code out from no_context() Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 10/14] x86/fault: Bypass no_context() for implicit kernel faults from usermode Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 11/14] x86/fault: Rename no_context() to kernelmode_fixup_or_oops() Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 12/14] x86/fault: Don't look for extable entries for SMEP violations Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 13/14] x86/fault: Don't run fixups for SMAP violations Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] " tip-bot2 for Andy Lutomirski
2021-02-10  2:33 ` [PATCH v2 14/14] x86/fault, x86/efi: Fix and rename efi_recover_from_page_fault() Andy Lutomirski
2021-02-10 17:45   ` [tip: x86/mm] x86/{fault,efi}: " tip-bot2 for Andy Lutomirski
2021-02-11  8:38   ` [PATCH v2 14/14] x86/fault, x86/efi: " Ard Biesheuvel
2021-02-10 13:57 ` [RFC][PATCH] kprobes: Remove kprobe::fault_handler Peter Zijlstra
2021-02-10 19:09   ` Christoph Hellwig
2021-02-10 19:47   ` Alexei Starovoitov
