Hi!

This is the first batch of a 73-patch series which consolidates the x86
entry code.

This work started off as a trivial 5-patch series moving the heavy lifting
of POSIX CPU timers out of interrupt context into thread/process context.
That revealed that KVM fails to handle pending work items before entering
guest mode, and handling was added to the x86 KVM code. Review requested
making this generic infrastructure. The next series grew to 25 patches
implementing the generic infrastructure and converting x86 (and, as a POC,
ARM64) over, but it turned out that this was slightly incomplete and still
had some entanglement with the rest of the x86 entry code, as some of that
functionality is shared between syscall and interrupt entry/exit. It also
unearthed the nastiness of IOPL, which has already been addressed in
mainline.

This series addresses these issues in order to prepare for making the
entry from user space and exit to user space (and its counterpart, enter
guest) generic infrastructure, restricting the necessary ASM work to the
bare minimum.

The series is split into 5 parts:

  - General cleanups and bugfixes

  - Consolidation of the syscall entry/exit code

  - Autogenerate simple exception/trap code and reduce the difference
    between 32 and 64 bit

  - Autogenerate complex exception/trap code and provide different entry
    points for #DB and #MC exceptions, which allows addressing the
    recently discovered RCU vs. world issues in a more structured way

  - Convert the device interrupt entry code to use the same mechanism as
    exceptions and traps, and finally convert the system vectors over as
    well

The last step after all those cleanups is to move the return from
exception/interrupt logic (user mode work, kernel preemption) completely
from ASM into C code, so the ASM code just has to take care of returning
from the exception, which is horrible and convoluted enough already.
At the end the x86 entry code is ready to move the syscall parts out into
generic code and finally tackle the initial problem which started all of
this.

The complete series is available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/entry

which contains all 73 patches. The individual parts are tagged, so this
part can be retrieved via:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v1-part1

Thanks,

	tglx

8<---------------
 entry/entry_32.S          |   19 +++++++------------
 include/asm/irq.h         |    2 +-
 include/asm/mce.h         |    3 ---
 include/asm/traps.h       |   17 +++++++----------
 kernel/cpu/mce/core.c     |   12 ++++++++++--
 kernel/cpu/mce/internal.h |    3 +++
 kernel/irq.c              |    3 +--
 kernel/traps.c            |   41 ++++++++++++++++++++++++++++++++++-------
 8 files changed, 63 insertions(+), 37 deletions(-)
All exception entry points must have ASM_CLAC right at the beginning. The
general_protection entry is missing one.

Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 arch/x86/entry/entry_32.S |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1694,6 +1694,7 @@ SYM_CODE_START(int3)
 SYM_CODE_END(int3)
 
 SYM_CODE_START(general_protection)
+	ASM_CLAC
 	pushl	$do_general_protection
 	jmp	common_exception
 SYM_CODE_END(general_protection)
From: Andy Lutomirski <luto@kernel.org>

do_machine_check() can be raised in almost any context including the most
fragile ones. Prevent kprobes and tracing.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/traps.h   |    3 ---
 arch/x86/kernel/cpu/mce/core.c |   12 ++++++++++--
 2 files changed, 10 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -88,9 +88,6 @@ dotraplinkage void do_page_fault(struct
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
 dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
-#ifdef CONFIG_X86_MCE
-dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code);
-#endif
 dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
 #ifdef CONFIG_X86_32
 dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *
  * On Intel systems this is entered on all CPUs in parallel through
  * MCE broadcast. However some CPUs might be broken beyond repair,
  * so be always careful when synchronizing with others.
+ *
+ * Tracing and kprobes are disabled: if we interrupted a kernel context
+ * with IF=1, we need to minimize stack usage. There are also recursion
+ * issues: if the machine check was due to a failure of the memory
+ * backing the user stack, tracing that reads the user stack will cause
+ * potentially infinite recursion.
  */
-void do_machine_check(struct pt_regs *regs, long error_code)
+void notrace do_machine_check(struct pt_regs *regs, long error_code)
 {
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
@@ -1360,6 +1366,7 @@ void do_machine_check(struct pt_regs *re
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
+NOKPROBE_SYMBOL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int flags)
@@ -1892,10 +1899,11 @@ static void unexpected_machine_check(str
 void (*machine_check_vector)(struct pt_regs *, long error_code) = unexpected_machine_check;
 
-dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
+dotraplinkage notrace void do_mce(struct pt_regs *regs, long error_code)
 {
 	machine_check_vector(regs, error_code);
 }
+NOKPROBE_SYMBOL(do_mce);
 
 /*
  * Called for each booted CPU to set up machine checks.
Remove the pointless difference between 32 and 64 bit to make further
unifications simpler.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_32.S          |    2 +-
 arch/x86/include/asm/mce.h         |    3 ---
 arch/x86/kernel/cpu/mce/internal.h |    3 +++
 3 files changed, 4 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1365,7 +1365,7 @@ SYM_CODE_END(divide_error)
 SYM_CODE_START(machine_check)
 	ASM_CLAC
 	pushl	$0
-	pushl	machine_check_vector
+	pushl	$do_mce
 	jmp	common_exception
 SYM_CODE_END(machine_check)
 #endif
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -238,9 +238,6 @@ extern void mce_disable_bank(int bank);
 /*
  * Exception handler
  */
-
-/* Call the installed machine check handler for this CPU setup. */
-extern void (*machine_check_vector)(struct pt_regs *, long error_code);
 void do_machine_check(struct pt_regs *, long);
 
 /*
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -8,6 +8,9 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
+/* Pointer to the installed machine check handler for this CPU setup. */
+extern void (*machine_check_vector)(struct pt_regs *, long error_code);
+
 enum severity_level {
 	MCE_NO_SEVERITY,
 	MCE_DEFERRED_SEVERITY,
From: Thomas Gleixner <tglx@linutronix.de>

That function returns immediately after conditionally reenabling
interrupts, which is more than pointless and requires the ASM code to
disable interrupts again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20191023123117.871608831@linutronix.de
---
 arch/x86/kernel/traps.c |    1 -
 1 file changed, 1 deletion(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -862,7 +862,6 @@ do_simd_coprocessor_error(struct pt_regs
 dotraplinkage void
 do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 {
-	cond_local_irq_enable(regs);
 }
 
 dotraplinkage void
Add a comment which explains why this empty handler for a reserved vector
exists.

Requested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/traps.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -862,6 +862,25 @@ do_simd_coprocessor_error(struct pt_regs
 dotraplinkage void
 do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 {
+	/*
+	 * This addresses a Pentium Pro Erratum:
+	 *
+	 * PROBLEM: If the APIC subsystem is configured in mixed mode with
+	 * Virtual Wire mode implemented through the local APIC, an
+	 * interrupt vector of 0Fh (Intel reserved encoding) may be
+	 * generated by the local APIC (Int 15). This vector may be
+	 * generated upon receipt of a spurious interrupt (an interrupt
+	 * which is removed before the system receives the INTA sequence)
+	 * instead of the programmed 8259 spurious interrupt vector.
+	 *
+	 * IMPLICATION: The spurious interrupt vector programmed in the
+	 * 8259 is normally handled by an operating system's spurious
+	 * interrupt handler. However, a vector of 0Fh is unknown to some
+	 * operating systems, which would crash if this erratum occurred.
+	 *
+	 * In theory this could be limited to 32bit, but the handler is not
+	 * hurting and who knows which other CPUs suffer from this.
+	 */
 }
 
 dotraplinkage void
Reorder the declarations in asm/traps.h so the 64-bit specific prototypes
are grouped together after the common ones.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/traps.h |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -76,13 +76,6 @@ dotraplinkage void do_coprocessor_segmen
 dotraplinkage void do_invalid_TSS(struct pt_regs *regs, long error_code);
 dotraplinkage void do_segment_not_present(struct pt_regs *regs, long error_code);
 dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code);
-#ifdef CONFIG_X86_64
-dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsigned long address);
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
-asmlinkage __visible notrace
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
-void __init trap_init(void);
-#endif
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
 dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
@@ -94,6 +87,13 @@ dotraplinkage void do_iret_error(struct
 #endif
 dotraplinkage void do_mce(struct pt_regs *regs, long error_code);
 
+#ifdef CONFIG_X86_64
+asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
+asmlinkage __visible notrace
+struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
+void __init trap_init(void);
+#endif
+
 static inline int get_si_code(unsigned long condition)
 {
 	if (condition & DR_STEP)
Nothing is using the return value of do_IRQ().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/irq.h |    2 +-
 arch/x86/kernel/irq.c      |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -36,7 +36,7 @@ extern void native_init_IRQ(void);
 
 extern void handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
-extern __visible unsigned int do_IRQ(struct pt_regs *regs);
+extern __visible void do_IRQ(struct pt_regs *regs);
 
 extern void init_ISA_irqs(void);
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -230,7 +230,7 @@ u64 arch_irq_stat(void)
  * SMP cross-CPU interrupts have their own specific
  * handlers).
  */
-__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
+__visible void __irq_entry do_IRQ(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
@@ -263,7 +263,6 @@ u64 arch_irq_stat(void)
 	exiting_irq();
 
 	set_irq_regs(old_regs);
-	return 1;
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
Nothing cares about the -1 "mark as interrupt" in the error code anymore.
Just use 0 consistently for all exceptions which do not have an error code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_32.S |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1290,7 +1290,7 @@ SYM_CODE_END(simd_coprocessor_error)
 
 SYM_CODE_START(device_not_available)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_device_not_available
 	jmp	common_exception
 SYM_CODE_END(device_not_available)
@@ -1531,7 +1531,7 @@ SYM_CODE_START(debug)
 	 * Entry from sysenter is now handled in common_exception
 	 */
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_debug
 	jmp	common_exception
 SYM_CODE_END(debug)
@@ -1682,7 +1682,7 @@ SYM_CODE_END(nmi)
 
 SYM_CODE_START(int3)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
int3 is not using the common_exception path for purely historical reasons,
but there is no reason to keep it the only exception which is handled
differently.

Make it use common_exception so the upcoming changes to autogenerate the
entry stubs do not have to special-case int3.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_32.S |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1683,14 +1683,8 @@ SYM_CODE_END(nmi)
 SYM_CODE_START(int3)
 	ASM_CLAC
 	pushl	$0
-
-	SAVE_ALL switch_stacks=1
-	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
-	xorl	%edx, %edx			# zero error code
-	movl	%esp, %eax			# pt_regs pointer
-	call	do_int3
-	jmp	ret_from_exception
+	pushl	$do_int3
+	jmp	common_exception
 SYM_CODE_END(int3)
 
 SYM_CODE_START(general_protection)
#BP is no longer using IST, and using ist_enter() and ist_exit() makes it
harder to change ist_enter() and ist_exit()'s behavior. Instead open-code
the very small amount of required logic.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/traps.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -572,14 +572,20 @@ dotraplinkage void notrace do_int3(struc
 		return;
 
 	/*
-	 * Use ist_enter despite the fact that we don't use an IST stack.
-	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
-	 * mode or even during context tracking state changes.
+	 * Unlike any other non-IST entry, we can be called from a kprobe in
+	 * non-CONTEXT_KERNEL kernel mode or even during context tracking
+	 * state changes. Make sure that we wake up RCU even if we're coming
+	 * from kernel code.
 	 *
-	 * This means that we can't schedule. That's okay.
+	 * This means that we can't schedule even if we came from a
+	 * preemptible kernel context. That's okay.
 	 */
-	ist_enter(regs);
+	if (!user_mode(regs)) {
+		rcu_nmi_enter();
+		preempt_disable();
+	}
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code,
 			 X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP)
@@ -600,7 +606,10 @@ dotraplinkage void notrace do_int3(struc
 	cond_local_irq_disable(regs);
 
 exit:
-	ist_exit(regs);
+	if (!user_mode(regs)) {
+		preempt_enable_no_resched();
+		rcu_nmi_exit();
+	}
 }
 NOKPROBE_SYMBOL(do_int3);
On Tue, Feb 25, 2020 at 10:36:37PM +0100, Thomas Gleixner wrote:
> All exception entry points must have ASM_CLAC right at the
> beginning. The general_protection entry is missing one.
>
> Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry")
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
On Tue, Feb 25, 2020 at 10:36:39PM +0100, Thomas Gleixner wrote:
> Remove the pointless difference between 32 and 64 bit to make further
> unifications simpler.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
On Tue, Feb 25, 2020 at 10:36:38PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
>
> do_machine_check() can be raised in almost any context including the most
> fragile ones. Prevent kprobes and tracing.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/include/asm/traps.h | 3 ---
> arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++--
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -88,9 +88,6 @@ dotraplinkage void do_page_fault(struct
> dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
> dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
> dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
> -#ifdef CONFIG_X86_MCE
> -dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code);
> -#endif
> dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
> #ifdef CONFIG_X86_32
> dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *
> * On Intel systems this is entered on all CPUs in parallel through
> * MCE broadcast. However some CPUs might be broken beyond repair,
> * so be always careful when synchronizing with others.
> + *
> + * Tracing and kprobes are disabled: if we interrupted a kernel context
> + * with IF=1, we need to minimize stack usage. There are also recursion
> + * issues: if the machine check was due to a failure of the memory
> + * backing the user stack, tracing that reads the user stack will cause
> + * potentially infinite recursion.
> */
> -void do_machine_check(struct pt_regs *regs, long error_code)
> +void notrace do_machine_check(struct pt_regs *regs, long error_code)
> {
> DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
> DECLARE_BITMAP(toclear, MAX_NR_BANKS);
> @@ -1360,6 +1366,7 @@ void do_machine_check(struct pt_regs *re
> ist_exit(regs);
> }
> EXPORT_SYMBOL_GPL(do_machine_check);
> +NOKPROBE_SYMBOL(do_machine_check);
That won't protect all the function called by do_machine_check(), right?
There are lots of them.
On Tue, Feb 25, 2020 at 10:36:40PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> That function returns immediately after conditionally reenabling interrupts which
> is more than pointless and requires the ASM code to disable interrupts again.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20191023123117.871608831@linutronix.de
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
On 2/25/20 1:36 PM, Thomas Gleixner wrote:
> Hi!
>
> This is the first batch of a 73 patches series which consolidates the x86
> entry code.
>
Egads!
If there's a v2, I don't suppose you could tweak your subject-prefix to
generate something like [patch part1 2/10]?
--Andy
On 2/25/20 5:13 PM, Frederic Weisbecker wrote:
> On Tue, Feb 25, 2020 at 10:36:38PM +0100, Thomas Gleixner wrote:
>> From: Andy Lutomirski <luto@kernel.org>
>>
>> do_machine_check() can be raised in almost any context including the most
>> fragile ones. Prevent kprobes and tracing.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>> arch/x86/include/asm/traps.h | 3 ---
>> arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++--
>> 2 files changed, 10 insertions(+), 5 deletions(-)
>>
>> --- a/arch/x86/include/asm/traps.h
>> +++ b/arch/x86/include/asm/traps.h
>> @@ -88,9 +88,6 @@ dotraplinkage void do_page_fault(struct
>> dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
>> dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
>> dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
>> -#ifdef CONFIG_X86_MCE
>> -dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code);
>> -#endif
>> dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
>> #ifdef CONFIG_X86_32
>> dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *
>> * On Intel systems this is entered on all CPUs in parallel through
>> * MCE broadcast. However some CPUs might be broken beyond repair,
>> * so be always careful when synchronizing with others.
>> + *
>> + * Tracing and kprobes are disabled: if we interrupted a kernel context
>> + * with IF=1, we need to minimize stack usage. There are also recursion
>> + * issues: if the machine check was due to a failure of the memory
>> + * backing the user stack, tracing that reads the user stack will cause
>> + * potentially infinite recursion.
>> */
>> -void do_machine_check(struct pt_regs *regs, long error_code)
>> +void notrace do_machine_check(struct pt_regs *regs, long error_code)
>> {
>> DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
>> DECLARE_BITMAP(toclear, MAX_NR_BANKS);
>> @@ -1360,6 +1366,7 @@ void do_machine_check(struct pt_regs *re
>> ist_exit(regs);
>> }
>> EXPORT_SYMBOL_GPL(do_machine_check);
>> +NOKPROBE_SYMBOL(do_machine_check);
>
> That won't protect all the function called by do_machine_check(), right?
> There are lots of them.
>
It at least means we can survive to run actual C code in
do_machine_check(), which lets us try to mitigate this issue further.
PeterZ has patches for that, and maybe this series fixes it later on.
(I'm reading in order!)
On 2/25/20 1:36 PM, Thomas Gleixner wrote:
> Nothing cares about the -1 "mark as interrupt" in the errorcode anymore. Just
> use 0 for all excpetions which do not have an errorcode consistently.
>
I sincerely wish this were the case. But look at collect_syscall() in
lib/syscall.c.
It would be really quite nice to address this for real in some
low-overhead way. My suggestion would be to borrow a trick from 32-bit:
split regs->cs into ->cs and ->__csh, and stick CS_FROM_SYSCALL into
__csh for syscalls. This will only add any overhead at all to the int80
case. Then we could adjust syscall_get_nr() to look for CS_FROM_SYSCALL.
What do you think? An alternative would be to use the stack walking
machinery in collect_syscall(), since the mere existence of that
function is abomination and we may not care about performance.
--Andy
On 2/25/20 1:36 PM, Thomas Gleixner wrote:
> Hi!
>
> This is the first batch of a 73 patches series which consolidates the x86
> entry code.
>
>
In part 1, all but 08/10 are:
Reviewed-by: Andy Lutomirski <luto@kernel.org>
I object to patch 8 as described in my reply.
On Tue, Feb 25, 2020 at 10:36:38PM +0100, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *
>  * On Intel systems this is entered on all CPUs in parallel through
>  * MCE broadcast. However some CPUs might be broken beyond repair,
>  * so be always careful when synchronizing with others.
> + *
> + * Tracing and kprobes are disabled: if we interrupted a kernel context
> + * with IF=1, we need to minimize stack usage. There are also recursion
> + * issues: if the machine check was due to a failure of the memory
> + * backing the user stack, tracing that reads the user stack will cause
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Had to read this a couple of times to parse that formulation properly.
Make that

  "... backing the user stack, tracing code which accesses the same user
   stack will potentially cause an infinite recursion."

With that:

Reviewed-by: Borislav Petkov <bp@suse.de>

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Feb 25, 2020 at 09:29:00PM -0800, Andy Lutomirski wrote:
> >> +void notrace do_machine_check(struct pt_regs *regs, long error_code)
> >> {
> >> DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
> >> DECLARE_BITMAP(toclear, MAX_NR_BANKS);
> >> @@ -1360,6 +1366,7 @@ void do_machine_check(struct pt_regs *re
> >> ist_exit(regs);
> >> }
> >> EXPORT_SYMBOL_GPL(do_machine_check);
> >> +NOKPROBE_SYMBOL(do_machine_check);
> >
> > That won't protect all the function called by do_machine_check(), right?
> > There are lots of them.
> >
>
> It at least means we can survive to run actual C code in
> do_machine_check(), which lets us try to mitigate this issue further.
> PeterZ has patches for that, and maybe this series fixes it later on.
> (I'm reading in order!)
Yeah, I don't cover that either. Making the kernel completely kprobe
safe is _lots_ more work I think.
We really need some form of automation for this :/ The current situation
is completely nonsatisfactory.
On Wed, Feb 26, 2020 at 5:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 25, 2020 at 09:29:00PM -0800, Andy Lutomirski wrote:
>
> > >> +void notrace do_machine_check(struct pt_regs *regs, long error_code)
> > >> {
> > >> DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
> > >> DECLARE_BITMAP(toclear, MAX_NR_BANKS);
> > >> @@ -1360,6 +1366,7 @@ void do_machine_check(struct pt_regs *re
> > >> ist_exit(regs);
> > >> }
> > >> EXPORT_SYMBOL_GPL(do_machine_check);
> > >> +NOKPROBE_SYMBOL(do_machine_check);
> > >
> > > That won't protect all the function called by do_machine_check(), right?
> > > There are lots of them.
> > >
> >
> > It at least means we can survive to run actual C code in
> > do_machine_check(), which lets us try to mitigate this issue further.
> > PeterZ has patches for that, and maybe this series fixes it later on.
> > (I'm reading in order!)
>
> Yeah, I don't cover that either. Making the kernel completely kprobe
> safe is _lots_ more work I think.
>
> We really need some form of automation for this :/ The current situation
> is completely nonsatisfactory.
I've looked at too many patches lately and lost track a bit of which
is which. Shouldn't a simple tracing_disable() or similar in
do_machine_check() be sufficient? We'd maybe want automation to check
everything before it. We still need to survive hitting a kprobe int3,
but that shouldn't have recursion issues.
(Yes, that function doesn't exist in current kernels. And we'd need
to make sure that BPF respects it.)
On Wed, Feb 26, 2020 at 07:10:01AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 26, 2020 at 5:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Feb 25, 2020 at 09:29:00PM -0800, Andy Lutomirski wrote:
> > > > That won't protect all the function called by do_machine_check(), right?
> > > > There are lots of them.
> > >
> > > It at least means we can survive to run actual C code in
> > > do_machine_check(), which lets us try to mitigate this issue further.
> > > PeterZ has patches for that, and maybe this series fixes it later on.
> > > (I'm reading in order!)
> >
> > Yeah, I don't cover that either. Making the kernel completely kprobe
> > safe is _lots_ more work I think.
> >
> > We really need some form of automation for this :/ The current situation
> > is completely nonsatisfactory.
>
> I've looked at too many patches lately and lost track a bit of which
> is which. Shouldn't a simple tracing_disable() or similar in
> do_machine_check() be sufficient?

It entirely depends on what the goal is :-/ On the one hand I see why
people might want function tracing / kprobes enabled, OTOH it's all
mighty frigging scary. Any tracing/probing/whatever on an MCE has the
potential to make a bad situation worse -- not unlike the same on #DF.

The same with that compiler instrumentation crap; allowing kprobes on
*SAN code has the potential to inject probes in arbitrary random code.
At the same time, if you're running a kernel with that on and injecting
kprobes in it, you're welcome to own the remaining pieces.

How far do we want to go? At some point I think we'll have to give
people rope, show them the knot and leave them be.

> We'd maybe want automation to check
> everything before it. We still need to survive hitting a kprobe int3,
> but that shouldn't have recursion issues.

Right, so I think avoiding the obvious recursion issues is a more
tractable problem and yes some 'safe' spot annotation should be enough
to get automation working for that -- mostly.
On Tue, Feb 25, 2020 at 10:36:41PM +0100, Thomas Gleixner wrote:
> Add a comment which explains why this empty handler for a reserved vector
> exists.
>
> Requested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/kernel/traps.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -862,6 +862,25 @@ do_simd_coprocessor_error(struct pt_regs
> dotraplinkage void
> do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
> {
> + /*
> + * This addresses a Pentium Pro Erratum:
> + *
> + * PROBLEM: If the APIC subsystem is configured in mixed mode with
> + * Virtual Wire mode implemented through the local APIC, an
> + * interrupt vector of 0Fh (Intel reserved encoding) may be
> + * generated by the local APIC (Int 15). This vector may be
> + * generated upon receipt of a spurious interrupt (an interrupt
> + * which is removed before the system receives the INTA sequence)
> + * instead of the programmed 8259 spurious interrupt vector.
> + *
> + * IMPLICATION: The spurious interrupt vector programmed in the
> + * 8259 is normally handled by an operating system's spurious
> + * interrupt handler. However, a vector of 0Fh is unknown to some
> + * operating systems, which would crash if this erratum occurred.
> + *
> + * In theory this could be limited to 32bit, but the handler is not
> + * hurting and who knows which other CPUs suffer from this.
> + */
> }
Nice to have!
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> On Feb 26, 2020, at 8:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 26, 2020 at 07:10:01AM -0800, Andy Lutomirski wrote:
>>> On Wed, Feb 26, 2020 at 5:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Tue, Feb 25, 2020 at 09:29:00PM -0800, Andy Lutomirski wrote:
>>>
>>>>>> That won't protect all the function called by do_machine_check(), right?
>>>>>> There are lots of them.
>>>>>
>>>>> It at least means we can survive to run actual C code in
>>>>> do_machine_check(), which lets us try to mitigate this issue further.
>>>>> PeterZ has patches for that, and maybe this series fixes it later on.
>>>>> (I'm reading in order!)
>>>>
>>>> Yeah, I don't cover that either. Making the kernel completely kprobe
>>>> safe is _lots_ more work I think.
>>>>
>>>> We really need some form of automation for this :/ The current situation
>>>> is completely nonsatisfactory.
>>>
>>> I've looked at too many patches lately and lost track a bit of which
>>> is which. Shouldn't a simple tracing_disable() or similar in
>>> do_machine_check() be sufficient?
>
> It entirely depends on what the goal is :-/ On the one hand I see why
> people might want function tracing / kprobes enabled, OTOH it's all
> mighty frigging scary. Any tracing/probing/whatever on an MCE has the
> potential to make a bad situation worse -- not unlike the same on #DF.
>
> The same with that compiler instrumentation crap; allowing kprobes on
> *SAN code has the potential to inject probes in arbitrary random code.
> At the same time, if you're running a kernel with that on and injecting
> kprobes in it, you're welcome to own the remaining pieces.

Agreed.

> How far do we want to go? At some point I think we'll have to give
> people rope, show them the knot and leave them be.

If someone puts a kprobe on some TLB flush thing and an MCE does memory
failure handling, it would be polite to avoid crashing. OTOH the x86
memory failure story is *so* bad that I'm not sure how well we can ever
really expect this to work. I think we should aim to get the entry part
correct, and if the meat of the function explodes, so be it.

>> We'd maybe want automation to check
>> everything before it. We still need to survive hitting a kprobe int3,
>> but that shouldn't have recursion issues.
>
> Right, so I think avoiding the obvious recursion issues is a more
> tractable problem and yes some 'safe' spot annotation should be enough
> to get automation working for that -- mostly.
On Tue, Feb 25, 2020 at 10:36:45PM +0100, Thomas Gleixner wrote:
> int3 is not using the common_exception path for purely historical reasons,
> but there is no reason to keep it the only exception which is different.
>
> Make it use common_exception so the upcoming changes to autogenerate the
> entry stubs do not have to special case int3.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
On Wed, Feb 26, 2020 at 09:28:51AM -0800, Andy Lutomirski wrote:
> > It entirely depends on what the goal is :-/ On the one hand I see why
> > people might want function tracing / kprobes enabled, OTOH it's all
> > mighty frigging scary. Any tracing/probing/whatever on an MCE has the
> > potential to make a bad situation worse -- not unlike the same on #DF.

FWIW, I had this at the beginning of the #MC handler in a feeble attempt
to poke at this:

+	hw_breakpoint_disable();
+	static_key_disable(&__tracepoint_read_msr.key);
+	tracing_off();

But then Tony noted that some recoverable errors do get reported with an
#MC exception so we would have to look at the error severity and then
decide whether to allow tracing or not. But the error severity happens
all the way down in __mc_scan_banks() - i.e., we've executed half of the
handler already.

So, frankly, I wanna say, f*ck tracing etc - there are certain handlers
which simply don't allow it. And we'll only consider changing that when
a really good reason for it appears...

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Andy Lutomirski <luto@kernel.org> writes:
> On 2/25/20 1:36 PM, Thomas Gleixner wrote:
>> Nothing cares about the -1 "mark as interrupt" in the errorcode anymore. Just
>> use 0 for all exceptions which do not have an errorcode consistently.
>>
>
> I sincerely wish this were the case. But look at collect_syscall() in
> lib/syscall.c.
>
> It would be really quite nice to address this for real in some
> low-overhead way. My suggestion would be to borrow a trick from 32-bit:
> split regs->cs into ->cs and ->__csh, and stick CS_FROM_SYSCALL into
> __csh for syscalls. This will only add any overhead at all to the int80
> case. Then we could adjust syscall_get_nr() to look for CS_FROM_SYSCALL.
>
> What do you think? An alternative would be to use the stack walking
> machinery in collect_syscall(), since the mere existence of that
> function is abomination and we may not care about performance.
Looking deeper. The code in common_exception does:
movl PT_ORIG_EAX(%esp), %edx # get the error code
movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
So whatever the exception pushed on the stack, the value which
collect_syscall() finds is -1.
The pushed value is used as the error_code argument for the exception
handler and I really can't find a single one which cares (anymore).
But darn, I overlooked that it's propagated to do_trap() and
friends. But even if this causes a user visible change, I doubt that
anything cares about it today, simply because for giggles a 64bit kernel
unconditionally pushes 0 for all exceptions which do not have a hardware
error code on stack. So any 32bit application which expects a
particular error code (0/-1) in the signal would have been broken on the
first day it ran on a 64bit kernel.
If someone yells regression, then that's really trivial to fix in
C-code.
Let me rephrase the changelog for this.
Thanks,
tglx
On Wed, Feb 26, 2020 at 10:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
>
> > On 2/25/20 1:36 PM, Thomas Gleixner wrote:
> >> Nothing cares about the -1 "mark as interrupt" in the errorcode anymore. Just
> >> use 0 for all exceptions which do not have an errorcode consistently.
> >>
> >
> > I sincerely wish this were the case. But look at collect_syscall() in
> > lib/syscall.c.
> >
> > It would be really quite nice to address this for real in some
> > low-overhead way. My suggestion would be to borrow a trick from 32-bit:
> > split regs->cs into ->cs and ->__csh, and stick CS_FROM_SYSCALL into
> > __csh for syscalls. This will only add any overhead at all to the int80
> > case. Then we could adjust syscall_get_nr() to look for CS_FROM_SYSCALL.
> >
> > What do you think? An alternative would be to use the stack walking
> > machinery in collect_syscall(), since the mere existence of that
> > function is abomination and we may not care about performance.
>
> Looking deeper. The code in common_exception does:
>
> movl PT_ORIG_EAX(%esp), %edx # get the error code
> movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
>
> So whatever the exception pushed on the stack, the value which
> collect_syscall() finds is -1.
>
> The pushed value is used as the error_code argument for the exception
> handler and I really can't find a single one which cares (anymore).
>
> But darn, I overlooked that it's propagated to do_trap() and
> friends. But even if this causes a user visible change, I doubt that
> anything cares about it today, simply because for giggles a 64bit kernel
> unconditionally pushes 0 for all exceptions which do not have a hardware
> error code on stack. So any 32bit application which expects a
> particular error code (0/-1) in the signal would have been broken on the
> first day it ran on a 64bit kernel.
>
> If someone yells regression, then that's really trivial to fix in
> C-code.
I *think* this is plumbed much more directly to userspace:
$ cat /proc/$$/syscall
61 0xffffffff 0x7ffccf734ed0 0xa 0x0 0x1 0x0 0x7ffccf734eb8 0x7f0667465eda
That entire feature is highly dubious and I suppose we could just
delete it. But right now, we at least pretend that we can tell,
totally asynchronously, whether another task is in a syscall. Unless
we do *something*, though, I think you shouldn't make this change.
Sticking 0 in the error_code field in ucontext for a signal with no
error code seems entirely harmless to me in contrast.
On Wed, Feb 26, 2020 at 07:42:37PM +0100, Borislav Petkov wrote:
> On Wed, Feb 26, 2020 at 09:28:51AM -0800, Andy Lutomirski wrote:
> > > It entirely depends on what the goal is :-/ On the one hand I see why
> > > people might want function tracing / kprobes enabled, OTOH it's all
> > > mighty frigging scary. Any tracing/probing/whatever on an MCE has the
> > > potential to make a bad situation worse -- not unlike the same on #DF.
>
> FWIW, I had this at the beginning of the #MC handler in a feeble attempt
> to poke at this:
>
> + hw_breakpoint_disable();
> + static_key_disable(&__tracepoint_read_msr.key);
> + tracing_off();
You can't do static_key_disable() from an IST
> On Feb 26, 2020, at 10:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 26, 2020 at 07:42:37PM +0100, Borislav Petkov wrote:
>> On Wed, Feb 26, 2020 at 09:28:51AM -0800, Andy Lutomirski wrote:
>>>> It entirely depends on what the goal is :-/ On the one hand I see why
>>>> people might want function tracing / kprobes enabled, OTOH it's all
>>>> mighty frigging scary. Any tracing/probing/whatever on an MCE has the
>>>> potential to make a bad situation worse -- not unlike the same on #DF.
>>
>> FWIW, I had this at the beginning of the #MC handler in a feeble attempt
>> to poke at this:
>>
>> + hw_breakpoint_disable();
>> + static_key_disable(&__tracepoint_read_msr.key);
>> + tracing_off();
>
> You can't do static_key_disable() from an IST
Can we set a percpu variable saying “in some stupid context, don’t trace”?
Andy Lutomirski <luto@kernel.org> writes:
> On Wed, Feb 26, 2020 at 10:42 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The pushed value is used as the error_code argument for the exception
>> handler and I really can't find a single one which cares (anymore).
>>
>> But darn, I overlooked that it's propagated to do_trap() and
>> friends, but even if this causes a user visible change, I doubt that
>> anything cares about it today simply because for giggles a 64bit kernel
>> unconditionally pushes 0 for all exceptions which do not have a hardware
>> error code on stack. So any 32bit application which expects a
>> particular error code (0/-1) in the signal would have been broken on the
>> first day it ran on a 64bit kernel.
>>
>> If someone yells regression, then that's really trivial to fix in
>> C-code.
>
> I *think* this is plumbed much more directly to userspace:
>
> $ cat /proc/$$/syscall
> 61 0xffffffff 0x7ffccf734ed0 0xa 0x0 0x1 0x0 0x7ffccf734eb8 0x7f0667465eda

The task is in syscall 61 and the 0xffffffff is syscall args[0], so I'm
not sure what you are trying to demonstrate.

> That entire feature is highly dubious and I suppose we could just
> delete it. But right now, we at least pretend that we can tell,
> totally asynchronously, whether another task is in a syscall. Unless
> we do *something*, though, I think you shouldn't make this change.

So if a task actually hits a breakpoint, that syscall proc thing reads:

  -1 0xffffd0e0 0x565561a6

So even if the entry stub pushes 0, the fixup turns it into -1.

Thanks,

	tglx
On Wed, 26 Feb 2020 11:09:03 -0800
Andy Lutomirski <luto@amacapital.net> wrote:
> > On Feb 26, 2020, at 10:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Feb 26, 2020 at 07:42:37PM +0100, Borislav Petkov wrote:
> >> On Wed, Feb 26, 2020 at 09:28:51AM -0800, Andy Lutomirski wrote:
> >>>> It entirely depends on what the goal is :-/ On the one hand I see why
> >>>> people might want function tracing / kprobes enabled, OTOH it's all
> >>>> mighty frigging scary. Any tracing/probing/whatever on an MCE has the
> >>>> potential to make a bad situation worse -- not unlike the same on #DF.
> >>
> >> FWIW, I had this at the beginning of the #MC handler in a feeble attempt
> >> to poke at this:
> >>
> >> + hw_breakpoint_disable();
> >> + static_key_disable(&__tracepoint_read_msr.key);
> >> + tracing_off();
> >
> > You can't do static_key_disable() from an IST
>
> Can we set a percpu variable saying “in some stupid context, don’t trace”?
It matters what kind of tracing you care about. "tracing_off()" only stops
writing to the ring buffer, but all the mechanisms are still in place (the
things you want to stop).

And "tracing" is not the issue. The issue is usually the breakpoint that is
added to modify the jump labels to enable tracing, or a flag that has a
trace event do a user space stack dump. It's not tracing you want to stop,
it's the other parts that are attached to tracing that make it dangerous.
-- Steve
On 2/25/20 10:36 PM, Thomas Gleixner wrote:
> Hi!
>
> This is the first batch of a 73 patches series which consolidates the x86
> entry code.
>
> This work started off as a trivial 5-patch series moving the heavy
> lifting of POSIX CPU timers out of interrupt context into thread/process
> context. This uncovered that KVM fails to handle pending work items
> before entering guest mode; the handling was added to the x86 KVM
> code. Review feedback requested to make this a generic infrastructure.
>
> The next series grew to 25 patches implementing the generic infrastructure,
> converting x86 (and as a POC ARM64) over, but it turned out that this was
> slightly incomplete and still had some entanglement with the rest of the
> x86 entry code as some of that functionality is shared between syscall and
> interrupt entry/exit. It also unearthed the nastiness of IOPL, which has
> already been addressed in mainline.
>
> This series addresses these issues to prepare for making the entry from
> userspace and exit to userspace (and its counterpart, guest enter/exit) a
> generic infrastructure, restricting the necessary ASM work to the
> bare minimum.
>
> The series is split into 5 parts:
>
> - General cleanups and bugfixes
>
> - Consolidation of the syscall entry/exit code
>
> - Autogenerate simple exception/trap code and reduce the difference
> between 32 and 64 bit
>
> - Autogenerate complex exception/trap code and provide different entry
> points for #DB and #MC exceptions which allows to address the
> recently discovered RCU vs. world issues in a more structured way
>
> - Convert the device interrupt entry code to use the same mechanism as
> exceptions and traps and finally convert the system vectors over as
> well. The last step after all those cleanups is to move the return
> from exception/interrupt logic (user mode work, kernel preemption)
> completely from ASM into C-code, so the ASM code just has to take
> care about returning from the exception, which is horrible and
> convoluted enough already.
>
> At the end the x86 entry code is ready to move the syscall parts out into
> generic code and finally tackle the initial problem which started all of
> this.
>
> The complete series is available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/entry
>
> which contains all 73 patches. The individual parts are tagged, so this
> part can be retrieved via:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v1-part1
>
> Thanks,
>
> tglx
>
> 8<---------------
> entry/entry_32.S | 19 +++++++------------
> include/asm/irq.h | 2 +-
> include/asm/mce.h | 3 ---
> include/asm/traps.h | 17 +++++++----------
> kernel/cpu/mce/core.c | 12 ++++++++++--
> kernel/cpu/mce/internal.h | 3 +++
> kernel/irq.c | 3 +--
> kernel/traps.c | 41 ++++++++++++++++++++++++++++++++++-------
> 8 files changed, 63 insertions(+), 37 deletions(-)
>
For part I:
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
for all patches. I had a slight concern about patch 08 and propagating a
different error_code, but I agree with your argument that it is limited to
32-bit processes and now matches the 64-bit behavior.
alex.
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     ac3607f92f70c762e24d0ae731168f7584de51ec
Gitweb:        https://git.kernel.org/tip/ac3607f92f70c762e24d0ae731168f7584de51ec
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:45 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:41 +01:00

x86/entry/entry_32: Route int3 through common_exception

int3 is not using the common_exception path for purely historical reasons,
but there is no reason to keep it the only exception which is different.

Make it use common_exception so the upcoming changes to autogenerate the
entry stubs do not have to special case int3.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220217.042369808@linutronix.de
---
 arch/x86/entry/entry_32.S | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a8b4438..0753f48 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1683,14 +1683,8 @@ SYM_CODE_END(nmi)
 SYM_CODE_START(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
-
-	SAVE_ALL switch_stacks=1
-	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
-	xorl	%edx, %edx			# zero error code
-	movl	%esp, %eax			# pt_regs pointer
-	call	do_int3
-	jmp	ret_from_exception
+	pushl	$do_int3
+	jmp	common_exception
 SYM_CODE_END(int3)
 
 SYM_CODE_START(general_protection)
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     009cae30b6cb2e0af56c8fa44d89d11ba89fb2d1
Gitweb:        https://git.kernel.org/tip/009cae30b6cb2e0af56c8fa44d89d11ba89fb2d1
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:46 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:41 +01:00

x86/traps: Stop using ist_enter/exit() in do_int3()

#BP is no longer using IST and using ist_enter() and ist_exit() makes it
harder to change ist_enter() and ist_exit()'s behavior. Instead open-code
the very small amount of required logic.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220217.150607679@linutronix.de
---
 arch/x86/kernel/traps.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7ffb6f4..c0bc9df 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -572,14 +572,20 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		return;
 
 	/*
-	 * Use ist_enter despite the fact that we don't use an IST stack.
-	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
-	 * mode or even during context tracking state changes.
+	 * Unlike any other non-IST entry, we can be called from a kprobe in
+	 * non-CONTEXT_KERNEL kernel mode or even during context tracking
+	 * state changes. Make sure that we wake up RCU even if we're coming
+	 * from kernel code.
 	 *
-	 * This means that we can't schedule. That's okay.
+	 * This means that we can't schedule even if we came from a
+	 * preemptible kernel context. That's okay.
 	 */
-	ist_enter(regs);
+	if (!user_mode(regs)) {
+		rcu_nmi_enter();
+		preempt_disable();
+	}
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
 				SIGTRAP) == NOTIFY_STOP)
@@ -600,7 +606,10 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	cond_local_irq_disable(regs);
 
 exit:
-	ist_exit(regs);
+	if (!user_mode(regs)) {
+		preempt_enable_no_resched();
+		rcu_nmi_exit();
+	}
 }
 NOKPROBE_SYMBOL(do_int3);
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     d244d0e195bc12964bcf3b8eef45e715a7f203b0
Gitweb:        https://git.kernel.org/tip/d244d0e195bc12964bcf3b8eef45e715a7f203b0
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:41 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:40 +01:00

x86/traps: Document do_spurious_interrupt_bug()

Add a comment which explains why this empty handler for a reserved vector
exists.

Requested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.624165786@linutronix.de
---
 arch/x86/kernel/traps.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 474b8cb..7ffb6f4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -862,6 +862,25 @@ do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
 dotraplinkage void
 do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
 {
+	/*
+	 * This addresses a Pentium Pro Erratum:
+	 *
+	 * PROBLEM: If the APIC subsystem is configured in mixed mode with
+	 * Virtual Wire mode implemented through the local APIC, an
+	 * interrupt vector of 0Fh (Intel reserved encoding) may be
+	 * generated by the local APIC (Int 15). This vector may be
+	 * generated upon receipt of a spurious interrupt (an interrupt
+	 * which is removed before the system receives the INTA sequence)
+	 * instead of the programmed 8259 spurious interrupt vector.
+	 *
+	 * IMPLICATION: The spurious interrupt vector programmed in the
+	 * 8259 is normally handled by an operating system's spurious
+	 * interrupt handler. However, a vector of 0Fh is unknown to some
+	 * operating systems, which would crash if this erratum occurred.
+	 *
+	 * In theory this could be limited to 32bit, but the handler is not
+	 * hurting and who knows which other CPUs suffer from this.
+	 */
 }
 
 dotraplinkage void
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     17dbedb5da184b501c5c51fc78d1071005139e48
Gitweb:        https://git.kernel.org/tip/17dbedb5da184b501c5c51fc78d1071005139e48
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:43 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:40 +01:00

x86/irq: Remove useless return value from do_IRQ()

Nothing is using it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.826870369@linutronix.de
---
 arch/x86/include/asm/irq.h | 2 +-
 arch/x86/kernel/irq.c      | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index a176f61..72fba0e 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -36,7 +36,7 @@ extern void native_init_IRQ(void);
 
 extern void handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
-extern __visible unsigned int do_IRQ(struct pt_regs *regs);
+extern __visible void do_IRQ(struct pt_regs *regs);
 
 extern void init_ISA_irqs(void);
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 21efee3..c7965ff 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -230,7 +230,7 @@ u64 arch_irq_stat(void)
  * SMP cross-CPU interrupts have their own specific
  * handlers).
  */
-__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
+__visible void __irq_entry do_IRQ(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc * desc;
@@ -263,7 +263,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	exiting_irq();
 
 	set_irq_regs(old_regs);
-	return 1;
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     3ba4f0a633ca4bfeadb24eba19cb53ebe246eabf
Gitweb:        https://git.kernel.org/tip/3ba4f0a633ca4bfeadb24eba19cb53ebe246eabf
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:42 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:40 +01:00

x86/traps: Remove redundant declaration of do_double_fault()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.720335354@linutronix.de
---
 arch/x86/include/asm/traps.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index e1c660b..c26a7e1 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -76,13 +76,6 @@ dotraplinkage void do_coprocessor_segment_overrun(struct pt_regs *regs, long err
 dotraplinkage void do_invalid_TSS(struct pt_regs *regs, long error_code);
 dotraplinkage void do_segment_not_present(struct pt_regs *regs, long error_code);
 dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code);
-#ifdef CONFIG_X86_64
-dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsigned long address);
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
-asmlinkage __visible notrace
-struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
-void __init trap_init(void);
-#endif
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
 dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
@@ -94,6 +87,13 @@ dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
 #endif
 
 dotraplinkage void do_mce(struct pt_regs *regs, long error_code);
+#ifdef CONFIG_X86_64
+asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
+asmlinkage __visible notrace
+struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
+void __init trap_init(void);
+#endif
+
 static inline int get_si_code(unsigned long condition)
 {
 	if (condition & DR_STEP)
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     55ba18d6ed37a28cf8b8ca79e9aef4cf98183bb7
Gitweb:        https://git.kernel.org/tip/55ba18d6ed37a28cf8b8ca79e9aef4cf98183bb7
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 25 Feb 2020 22:36:38 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:39 +01:00

x86/mce: Disable tracing and kprobes on do_machine_check()

do_machine_check() can be raised in almost any context including the most
fragile ones. Prevent kprobes and tracing.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.315548935@linutronix.de
---
 arch/x86/include/asm/traps.h   |  3 ---
 arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index ffa0dc8..e1c660b 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -88,9 +88,6 @@ dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
 dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
-#ifdef CONFIG_X86_MCE
-dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code);
-#endif
 dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
 #ifdef CONFIG_X86_32
 dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949..32ecc59 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1213,8 +1213,14 @@ static void __mc_scan_banks(struct mce *m, struct mce *final,
  * On Intel systems this is entered on all CPUs in parallel through
  * MCE broadcast. However some CPUs might be broken beyond repair,
  * so be always careful when synchronizing with others.
+ *
+ * Tracing and kprobes are disabled: if we interrupted a kernel context
+ * with IF=1, we need to minimize stack usage. There are also recursion
+ * issues: if the machine check was due to a failure of the memory
+ * backing the user stack, tracing that reads the user stack will cause
+ * potentially infinite recursion.
  */
-void do_machine_check(struct pt_regs *regs, long error_code)
+void notrace do_machine_check(struct pt_regs *regs, long error_code)
 {
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
@@ -1360,6 +1366,7 @@ out_ist:
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
+NOKPROBE_SYMBOL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int flags)
@@ -1892,10 +1899,11 @@ static void unexpected_machine_check(struct pt_regs *regs, long error_code)
 void (*machine_check_vector)(struct pt_regs *, long error_code) =
 						unexpected_machine_check;
 
-dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
+dotraplinkage notrace void do_mce(struct pt_regs *regs, long error_code)
 {
 	machine_check_vector(regs, error_code);
 }
+NOKPROBE_SYMBOL(do_mce);
 
 /*
  * Called for each booted CPU to set up machine checks.
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     840371bea19e85f30d19909171248cf8c5845fd7
Gitweb:        https://git.kernel.org/tip/840371bea19e85f30d19909171248cf8c5845fd7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:39 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:39 +01:00

x86/entry/32: Force MCE through do_mce()

Remove the pointless difference between 32 and 64 bit to make further
unifications simpler.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.428188397@linutronix.de
---
 arch/x86/entry/entry_32.S          | 2 +-
 arch/x86/include/asm/mce.h         | 3 ---
 arch/x86/kernel/cpu/mce/internal.h | 3 +++
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 39243df..a8b4438 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1365,7 +1365,7 @@ SYM_CODE_END(divide_error)
 SYM_CODE_START(machine_check)
 	ASM_CLAC
 	pushl	$0
-	pushl	machine_check_vector
+	pushl	$do_mce
 	jmp	common_exception
 SYM_CODE_END(machine_check)
 #endif
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 4359b95..a7e0b9d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -238,9 +238,6 @@ extern void mce_disable_bank(int bank);
 /*
  * Exception handler
  */
-
-/* Call the installed machine check handler for this CPU setup. */
-extern void (*machine_check_vector)(struct pt_regs *, long error_code);
 void do_machine_check(struct pt_regs *, long);
 
 /*
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index b785c0d..c1cda0b 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -8,6 +8,9 @@
 #include <linux/device.h>
 #include <asm/mce.h>
 
+/* Pointer to the installed machine check handler for this CPU setup. */
+extern void (*machine_check_vector)(struct pt_regs *, long error_code);
+
 enum severity_level {
 	MCE_NO_SEVERITY,
 	MCE_DEFERRED_SEVERITY,
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     3d51507f29f2153a658df4a0674ec5b592b62085
Gitweb:        https://git.kernel.org/tip/3d51507f29f2153a658df4a0674ec5b592b62085
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 25 Feb 2020 22:36:37 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 14:48:38 +01:00

x86/entry/32: Add missing ASM_CLAC to general_protection entry

All exception entry points must have ASM_CLAC right at the beginning. The
general_protection entry is missing one.

Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de
---
 arch/x86/entry/entry_32.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7e05604..39243df 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1694,6 +1694,7 @@ SYM_CODE_START(int3)
 SYM_CODE_END(int3)
 
 SYM_CODE_START(general_protection)
+	ASM_CLAC
 	pushl	$do_general_protection
 	jmp	common_exception
 SYM_CODE_END(general_protection)
Nothing cares about the -1 "mark as interrupt" in the error code of
exception entries. It's only used to fill the error code when a signal is
delivered, but this is already inconsistent vs. 64 bit, where all
exceptions which do not have an error code set it to 0. So if 32 bit
applications cared about this, they would have noticed more than a decade
ago. Just use 0 consistently for all exceptions which do not have an
error code.

This does not break /proc/$PID/syscall either, because that interface
examines the error code / syscall number slot on the stack, and that is
set to -1 (no syscall) in common_exception unconditionally for all
exceptions. The push in the entry stub is just there to fill the hardware
error code slot on the stack for consistency of the stack layout. A
transient observation of 0 is possible, but that's true for the other
exceptions which already use 0 as well, and that interface is an
unreliable snapshot of dubious correctness anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
V2: Amend changelog. Rebased on top of tip x86/entry
---
 arch/x86/entry/entry_32.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1290,7 +1290,7 @@ SYM_CODE_END(simd_coprocessor_error)
 SYM_CODE_START(device_not_available)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_device_not_available
 	jmp	common_exception
 SYM_CODE_END(device_not_available)
@@ -1531,7 +1531,7 @@ SYM_CODE_START(debug)
 	 * Entry from sysenter is now handled in common_exception
 	 */
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_debug
 	jmp	common_exception
 SYM_CODE_END(debug)
@@ -1682,7 +1682,7 @@ SYM_CODE_END(nmi)
 SYM_CODE_START(int3)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_int3
 	jmp	common_exception
 SYM_CODE_END(int3)
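The /proc/$PID/syscall argument in the changelog above hinges on the documented format of that file: per proc(5), the first field is the syscall number, -1 when the task is blocked but not in a syscall (the value common_exception stores), or the literal string "running". A small sketch of a reader for that format; the sample lines are made up, only their shape follows the man page.

```python
def first_field_meaning(line):
    """Interpret the first field of a /proc/$PID/syscall line (see proc(5))."""
    first = line.split()[0]
    if first == "running":
        return "running, snapshot unavailable"
    if first == "-1":
        return "not in a syscall"   # the value common_exception stores
    return "in syscall " + first

# Made-up sample lines in the documented format:
assert first_field_meaning("running") == "running, snapshot unavailable"
assert first_field_meaning("-1 0x7ffc1234 0x401000") == "not in a syscall"
assert first_field_meaning(
    "0 0x3 0x2 0x1 0x0 0x0 0x0 0x7ffc1234 0x401000") == "in syscall 0"
```

The point of the changelog is that this interface only ever sees the -1 written by common_exception, never the per-stub 0 vs. -1 push, so the stubs are free to use 0 uniformly.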
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     65c668f5faebf549db086b7a6841b6f4187b4e4f
Gitweb:        https://git.kernel.org/tip/65c668f5faebf549db086b7a6841b6f4187b4e4f
Author:        Andy Lutomirski <luto@kernel.org>
AuthorDate:    Tue, 25 Feb 2020 22:36:46 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 27 Feb 2020 15:28:39 +01:00

x86/traps: Stop using ist_enter/exit() in do_int3()

#BP is no longer using IST, and using ist_enter()/ist_exit() makes it
harder to change their behavior. Instead, open-code the very small amount
of required logic.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220217.150607679@linutronix.de
---
 arch/x86/kernel/traps.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7ffb6f4..c0bc9df 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -572,14 +572,20 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		return;

 	/*
-	 * Use ist_enter despite the fact that we don't use an IST stack.
-	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
-	 * mode or even during context tracking state changes.
+	 * Unlike any other non-IST entry, we can be called from a kprobe in
+	 * non-CONTEXT_KERNEL kernel mode or even during context tracking
+	 * state changes. Make sure that we wake up RCU even if we're coming
+	 * from kernel code.
 	 *
-	 * This means that we can't schedule. That's okay.
+	 * This means that we can't schedule even if we came from a
+	 * preemptible kernel context. That's okay.
 	 */
-	ist_enter(regs);
+	if (!user_mode(regs)) {
+		rcu_nmi_enter();
+		preempt_disable();
+	}
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code,
 			X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP)
@@ -600,7 +606,10 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	cond_local_irq_disable(regs);
 exit:
-	ist_exit(regs);
+	if (!user_mode(regs)) {
+		preempt_enable_no_resched();
+		rcu_nmi_exit();
+	}
 }
 NOKPROBE_SYMBOL(do_int3);
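The open-coded bracketing the patch introduces can be sketched outside the kernel with stand-in functions that only record call order. The function names mirror the patch; the bodies and the call log are illustrative assumptions, not the real primitives.

```python
# Stand-ins for the kernel primitives named in the patch; they only
# record call order so the bracketing can be checked.
calls = []

def rcu_nmi_enter():
    calls.append("rcu_nmi_enter")

def preempt_disable():
    calls.append("preempt_disable")

def preempt_enable_no_resched():
    calls.append("preempt_enable_no_resched")

def rcu_nmi_exit():
    calls.append("rcu_nmi_exit")

def do_int3_sketch(from_user):
    """Model of the open-coded bracket: only a #BP arriving from kernel
    mode needs the RCU wakeup and preemption bracket, and the exit path
    unwinds it in reverse order."""
    if not from_user:
        rcu_nmi_enter()
        preempt_disable()
    # ... the actual #BP handling would run here ...
    if not from_user:
        preempt_enable_no_resched()
        rcu_nmi_exit()

do_int3_sketch(from_user=False)   # kernel-mode trap: full bracket
assert calls == ["rcu_nmi_enter", "preempt_disable",
                 "preempt_enable_no_resched", "rcu_nmi_exit"]

calls.clear()
do_int3_sketch(from_user=True)    # user-mode trap: no bracket needed
assert calls == []
```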
The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     e441a2ae0e9e9bb12fd3fbe2d59d923fadfe8ef7
Gitweb:        https://git.kernel.org/tip/e441a2ae0e9e9bb12fd3fbe2d59d923fadfe8ef7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 27 Feb 2020 15:24:29 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 29 Feb 2020 12:45:54 +01:00

x86/entry/32: Remove the 0/-1 distinction from exception entries

Nothing cares about the -1 "mark as interrupt" in the error code of
exception entries. It's only used to fill the error code when a signal is
delivered, but this is already inconsistent vs. 64 bit, where all
exceptions which do not have an error code set it to 0. So if 32 bit
applications cared about this, they would have noticed more than a decade
ago. Just use 0 consistently for all exceptions which do not have an
error code.

This does not break /proc/$PID/syscall either, because that interface
examines the error code / syscall number slot on the stack, and that is
set to -1 (no syscall) in common_exception unconditionally for all
exceptions. The push in the entry stub is just there to fill the hardware
error code slot on the stack for consistency of the stack layout. A
transient observation of 0 is possible, but that's true for the other
exceptions which already use 0 as well, and that interface is an
unreliable snapshot of dubious correctness anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/87mu94m7ky.fsf@nanos.tec.linutronix.de
---
 arch/x86/entry/entry_32.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0753f48..ddc87f2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1290,7 +1290,7 @@ SYM_CODE_END(simd_coprocessor_error)
 SYM_CODE_START(device_not_available)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_device_not_available
 	jmp	common_exception
 SYM_CODE_END(device_not_available)
@@ -1531,7 +1531,7 @@ SYM_CODE_START(debug)
 	 * Entry from sysenter is now handled in common_exception
 	 */
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_debug
 	jmp	common_exception
 SYM_CODE_END(debug)
@@ -1682,7 +1682,7 @@ SYM_CODE_END(nmi)
 SYM_CODE_START(int3)
 	ASM_CLAC
-	pushl	$-1				# mark this as an int
+	pushl	$0
 	pushl	$do_int3
 	jmp	common_exception
 SYM_CODE_END(int3)