* [PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack
From: Andy Lutomirski @ 2014-11-18 23:15 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

We currently run IST interrupt handlers on the IST stack, even when
they come from userspace.  Changing that may simplify a few things.
See patch 2 for details.

Patch 1 fixes a not-quite-bug in uprobes, noticed by Oleg, that
would be exposed by patch 2.

NB: Tony has seen odd behavior when stress-testing injected
machine checks with this series applied.  I suspect that
it's a bug in something else, possibly his BIOS.  Bugs in
this series shouldn't be ruled out, though.

Andy Lutomirski (3):
  uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  x86, entry: Switch stacks on a paranoid entry from userspace
  sched, x86: Check that we're on the right stack in schedule and
    __might_sleep

 Documentation/x86/entry_64.txt         | 18 ++++---
 Documentation/x86/x86_64/kernel-stacks |  8 ++--
 arch/x86/Kconfig                       |  1 +
 arch/x86/include/asm/thread_info.h     | 19 +++++++-
 arch/x86/kernel/entry_64.S             | 86 ++++++++++++++++++----------------
 arch/x86/kernel/irq_32.c               | 13 ++---
 arch/x86/kernel/traps.c                | 23 ++-------
 include/linux/thread_info.h            |  7 +++
 kernel/Kconfig.locks                   |  3 ++
 kernel/events/uprobes.c                |  1 -
 kernel/sched/core.c                    | 14 ++++--
 11 files changed, 109 insertions(+), 84 deletions(-)

-- 
1.9.3



* [PATCH v3 1/3] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
From: Andy Lutomirski @ 2014-11-18 23:15 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set
but not on non-paranoid returns.  I suspect that this is a mistake
and that the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
workaround for the x86 bug.  With that bug fixed, we can remove
_TIF_NOTIFY_RESUME from the uprobes code.
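
For context, the 64-bit return path only calls do_notify_resume() when
one of the flags in _TIF_DO_NOTIFY_MASK is set.  The dispatch looks
roughly like this (a simplified sketch of do_notify_resume() in
arch/x86/kernel/signal.c; other work flags omitted):

	/* thread_info_flags holds the pending TIF_* work bits */
	if (thread_info_flags & _TIF_UPROBE)
		uprobe_notify_resume(regs);	/* process the breakpoint hit */
	if (thread_info_flags & _TIF_SIGPENDING)
		do_signal(regs);
	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
		clear_thread_flag(TIF_NOTIFY_RESUME);
		tracehook_notify_resume(regs);
	}

Without _TIF_UPROBE in the mask, the non-paranoid return path never
reaches this code even when the flag is set.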

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/thread_info.h | 2 +-
 kernel/events/uprobes.c            | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 854053889d4d..547e344a6dc6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -141,7 +141,7 @@ struct thread_info {
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
 	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
-	 _TIF_USER_RETURN_NOTIFY)
+	 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a2c646..ed8f2cde34c5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1640,7 +1640,6 @@ bool uprobe_deny_signal(void)
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
 			utask->state = UTASK_SSTEP_TRAPPED;
 			set_tsk_thread_flag(t, TIF_UPROBE);
-			set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 		}
 	}
 
-- 
1.9.3



* [PATCH v3 2/3] x86, entry: Switch stacks on a paranoid entry from userspace
From: Andy Lutomirski @ 2014-11-18 23:15 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack.  Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, surprisingly, simpler and shorter than the current code.  It
removes the IMO rather frightening paranoid_userspace path, and it
makes sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This may enable the machine check code to be simplified at some
point, because a machine check from userspace will be able to safely
enable interrupts.  It will also allow the upcoming fsgsbase code to
be simplified, because it will no longer need to worry about usergs
when scheduling in paranoid_exit, as that code path no longer exists.
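
As a rough sketch of what the machine check tail could eventually
look like (illustration only, not part of this patch):

	void do_machine_check(struct pt_regs *regs, long error_code)
	{
		/* ... atomic part: scan the MCE banks with irqs off ... */

		if (user_mode_vm(regs)) {
			/*
			 * We interrupted userspace, so we are on the
			 * normal per-thread kernel stack and can safely
			 * enable interrupts for the recovery work.
			 */
			local_irq_enable();
			/* e.g. call memory_failure() on the bad page */
			local_irq_disable();
		}
	}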

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 Documentation/x86/entry_64.txt         | 18 ++++---
 Documentation/x86/x86_64/kernel-stacks |  8 ++--
 arch/x86/kernel/entry_64.S             | 86 ++++++++++++++++++----------------
 arch/x86/kernel/traps.c                | 23 ++-------
 4 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index bc7226ef5055..8e53217e6d40 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -75,9 +75,6 @@ The expensive (paranoid) way is to read back the MSR_GS_BASE value
 	xorl %ebx,%ebx
 1:	ret
 
-and the whole paranoid non-paranoid macro complexity is about whether
-to suffer that RDMSR cost.
-
 If we are at an interrupt or user-trap/gate-alike boundary then we can
 use the faster check: the stack will be a reliable indicator of
 whether SWAPGS was already done: if we see that we are a secondary
@@ -90,6 +87,15 @@ which might have triggered right after a normal entry wrote CS to the
 stack but before we executed SWAPGS, then the only safe way to check
 for GS is the slower method: the RDMSR.
 
-So we try only to mark those entry methods 'paranoid' that absolutely
-need the more expensive check for the GS base - and we generate all
-'normal' entry points with the regular (faster) entry macros.
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index a01eec5d1d0b..e3c8a49d1a2f 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -40,9 +40,11 @@ An IST is selected by a non-zero value in the IST field of an
 interrupt-gate descriptor.  When an interrupt occurs and the hardware
 loads such a descriptor, the hardware automatically sets the new stack
 pointer based on the IST value, then invokes the interrupt handler.  If
-software wants to allow nested IST interrupts then the handler must
-adjust the IST values on entry to and exit from the interrupt handler.
-(This is occasionally done, e.g. for debug exceptions.)
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
 
 Events with different IST codes (i.e. with different stacks) can be
 nested.  For example, a debug interrupt can safely be interrupted by an
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..de24b2eac6b2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1064,6 +1064,11 @@ ENTRY(\sym)
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 
 	.if \paranoid
+	.if \paranoid == 1
+	CFI_REMEMBER_STATE
+	testl $3, CS(%rsp)		/* If coming from userspace, switch */
+	jnz 1f				/* stacks. */
+	.endif
 	call save_paranoid
 	.else
 	call error_entry
@@ -1104,6 +1109,36 @@ ENTRY(\sym)
 	jmp error_exit			/* %ebx: no swapgs flag */
 	.endif
 
+	.if \paranoid == 1
+	CFI_RESTORE_STATE
+	/*
+	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * as a normal entry.  This means that paranoid handlers
+	 * run in real process context if user_mode(regs).
+	 */
+1:
+	call error_entry
+
+	DEFAULT_FRAME 0
+
+	movq %rsp,%rdi			/* pt_regs pointer */
+	call sync_regs
+	movq %rax,%rsp			/* switch stack */
+
+	movq %rsp,%rdi			/* pt_regs pointer */
+
+	.if \has_error_code
+	movq ORIG_RAX(%rsp),%rsi	/* get error code */
+	movq $-1,ORIG_RAX(%rsp)		/* no syscall to restart */
+	.else
+	xorl %esi,%esi			/* no error code */
+	.endif
+
+	call \do_sym
+
+	jmp error_exit			/* %ebx: no swapgs flag */
+	.endif
+
 	CFI_ENDPROC
 END(\sym)
 .endm
@@ -1124,7 +1159,7 @@ idtentry overflow do_overflow has_error_code=0
 idtentry bounds do_bounds has_error_code=0
 idtentry invalid_op do_invalid_op has_error_code=0
 idtentry device_not_available do_device_not_available has_error_code=0
-idtentry double_fault __do_double_fault has_error_code=1 paranoid=1
+idtentry double_fault __do_double_fault has_error_code=1 paranoid=2
 idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
 idtentry invalid_TSS do_invalid_TSS has_error_code=1
 idtentry segment_not_present do_segment_not_present has_error_code=1
@@ -1305,16 +1340,14 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 #endif
 
 	/*
-	 * "Paranoid" exit path from exception stack.
-	 * Paranoid because this is used by NMIs and cannot take
-	 * any kernel state for granted.
-	 * We don't do kernel preemption checks here, because only
-	 * NMI should be common and it does not enable IRQs and
-	 * cannot get reschedule ticks.
+	 * "Paranoid" exit path from exception stack.  This is invoked
+	 * only on return from non-NMI IST interrupts that came
+	 * from kernel space.
 	 *
-	 * "trace" is 0 for the NMI handler only, because irq-tracing
-	 * is fundamentally NMI-unsafe. (we cannot change the soft and
-	 * hard flags at once, atomically)
+	 * We may be returning to very strange contexts (e.g. very early
+	 * in syscall entry), so checking for preemption here would
+	 * be complicated.  Fortunately, there's no good reason
+	 * to try to handle preemption here.
 	 */
 
 	/* ebx:	no swapgs flag */
@@ -1324,43 +1357,14 @@ ENTRY(paranoid_exit)
 	TRACE_IRQS_OFF_DEBUG
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz paranoid_restore
-	testl $3,CS(%rsp)
-	jnz   paranoid_userspace
-paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
 	RESTORE_ALL 8
-	jmp irq_return
+	INTERRUPT_RETURN
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
-	jmp irq_return
-paranoid_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz paranoid_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz paranoid_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
-paranoid_schedule:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	SCHEDULE_USER
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
+	INTERRUPT_RETURN
 	CFI_ENDPROC
 END(paranoid_exit)
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922fafc1..837843513ac3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -375,27 +375,14 @@ NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
 /*
- * Help handler running on IST stack to switch back to user stack
- * for scheduling or signal handling. The actual stack switch is done in
- * entry.S
+ * Help handler running on IST stack to switch off the IST stack if the
+ * interrupted code was in user mode. The actual stack switch is done in
+ * entry_64.S
  */
 asmlinkage __visible struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = eregs;
-	/* Did already sync */
-	if (eregs == (struct pt_regs *)eregs->sp)
-		;
-	/* Exception from user space */
-	else if (user_mode(eregs))
-		regs = task_pt_regs(current);
-	/*
-	 * Exception from kernel and interrupts are enabled. Move to
-	 * kernel process stack.
-	 */
-	else if (eregs->flags & X86_EFLAGS_IF)
-		regs = (struct pt_regs *)(eregs->sp -= sizeof(struct pt_regs));
-	if (eregs != regs)
-		*regs = *eregs;
+	struct pt_regs *regs = task_pt_regs(current);
+	*regs = *eregs;
 	return regs;
 }
 NOKPROBE_SYMBOL(sync_regs);
-- 
1.9.3



* [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-18 23:15 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

On x86, sleeping while on an IST or irq stack has a surprisingly
good chance of working, but it can also fail dramatically.  Add an
arch hook to allow schedule and __might_sleep to catch sleeping on
the wrong stack.

This will also catch do_exit from a funny stack, which could leave
an IST stack shifted or an NMI nesting count incremented.
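
The x86 hook boils down to a mask comparison: two addresses are on the
same THREAD_SIZE-aligned stack iff they agree in every bit above the
stack-size mask.  Schematically (the helper name here is only for
illustration; the real check is arch_schedule_allowed() in the diff
below):

	/* true iff sp lies on the THREAD_SIZE-aligned stack ending at top */
	static inline bool same_stack(unsigned long sp, unsigned long top)
	{
		return ((sp ^ top) & ~(THREAD_SIZE - 1)) == 0;
	}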

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/thread_info.h | 17 +++++++++++++++++
 arch/x86/kernel/irq_32.c           | 13 +++----------
 include/linux/thread_info.h        |  7 +++++++
 kernel/Kconfig.locks               |  3 +++
 kernel/sched/core.c                | 14 ++++++++++----
 6 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ded8a6774ac9..a811286636d2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -137,6 +137,7 @@ config X86
 	select HAVE_ACPI_APEI_NMI if ACPI
 	select ACPI_LEGACY_TABLES_LOOKUP if ACPI
 	select X86_FEATURE_NAMES if PROC_FS
+	select HAVE_ARCH_SCHEDULE_ALLOWED
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344a6dc6..05701f132473 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -170,6 +170,23 @@ static inline struct thread_info *current_thread_info(void)
 	return ti;
 }
 
+static inline unsigned long current_stack_pointer(void)
+{
+	unsigned long sp;
+#ifdef CONFIG_X86_64
+	asm("mov %%rsp,%0" : "=g" (sp));
+#else
+	asm("mov %%esp,%0" : "=g" (sp));
+#endif
+	return sp;
+}
+
+static inline bool arch_schedule_allowed(void)
+{
+	return ((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
+		& ~(THREAD_SIZE - 1)) == 0;
+}
+
 #else /* !__ASSEMBLY__ */
 
 /* how to get the thread information struct from ASM */
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..28d28f5eb8f4 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -69,16 +69,9 @@ static void call_on_stack(void *func, void *stack)
 		     : "memory", "cc", "edx", "ecx", "eax");
 }
 
-/* how to get the current stack pointer from C */
-#define current_stack_pointer ({		\
-	unsigned long sp;			\
-	asm("mov %%esp,%0" : "=g" (sp));	\
-	sp;					\
-})
-
 static inline void *current_stack(void)
 {
-	return (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+	return (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
 }
 
 static inline int
@@ -103,7 +96,7 @@ execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)
 
 	/* Save the next esp at the bottom of the stack */
 	prev_esp = (u32 *)irqstk;
-	*prev_esp = current_stack_pointer;
+	*prev_esp = current_stack_pointer();
 
 	if (unlikely(overflow))
 		call_on_stack(print_stack_overflow, isp);
@@ -156,7 +149,7 @@ void do_softirq_own_stack(void)
 
 	/* Push the previous esp onto the stack */
 	prev_esp = (u32 *)irqstk;
-	*prev_esp = current_stack_pointer;
+	*prev_esp = current_stack_pointer();
 
 	call_on_stack(__do_softirq, isp);
 }
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ff307b548ed3..6deaf7e97009 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -145,6 +145,13 @@ static inline bool test_and_clear_restore_sigmask(void)
 #error "no set_restore_sigmask() provided and default one won't work"
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_SCHEDULE_ALLOWED
+static inline bool arch_schedule_allowed(void)
+{
+	return true;
+}
+#endif
+
 #endif	/* __KERNEL__ */
 
 #endif /* _LINUX_THREAD_INFO_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 76768ee812b2..2714dc34695a 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -237,3 +237,6 @@ config ARCH_USE_QUEUE_RWLOCK
 config QUEUE_RWLOCK
 	def_bool y if ARCH_USE_QUEUE_RWLOCK
 	depends on SMP
+
+config HAVE_ARCH_SCHEDULE_ALLOWED
+       bool
\ No newline at end of file
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c13ddc..e51ab65a9750 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2705,8 +2705,12 @@ static inline void schedule_debug(struct task_struct *prev)
 	 * Test if we are atomic. Since do_exit() needs to call into
 	 * schedule() atomically, we ignore that path. Otherwise whine
 	 * if we are scheduling when we should not.
+	 *
+	 * If architectural conditions for scheduling are not met,
+	 * complain even if we are in do_exit.
 	 */
-	if (unlikely(in_atomic_preempt_off() && prev->state != TASK_DEAD))
+	if (unlikely((in_atomic_preempt_off() && prev->state != TASK_DEAD) ||
+		     !arch_schedule_allowed()))
 		__schedule_bug(prev);
 	rcu_sleep_check();
 
@@ -7200,10 +7204,12 @@ static inline int preempt_count_equals(int preempt_offset)
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
 	static unsigned long prev_jiffy;	/* ratelimiting */
+	bool arch_ok;
 
 	rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
+	arch_ok = arch_schedule_allowed();
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && arch_ok) ||
 	    system_state != SYSTEM_RUNNING || oops_in_progress)
 		return;
 	if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
@@ -7214,8 +7220,8 @@ void __might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, arch_schedule_allowed: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), (int)arch_ok,
 			current->pid, current->comm);
 
 	debug_show_held_locks(current);
-- 
1.9.3



* RE: [PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack
From: Luck, Tony @ 2014-11-19 18:29 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen

> NB: Tony has seen odd behavior when stress-testing injected
> machine checks with this series applied.  I suspect that
> it's a bug in something else, possibly his BIOS.  Bugs in
> this series shouldn't be ruled out, though.

v3 did 3.5x better than earlier ones ... survived overnight but died at 91724
injection/consumption/recovery cycles just now. Different symptom:
instead of losing some cpus, there was a fatal machine check (PCC=1
and OVER=1 bits set in the machine check bank). This might be from a
known issue.
Not sure if this was due to some improvement in the code, or because
I changed the system configuration by pulling out all the memory except
for that on memory controller 0 on node 0. Our BIOS team had told me
they'd seen some instability in the injection code on fully populated
systems.

I did instrument the synchronization in mce_start(). I was a bit worried
that with ever-increasing numbers of cpus the 100ns delay between
pounding on atomic ops on mce_callin might not be enough. But it
seems we are not in trouble yet. The slowest synchronization recorded
took 1.8M TSC cycles; the mean is 500K cycles.  So my gut feeling that
the one second timeout was very conservative is correct.
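
(For scale: assuming a nominal ~2.5 GHz TSC, 1.8M cycles is roughly
0.7ms and 500K cycles roughly 0.2ms, more than three orders of
magnitude below the one second timeout.)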

-Tony


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-19 18:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On x86, sleeping while on an IST or irq stack has a surprisingly
> good chance of working, but it can also fail dramatically.  Add an
> arch hook to allow schedule and __might_sleep to catch sleeping on
> the wrong stack.

Why doesn't the normal in_interrupt() test catch this?

          Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-19 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Wed, Nov 19, 2014 at 10:40 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On x86, sleeping while on an IST or irq stack has a surprisingly
>> good chance of working, but it can also fail dramatically.  Add an
>> arch hook to allow schedule and __might_sleep to catch sleeping on
>> the wrong stack.
>
> Why doesn't the normal in_interrupt() test catch this?

It catches scheduling on an irq stack, assuming that all of the
irq_count stuff is working correctly.  I don't think it catches
sleeping on an IST stack.

--Andy

>
>           Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andi Kleen @ 2014-11-19 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Wed, Nov 19, 2014 at 10:40:10AM -0800, Linus Torvalds wrote:
> On Tue, Nov 18, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On x86, sleeping while on an IST or irq stack has a surprisingly
> > good chance of working, but it can also fail dramatically.  Add an
> > arch hook to allow schedule and __might_sleep to catch sleeping on
> > the wrong stack.
> 
> Why doesn't the normal in_interrupt() test catch this?

The exception handlers which use the IST stacks don't necessarily 
set irq count. Maybe they should.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-19 19:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> The exception handlers which use the IST stacks don't necessarily
> set irq count. Maybe they should.

Hmm. I think they should. Since they clearly must not schedule, as
they use a percpu stack.

Which exceptions use IST?

[ grep grep ]

Looks like stack, doublefault, nmi, debug and mce. And yes, I really
think they should all raise the irq count if they don't already.
Rather than add random arch-specific "let's check that we're on the
right stack" code to the might-sleep stuff, just use the one we have.

                    Linus


* [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
From: Luck, Tony @ 2014-11-19 22:15 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds, Andy Lutomirski
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen

We now switch to the kernel stack when a machine check interrupts
user mode.  This means that we can perform recovery actions in the
tail of do_machine_check().  So say goodbye to TIF_MCE_NOTIFY,
mce_save_info(), mce_find_info() and mce_notify_process().

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Obviously this is dependent on Andy's patch series in
  git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git x86/paranoid

 arch/x86/include/asm/mce.h         |   1 -
 arch/x86/include/asm/thread_info.h |   4 +-
 arch/x86/kernel/cpu/mcheck/mce.c   | 106 ++++++++-----------------------------
 arch/x86/kernel/signal.c           |   6 ---
 4 files changed, 23 insertions(+), 94 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f761e5..be5a51a4a219 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -187,7 +187,6 @@ enum mcp_flags {
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
-void mce_notify_process(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 05701f132473..55a63f114a74 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -75,7 +75,6 @@ struct thread_info {
 #define TIF_SYSCALL_EMU		6	/* syscall emulation active */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
-#define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
@@ -100,7 +99,6 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
@@ -140,7 +138,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |				\
 	 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668cebfd..66efc536f81d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -956,51 +956,6 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
- * Need to save faulting physical address associated with a process
- * in the machine check handler some place where we can grab it back
- * later in mce_notify_process()
- */
-#define	MCE_INFO_MAX	16
-
-struct mce_info {
-	atomic_t		inuse;
-	struct task_struct	*t;
-	__u64			paddr;
-	int			restartable;
-} mce_info[MCE_INFO_MAX];
-
-static void mce_save_info(__u64 addr, int c)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
-		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
-			mi->t = current;
-			mi->paddr = addr;
-			mi->restartable = c;
-			return;
-		}
-	}
-
-	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
-}
-
-static struct mce_info *mce_find_info(void)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
-		if (atomic_read(&mi->inuse) && mi->t == current)
-			return mi;
-	return NULL;
-}
-
-static void mce_clear_info(struct mce_info *mi)
-{
-	atomic_set(&mi->inuse, 0);
-}
-
-/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -1037,6 +992,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	char *msg = "Unknown";
+	u64 recover_paddr = ~0ull;
+	int flags = MF_ACTION_REQUIRED;
 
 	this_cpu_inc(mce_exception_count);
 
@@ -1155,9 +1112,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (no_way_out)
 			mce_panic("Fatal machine check on current CPU", &m, msg);
 		if (worst == MCE_AR_SEVERITY) {
-			/* schedule action before return to userland */
-			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
-			set_thread_flag(TIF_MCE_NOTIFY);
+			recover_paddr = m.addr;
+			if (!(m.mcgstatus & MCG_STATUS_RIPV))
+				flags |= MF_MUST_KILL;
 		} else if (kill_it) {
 			force_sig(SIGBUS, current);
 		}
@@ -1168,6 +1125,23 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	sync_core();
+
+	if (recover_paddr == ~0ull)
+		return;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx",
+		 recover_paddr);
+	/*
+	 * We must call memory_failure() here even if the current process is
+	 * doomed. We still need to mark the page as poisoned and alert any
+	 * other users of the page.
+	 */
+	local_irq_enable();
+	if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
+		pr_err("Memory error not recovered");
+		force_sig(SIGBUS, current);
+	}
+	local_irq_disable();
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
@@ -1185,42 +1159,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Called in process context that interrupted by MCE and marked with
- * TIF_MCE_NOTIFY, just before returning to erroneous userland.
- * This code is allowed to sleep.
- * Attempt possible recovery such as calling the high level VM handler to
- * process any corrupted pages, and kill/signal current process if required.
- * Action required errors are handled here.
- */
-void mce_notify_process(void)
-{
-	unsigned long pfn;
-	struct mce_info *mi = mce_find_info();
-	int flags = MF_ACTION_REQUIRED;
-
-	if (!mi)
-		mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
-	pfn = mi->paddr >> PAGE_SHIFT;
-
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
-	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 mi->paddr);
-	/*
-	 * We must call memory_failure() here even if the current process is
-	 * doomed. We still need to mark the page as poisoned and alert any
-	 * other users of the page.
-	 */
-	if (!mi->restartable)
-		flags |= MF_MUST_KILL;
-	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
-		pr_err("Memory error not recovered");
-		force_sig(SIGBUS, current);
-	}
-	mce_clear_info(mi);
-}
-
-/*
  * Action optional processing happens here (picking up
  * from the list of faulting pages that do_machine_check()
  * placed into the "ring").
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ed37a768d0fc..2a33c8f68319 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -740,12 +740,6 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
 	user_exit();
 
-#ifdef CONFIG_X86_MCE
-	/* notify userspace of pending MCEs */
-	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
 	if (thread_info_flags & _TIF_UPROBE)
 		uprobe_notify_resume(regs);
 
-- 
2.1.0



* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-19 23:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 11:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> The exception handlers which use the IST stacks don't necessarily
>> set irq count. Maybe they should.
>
> Hmm. I think they should. Since they clearly must not schedule, as
> they use a percpu stack.
>
> Which exceptions use IST?
>
> [ grep grep ]
>
> Looks like stack, doublefault, nmi, debug and mce. And yes, I really
> think they should all raise the irq count if they don't already.
> Rather than add random arch-specific "let's check that we're on the
> right stack" code to the might-sleep stuff, just use the one we have.
>

Does that include nmi?  I'm a bit afraid of touching that code.

It's certainly easy enough to bump irq_count in the paranoid entries.

--Andy

>                     Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-19 23:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Does that include nmi?  I'm a bit afraid of touching that code.

NMI is kind of special, since it's really not supposed to touch
'current' or anything like that, and that's how we do preempt-count
(and that's where irq-count is) right now.

I would prefer to have preempt_count be a percpu variable rather than
a per-thread one, but there are historical reasons for that horror. Oh
well.

> It's certainly easy enough to bump irq_count in the paranoid entries.

It definitely shouldn't be done from the assembly code. Usually it's
"irq_enter/exit()" that does it, but for NMI you'd presumably just do
it explicitly from do_nmi() or similar. But NMI really is nastier than
other cases, see above.

(That said, I thought we did it anyway, but now that I look I can't
find it. So I was probably just smoking something)

             Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Thomas Gleixner @ 2014-11-19 23:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andi Kleen, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Oleg Nesterov, Tony Luck

On Wed, 19 Nov 2014, Linus Torvalds wrote:

> On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Does that include nmi?  I'm a bit afraid of touching that code.
> 
> NMI is kind of special, since it's really not supposed to touch
> 'current' or anything like that, and that's how we do preempt-count
> (and that's where irq-count is) right now.
> 
> I would prefer to have preempt_count be a percpu variable rather than
> a per-thread one, but there are historical reasons for that horror. Oh
> well.

preempt_count is a per cpu variable at least on x86 since:

  commit c2daa3bed53a 'sched, x86: Provide a per-cpu preempt_count
  implementation'
    
> > It's certainly easy enough to bump irq_count in the paranoid entries.
> 
> It definitely shouldn't be done from the assembly code. Usually it's
> "irq_enter/exit()" that does it, but for NMI you'd presumably just do
> it explicitly from do_nmi() or similar. But NMI really is nastier than
> other cases, see above.

do_nmi()
  nmi_enter()
    preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); 

> (That said, I thought we did it anyway, but now that I look I can't
> find it. So I was probably just smoking something)

Certainly something you should not smoke again :)

Thanks,

	tglx


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-19 23:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Andi Kleen, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Oleg Nesterov, Tony Luck

On Wed, Nov 19, 2014 at 3:32 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> preempt_count is a per cpu variable at least on x86 since:

Oh, goodie. I hadn't tracked that it went in, I just remembered the
discussion. I should have looked.

And apparently we do it in nmi_enter too, so NMI is already ok. Don't
know about the others.

                Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-19 23:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 3:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 3:04 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Does that include nmi?  I'm a bit afraid of touching that code.
>
> NMI is kind of special, since it's really not supposed to touch
> 'current' or anything like that, and that's how we do preempt-count
> (and that's where irq-count is) right now.
>
> I would prefer to have preempt_count be a percpu variable rather than
> a per-thread one, but there are historical reasons for that horror. Oh
> well.
>
>> It's certainly easy enough to bump irq_count in the paranoid entries.
>
> It definitely shouldn't be done from the assembly code. Usually it's
> "irq_enter/exit()" that does it, but for NMI you'd presumably just do
> it explicitly from do_nmi() or similar. But NMI really is nastier than
> other cases, see above.

My only real objection is that it's going to be ugly and error prone.
It'll have to be something like:

if (!user_mode_vm(regs))
    irq_enter();

...

if (!user_mode_vm(regs))
    irq_exit();

because the whole point of this series is to make the IST entries not
be atomic when they come from userspace.  We can wrap this in an
inline cond_ist_enter/cond_ist_exit or whatever, but still, blech.
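
Spelled out, the wrapper would be something along these lines
(hypothetical names, just to show the shape):

    static inline void cond_ist_enter(struct pt_regs *regs)
    {
            /* Only count as atomic if we interrupted the kernel. */
            if (!user_mode_vm(regs))
                    irq_enter();
    }

    static inline void cond_ist_exit(struct pt_regs *regs)
    {
            if (!user_mode_vm(regs))
                    irq_exit();
    }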

Tony already sent the follow-up patch that will make do_machine_check
enable interrupts if it came from userspace.

--Andy

>
> (That said, I thought we did it anyway, but now that I look I can't
> find it. So I was probably just smoking something)
>
>              Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-19 23:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 3:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> My only real objection is that it's going to be ugly and error prone.
> It'll have to be something like:

No.

> because the whole point of this series is to make the IST entries not
> be atomic when they come from userspace.

Andy,  you need to lay off the drugs.

NMI absolutely *has* to be atomic. The whole "oh, there's a per-core
NMI flag and it disables all other NMI's and interrupts" kind of
enforces that.

Trust me. Talking about being able to preempt the NMI handler is just
crazy talk.

                    Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-20  0:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 3:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 3:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> My only real objection is that it's going to be ugly and error prone.
>> It'll have to be something like:
>
> No.
>
>> because the whole point of this series is to make the IST entries not
>> be atomic when they come from userspace.
>
> Andy,  you need to lay off the drugs.
>

No drugs, just imprecision.  This series doesn't change NMI handling
at all.  It only changes machine_check, int3, debug, and stack_segment.
(Why is #SS using IST stacks anyway?)

So my point stands: if machine_check is going to be conditionally
atomic, then that condition needs to be expressed somewhere.  And
machine_check really does need to sleep, and it does so today.  It's
just that, in current kernels, it pretends to be atomic, but then it
fakes everyone out before returning (look for sync_regs in entry_64.S)
and becomes non-atomic part-way through, mediated by a special TIF
flag.  I think that the current conditional sync_regs thing is even
crazier than switching stacks depending on regs->cs, so I'm trying to
reduce craziness.

With this series applied, machine_check is honest: if it came from
user space, it doesn't even pretend to be atomic.

And I'm not even making these things preemptible by default.  They
still run with irqs off.  They're just allowed to turn irqs on *if
user_mode_vm(regs)* if they want.

> NMI absolutely *has* to be atomic. The whole "oh, there's a per-core
> NMI flag and it disables all other NMI's and interrupts" kind of
> enforces that.
>
> Trust me. Talking about being able to preempt the NMI handler is just
> crazy talk.

I engage in crazy talk all the time, but I'm not *that* crazy.

--Andy


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-20  0:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 4:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> No drugs, just imprecision.  This series doesn't change NMI handling
> at all.  It only changes machine_check, int3, debug, and stack_segment.
> (Why is #SS using IST stacks anyway?)

.. ok, we were talking about adding an explicit preemption count to
nmi, and then you wanted to make that conditional, which kind of
freaked me out.

> So my point stands: if machine_check is going to be conditionally
> atomic, then that condition needs to be expressed somewhere.

I'd still prefer to keep that knowledge in one place, rather than
adding *another* completely ad-hoc thing in addition to what we
already have.

Also, I really don't think it should be about the particular stack
you're using. Sure, if a debug fault happens in user space, the fault
handler could sleep if it runs on the regular stack, but our
"might_sleep()" are about catching things that *could* be problematic,
even if the sleep never happens. And so, might_sleep() _should_
actually trigger, even if it's not using the IST stack, because *if*
the debug exception happened in kernel space, then we should warn.

So I'd actually *prefer* to have special hacks that perhaps then
"undo" the preemption count if the code expressly tests for "did this
happen in user space, then I know I'm safe". But then it's an
*explicit* thing, not something that just magically works because
nobody even thought about it, and the trap happened in user space.

See the argument? I'd *rather* see code like

   /* Magic */
   if (user_mode(regs)) {
       .. verify that we're using the normal kernel stack
       .. enable interrupts, enable preemption
       .. this is the explicit special case and it is aware
       .. of being special
   }

even if on the face of it it looks hacky. But an *explicit* hack is
preferable to something that just "happens" to work only for the
user-mode case.

                   Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-20  0:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 4:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> No drugs, just imprecision.  This series doesn't change NMI handling
>> at all.  It only changes machine_check, int3, debug, and stack_segment.
>> (Why is #SS using IST stacks anyway?)
>
> .. ok, we were talking about adding an explicit preemption count to
> nmi, and then you wanted to make that conditional, that kind of
> freaked me out.

I guess I jumped around in the conversation a bit...

>
>> So my point stands: if machine_check is going to be conditionally
>> atomic, then that condition needs to be expressed somewhere.
>
> I'd still prefer to keep that knowledge in one place, rather than
> adding *another* completely ad-hoc thing in addition to what we
> already have.
>
> Also, I really don't think it should be about the particular stack
> you're using. Sure, if a debug fault happens in user space, the fault
> handler could sleep if it runs on the regular stack, but our
> "might_sleep()" are about catching things that *could* be problematic,
> even if the sleep never happens. And so, might_sleep() _should_
> actually trigger, even if it's not using the IST stack, because *if*
> the debug exception happened in kernel space, then we should warn.
>
> So I'd actually *prefer* to have special hacks that perhaps then
> "undo" the preemption count if the code expressly tests for "did this
> happen in user space, then I know I'm safe". But then it's an
> *explicit* thing, not something that just magically works because
> nobody even thought about it, and the trap happened in user space.
>
> See the argument? I'd *rather* see code like
>
>    /* Magic */
>    if (user_mode(regs)) {
>        .. verify that we're using the normal kernel stack
>        .. enable interrupts, enable preemption
>        .. this is the explicit special case and it is aware
>        .. of being special
>    }
>
> even if on the face of it it looks hacky. But an *explicit* hack is
> preferable to something that just "happens" to work only for the
> user-mode case.

So we'd do, in do_machine_check:

irq_enter();

do atomic stuff;

ist_stop_being_atomic(regs);
local_irq_enable();
...
local_irq_disable();
ist_start_being_atomic_again();

irq_exit();

and we'd have something like:

void ist_stop_being_atomic(struct pt_regs *regs)
{
  BUG_ON(!user_mode_vm(regs));
  --irq_count;
}

I'm very hesitant to use irq_enter for this, though.  I think we want
just the irq_count part.  Maybe ist_enter() and ist_exit()?  I think
that we really don't want to go anywhere near the accounting stuff in
irq_enter from an IST handler if !user_mode_vm(regs).  Doing it from
asm is somewhat less error prone, although I guess we already rely on
the IDT entries themselves being in sync with the paranoid idtentry
setting.
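
Something like this, perhaps (hypothetical sketch, just the count
manipulation without the irq_enter() accounting):

    static inline void ist_enter(struct pt_regs *regs)
    {
            /* An entry from userspace stays normal process context. */
            if (!user_mode_vm(regs))
                    preempt_count_add(HARDIRQ_OFFSET);
    }

    static inline void ist_exit(struct pt_regs *regs)
    {
            if (!user_mode_vm(regs))
                    preempt_count_sub(HARDIRQ_OFFSET);
    }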

--Andy


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Linus Torvalds @ 2014-11-20  1:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 4:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I'm very hesitant to use irq_enter for this, though.  I think we want
> just the irq_count part.

Oh, I agree. I mentioned irq_enter/exit() just as the place the
preempt count increment normally happens. I have nothing against doing
it manually for these things that really aren't normal interrupts.

                  Linus


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Andy Lutomirski @ 2014-11-20  1:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 5:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 4:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I'm very hesitant to use irq_enter for this, though.  I think we want
>> just the irq_count part.
>
> Oh, I agree. I mentioned irq_enter/exit() just as the place the
> preempt count increment normally happens. I have nothing against doing
> it manually for these things that really aren't normal interrupts.
>

It may need to be considerably more complicated to keep RCU happy,
though.  In fact, I think that this code is already wrong.

I'll write something and cc Paul to get some feedback.

--Andy


* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
From: Ingo Molnar @ 2014-11-20  7:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andi Kleen, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Peter Zijlstra, Oleg Nesterov, Tony Luck


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Nov 19, 2014 at 3:59 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wed, Nov 19, 2014 at 3:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >> My only real objection is that it's going to be ugly and 
> >> error prone. It'll have to be something like:
> >
> > No.
> >
> >> because the whole point of this series is to make the IST 
> >> entries not be atomic when they come from userspace.
> >
> > Andy, you need to lay off the drugs.
> >
> 
> No drugs, just imprecision.  This series doesn't change NMI 
> handling at all.  It only changes machine_check, int3, debug,
> and stack_segment. (Why is #SS using IST stacks anyway?)

We made most of those preemptible in -rt and changed it away from 
the IST. I never got a good explanation from anyone for why they 
were IST in the first place - historical accident or some such.

Feel free to clean this up too!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2014-11-20  0:46                     ` Andy Lutomirski
  2014-11-20  1:09                       ` Linus Torvalds
@ 2014-11-20 10:28                       ` Borislav Petkov
  2014-11-20 23:25                         ` Andy Lutomirski
  1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2014-11-20 10:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andi Kleen, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 04:46:29PM -0800, Andy Lutomirski wrote:
> So we'd do, in do_machine_check:
> 
> irq_enter();
> 
> do atomic stuff;
> 
> ist_stop_being_atomic(regs);
> local_irq_enable();
> ...
> local_irq_disable();
> ist_start_being_atomic_again();

Well, why would I want to go atomic again? We just do the minimally
needed atomic stuff, irq_exit() and then do the rest.

> 
> irq_exit();

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2014-11-20 10:28                       ` Borislav Petkov
@ 2014-11-20 23:25                         ` Andy Lutomirski
  0 siblings, 0 replies; 29+ messages in thread
From: Andy Lutomirski @ 2014-11-20 23:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Andi Kleen, Tony Luck, Peter Zijlstra,
	Linus Torvalds, the arch/x86 maintainers, Oleg Nesterov

On Nov 20, 2014 2:28 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Wed, Nov 19, 2014 at 04:46:29PM -0800, Andy Lutomirski wrote:
> > So we'd do, in do_machine_check:
> >
> > irq_enter();
> >
> > do atomic stuff;
> >
> > ist_stop_being_atomic(regs);
> > local_irq_enable();
> > ...
> > local_irq_disable();
> > ist_start_being_atomic_again();
>
> Well, why would I want to go atomic again? We just do the minimally
> needed atomic stuff, irq_exit() and then do the rest.

Because ist_exit will get confused otherwise, and you still need to
call something on the way out for context tracking.

Yes, I could rearrange it a bit.  Will ponder.

--Andy

>
> >
> > irq_exit();
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

^ permalink raw reply	[flat|nested] 29+ messages in thread
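
(Putting this exchange together, the machine-check flow being debated looks roughly like the sketch below. The ist_* names follow the helpers that later landed in commit 959274753857 ("x86, traps: Track entry into and exit from IST context"); the ordering and the work done in each phase are illustrative only.)

	void do_machine_check(struct pt_regs *regs, long error_code)
	{
		ist_enter(regs);	/* atomic, hardirq-like context */

		/* minimal atomic work: read MCE banks, decide severity */

		if (user_mode(regs)) {
			/* only legal when we interrupted userspace */
			ist_begin_non_atomic(regs);
			local_irq_enable();

			/* non-atomic work, e.g. deliver SIGBUS */

			local_irq_disable();
			ist_end_non_atomic();
		}

		ist_exit(regs);		/* pairs with ist_enter() */
	}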

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2014-11-19 19:44       ` Linus Torvalds
  2014-11-19 23:04         ` Andy Lutomirski
@ 2016-02-29  5:27         ` Andy Lutomirski
  2016-05-24  1:23           ` Andy Lutomirski
  1 sibling, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-02-29  5:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Wed, Nov 19, 2014 at 11:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> The exception handlers which use the IST stacks don't necessarily
>> set irq count. Maybe they should.
>
> Hmm. I think they should. Since they clearly must not schedule, as
> they use a percpu stack.
>
> Which exceptions use IST?
>
> [ grep grep ]
>
> Looks like stack, doublefault, nmi, debug and mce. And yes, I really
> think they should all raise the irq count if they don't already.
> Rather than add random arch-specific "let's check that we're on the
> right stack" code to the might-sleep stuff, just use the one we have.
>

Resurrecting an old thread:

The outcome of this discussion was that ist_enter now raises
HARDIRQ_COUNT.  I think this is causing a problem.  If a user program
enables TF, it generates a bunch of debug exceptions.  The handlers
raise the IRQ count and do stuff, and apparently some of that stuff
can raise a softirq.  (I have no idea where the softirq is being
raised.)  The softirq code notices that we're in_interrupt and doesn't
wake ksoftirqd because it thinks we're about to exit the interrupt and
process the softirq.  But we don't, which causes occasional warnings
and confuses things (and me!).

So how do we fix it?  If we stop raising HARDIRQ_COUNT (and apply
$SUBJECT?), then raise_softirq will wake ksoftirqd and life is good.
But this seems a bit silly, since, if we entered the ist exception
handler from a context with irqs on and softirqs enabled, we *could*
plausibly handle the softirq right away -- we're on an essentially
empty stack.  (Of course, it's a *small* stack, since it could be the
IST stack.)

Or we could just let ksoftirqd do its thing and stop raising
HARDIRQ_COUNT.  We could add a new preempt count field just for IST
(yuck).  We could try to hijack a different preempt count field
(NMI?).  But I kind of like the idea of just reinstating the original
patch of explicitly checking that we're on a safe stack in schedule
and __might_sleep, since that is the actual condition we care about.

--Andy

^ permalink raw reply	[flat|nested] 29+ messages in thread
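
(The softirq behavior described above comes from the wakeup logic in kernel/softirq.c, which is essentially the following; the comment spells out the assumption an IST handler breaks.)

	inline void raise_softirq_irqoff(unsigned int nr)
	{
		__raise_softirq_irqoff(nr);

		/*
		 * If we're in an interrupt, irq_exit() will run the
		 * softirq on the way out, so don't wake ksoftirqd.  An
		 * IST handler that makes in_interrupt() return true but
		 * never calls irq_exit() breaks exactly this assumption.
		 */
		if (!in_interrupt())
			wakeup_softirqd();
	}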

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2016-02-29  5:27         ` Andy Lutomirski
@ 2016-05-24  1:23           ` Andy Lutomirski
  2016-05-24  1:48             ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-05-24  1:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Sun, Feb 28, 2016 at 9:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Nov 19, 2014 at 11:44 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>>
>>> The exception handlers which use the IST stacks don't necessarily
>>> set irq count. Maybe they should.
>>
>> Hmm. I think they should. Since they clearly must not schedule, as
>> they use a percpu stack.
>>
>> Which exceptions use IST?
>>
>> [ grep grep ]
>>
>> Looks like stack, doublefault, nmi, debug and mce. And yes, I really
>> think they should all raise the irq count if they don't already.
>> Rather than add random arch-specific "let's check that we're on the
>> right stack" code to the might-sleep stuff, just use the one we have.
>>
>
> Resurrecting an old thread:
>
> The outcome of this discussion was that ist_enter now raises
> HARDIRQ_COUNT.  I think this is causing a problem.  If a user program
> enables TF, it generates a bunch of debug exceptions.  The handlers
> raise the IRQ count and do stuff, and apparently some of that stuff
> can raise a softirq.  (I have no idea where the softirq is being
> raised.)  The softirq code notices that we're in_interrupt and doesn't
> wake ksoftirqd because it thinks we're about to exit the interrupt and
> process the softirq.  But we don't, which causes occasional warnings
> and confuses things (and me!).
>
> So how do we fix it?  If we stop raising HARDIRQ_COUNT (and apply
> $SUBJECT?), then raise_softirq will wake ksoftirqd and life is good.
> But this seems a bit silly, since, if we entered the ist exception
> handler from a context with irqs on and softirqs enabled, we *could*
> plausibly handle the softirq right away -- we're on an essentially
> empty stack.  (Of course, it's a *small* stack, since it could be the
> IST stack.)
>
> Or we could just let ksoftirqd do its thing and stop raising
> HARDIRQ_COUNT.  We could add a new preempt count field just for IST
> (yuck).  We could try to hijack a different preempt count field
> (NMI?).  But I kind of like the idea of just reinstating the original
> patch of explicitly checking that we're on a safe stack in schedule
> and __might_sleep, since that is the actual condition we care about.

Ping?  I can still trigger this fairly easily on 4.6.

--Andy

^ permalink raw reply	[flat|nested] 29+ messages in thread
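
(The "safe stack" check that the original patch proposed, reduced to a sketch. The helper name is hypothetical, and __builtin_frame_address(0) stands in for reading the stack pointer; task_stack_page() and THREAD_SIZE are the real symbols.)

	static bool on_task_stack(void)		/* hypothetical */
	{
		unsigned long sp = (unsigned long)__builtin_frame_address(0);
		unsigned long stack = (unsigned long)task_stack_page(current);

		return sp >= stack && sp < stack + THREAD_SIZE;
	}

	/* schedule() and __might_sleep() could then WARN on !on_task_stack(). */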

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2016-05-24  1:23           ` Andy Lutomirski
@ 2016-05-24  1:48             ` Linus Torvalds
  2016-05-24  2:09               ` Andy Lutomirski
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-05-24  1:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Mon, May 23, 2016 at 6:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Or we could just let ksoftirqd do its thing and stop raising
>> HARDIRQ_COUNT.  We could add a new preempt count field just for IST
>> (yuck).  We could try to hijack a different preempt count field
>> (NMI?).  But I kind of like the idea of just reinstating the original
>> patch of explicitly checking that we're on a safe stack in schedule
>> and __might_sleep, since that is the actual condition we care about.
>
> Ping?  I can still trigger this fairly easily on 4.6.

.. I haven't seen a patch from you, last I saw that was kind of what I expected.

That said, I still despise your patch. Why can't you just fix
"in_interrupt()" and be done with it? The original patch was like 50
lines of changes for something that feels like it should be a
one-liner.

And no, we don't add idiotic new config symbols for things like "I
have this one-liner trivial arch helper". What we do is to just test
for such a helper with "#ifdef" (and if it's an inline function we do
"#define xyz xyz" so that the #ifdef works).

So the original patch in this thread is still off the table,
especially since there was absolutely no explanation for why it should
be such a crazy complicated thing.

What exactly is it you are nervous about scheduling in NMI's? I agree
that that would be disastrous, but it's not supposed to actually
happen.

              Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2016-05-24  1:48             ` Linus Torvalds
@ 2016-05-24  2:09               ` Andy Lutomirski
  2016-05-24  2:16                 ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2016-05-24  2:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Mon, May 23, 2016 at 6:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 23, 2016 at 6:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> Or we could just let ksoftirqd do its thing and stop raising
>>> HARDIRQ_COUNT.  We could add a new preempt count field just for IST
>>> (yuck).  We could try to hijack a different preempt count field
>>> (NMI?).  But I kind of like the idea of just reinstating the original
>>> patch of explicitly checking that we're on a safe stack in schedule
>>> and __might_sleep, since that is the actual condition we care about.
>>
>> Ping?  I can still trigger this fairly easily on 4.6.
>
> .. I haven't seen a patch from you, last I saw that was kind of what I expected.
>
> That said, I still despise your patch. Why can't you just fix
> "in_interrupt()" and be done with it? The original patch was like 50
> lines of changes for something that feels like it should be a
> one-liner.
>
> And no, we don't add idiotic new config symbols for things like "I
> have this one-liner trivial arch helper". What we do is to just test
> for such a helper with "#ifdef" (and if it's an inline function we do
> "#define xyz xyz" so that the #ifdef works).
>
> So the original patch in this thread is still off the table,
> especially since there was absolutely no explanation for why it should
> be such a crazy complicated thing.
>
> What exactly is it you are nervous about scheduling in NMI's? I agree
> that that would be disastrous, but it's not supposed to actually
> happen.

It's not the NMIs I'm worried about -- they do crazy stuff, but it's
self-contained crazy stuff, and NMIs can *never* schedule, so they're
unlikely to have issues here.  It's the things that are a bit more
ambiguous.  For example, MCE handlers sometimes do schedule, and we
allow it if they use the right helpers and check the right conditions.
The original patch lets us print a big fat warning if they screw it
up.

I can't modify in_interrupt for the same reason that the current code
is broken: if in_interrupt() returns true, that's a promise that we'll
call invoke_softirq via irq_exit in a timely manner.  Doing this from
a weird ultra-atomic context that interrupted kernel code that had
IF=0 would be bad.  IOW, the whole in_interrupt() mechanism was
designed under the entirely reasonable assumption that interrupts only
happen when interrupts are on.  These special interrupt-like things
(NMI, MCE, debug) can happen asynchronously with interrupts *off*, and
the result is a mess.

What about this silly fix?  (Pardon the probable whitespace damage.)

commit b3e89c652bdf6a7d5a23b094ae921f193e62c534
Author: Andy Lutomirski <luto@kernel.org>
Date:   Mon May 23 19:07:21 2016 -0700

    x86/traps: Don't force in_interrupt() to return true in IST handlers

    Forcing in_interrupt() to return true if we're not in a bona fide
    interrupt confuses the softirq code.

    Cc: stable@vger.kernel.org
    Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
    Signed-off-by: Andy Lutomirski <luto@kernel.org>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d1590486204a..9c1d948d5d8b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -96,6 +96,19 @@ static inline void cond_local_irq_disable(struct pt_regs *regs)
         local_irq_disable();
 }

+/*
+ * We want to cause in_atomic() to return true while in an IST handler
+ * so that attempts to schedule will warn.  We also want to detect buggy
+ * code that does preempt_enable(); preempt_disable() in an IST handler,
+ * so we want in_atomic() to still return true if that happens.
+ *
+ * We cannot use HARDIRQ_OFFSET or otherwise cause in_interrupt() to
+ * return true: the softirq code assumes that in_interrupt() only
+ * returns true if we will soon execute softirqs, and we can't do that
+ * if an IST entry interrupts kernel code with interrupts disabled.
+ */
+#define IST_OFFSET (3 * PREEMPT_OFFSET)
+
 void ist_enter(struct pt_regs *regs)
 {
     if (user_mode(regs)) {
@@ -116,7 +129,7 @@ void ist_enter(struct pt_regs *regs)
      * on x86_64 and entered from user mode, in which case we're
      * still atomic unless ist_begin_non_atomic is called.
      */
-    preempt_count_add(HARDIRQ_OFFSET);
+    preempt_count_add(IST_OFFSET);

     /* This code is a bit fragile.  Test it. */
     RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
@@ -124,7 +137,7 @@ void ist_enter(struct pt_regs *regs)

 void ist_exit(struct pt_regs *regs)
 {
-    preempt_count_sub(HARDIRQ_OFFSET);
+    preempt_count_sub(IST_OFFSET);

     if (!user_mode(regs))
         rcu_nmi_exit();

^ permalink raw reply related	[flat|nested] 29+ messages in thread
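
(Why 3 * PREEMPT_OFFSET behaves as the patch intends: preempt_count packs several fields into one word, with the low byte counting preempt_disable() nesting and the softirq/hardirq/NMI counts above it. Simplified from include/linux/preempt.h of that era, the two predicates are roughly:)

	#define in_interrupt()	(preempt_count() & \
				 (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
	#define in_atomic()	(preempt_count() != 0)

	/*
	 * Adding 3 * PREEMPT_OFFSET raises only the low preemption byte:
	 * in_atomic() becomes true, in_interrupt() stays false, and one
	 * stray preempt_enable()/preempt_disable() pair cannot drop the
	 * count back to zero.
	 */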

* Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
  2016-05-24  2:09               ` Andy Lutomirski
@ 2016-05-24  2:16                 ` Linus Torvalds
  0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-05-24  2:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List, Peter Zijlstra, Oleg Nesterov,
	Tony Luck

On Mon, May 23, 2016 at 7:09 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> What about this silly fix?  (Pardon the probable whitespace damage.)

That looks fine to me, and has a reason for it.

That said, I'm not convinced about the preempt_enable/preempt_disable
excuse: that would be horribly buggy anyway, and has nothing to do
with IST's. Why protect against random insane bugs?

So the "3*" certainly doesn't hurt, but and I won't argue more against
it, but it looks pretty magical and made-up. Might as well just
add/subtract one. You can never protect against people who
intentionally write buggy code, and if it's protecting against
unintentional buggy code I don't see why the IST case is so special
when it's just a tiny fraction of code..

              Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-05-24  2:16 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-18 23:15 [PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack Andy Lutomirski
2014-11-18 23:15 ` [PATCH v3 1/3] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
2014-11-18 23:15 ` [PATCH v3 2/3] x86, entry: Switch stacks on a paranoid entry from userspace Andy Lutomirski
2014-11-18 23:15 ` [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep Andy Lutomirski
2014-11-19 18:40   ` Linus Torvalds
2014-11-19 19:23     ` Andy Lutomirski
2014-11-19 19:29     ` Andi Kleen
2014-11-19 19:44       ` Linus Torvalds
2014-11-19 23:04         ` Andy Lutomirski
2014-11-19 23:23           ` Linus Torvalds
2014-11-19 23:32             ` Thomas Gleixner
2014-11-19 23:42               ` Linus Torvalds
2014-11-19 23:49             ` Andy Lutomirski
2014-11-19 23:59               ` Linus Torvalds
2014-11-20  0:13                 ` Andy Lutomirski
2014-11-20  0:37                   ` Linus Torvalds
2014-11-20  0:46                     ` Andy Lutomirski
2014-11-20  1:09                       ` Linus Torvalds
2014-11-20  1:11                         ` Andy Lutomirski
2014-11-20 10:28                       ` Borislav Petkov
2014-11-20 23:25                         ` Andy Lutomirski
2014-11-20  7:45                   ` Ingo Molnar
2016-02-29  5:27         ` Andy Lutomirski
2016-05-24  1:23           ` Andy Lutomirski
2016-05-24  1:48             ` Linus Torvalds
2016-05-24  2:09               ` Andy Lutomirski
2016-05-24  2:16                 ` Linus Torvalds
2014-11-19 18:29 ` [PATCH v3 0/3] Handle IST interrupts from userspace on the normal stack Luck, Tony
2014-11-19 22:15 ` [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks Luck, Tony
